Diffstat (limited to 'tools/memory-model/Documentation')
-rw-r--r--   tools/memory-model/Documentation/README                      76
-rw-r--r--   tools/memory-model/Documentation/access-marking.txt         598
-rw-r--r--   tools/memory-model/Documentation/cheatsheet.txt              33
-rw-r--r--   tools/memory-model/Documentation/control-dependencies.txt   258
-rw-r--r--   tools/memory-model/Documentation/explanation.txt            430
-rw-r--r--   tools/memory-model/Documentation/glossary.txt               178
-rw-r--r--   tools/memory-model/Documentation/litmus-tests.txt          1083
-rw-r--r--   tools/memory-model/Documentation/locking.txt                298
-rw-r--r--   tools/memory-model/Documentation/ordering.txt               556
-rw-r--r--   tools/memory-model/Documentation/recipes.txt                  6
-rw-r--r--   tools/memory-model/Documentation/references.txt              23
-rw-r--r--   tools/memory-model/Documentation/simple.txt                 270
12 files changed, 3702 insertions, 107 deletions
diff --git a/tools/memory-model/Documentation/README b/tools/memory-model/Documentation/README
new file mode 100644
index 000000000000..db90a26dbdf4
--- /dev/null
+++ b/tools/memory-model/Documentation/README
@@ -0,0 +1,76 @@
+It has been said that successful communication requires first identifying
+what your audience knows and then building a bridge from their current
+knowledge to what they need to know. Unfortunately, the expected
+Linux-kernel memory model (LKMM) audience might be anywhere from novice
+to expert both in kernel hacking and in understanding LKMM.
+
+This document therefore points out a number of places to start reading,
+depending on what you know and what you would like to learn. Please note
+that the documents later in this list assume that the reader understands
+the material provided by documents earlier in this list.
+
+o You are new to Linux-kernel concurrency: simple.txt
+
+o You have some background in Linux-kernel concurrency, and would
+ like an overview of the types of low-level concurrency primitives
+ that the Linux kernel provides: ordering.txt
+
+ Here, "low level" means atomic operations to single variables.
+
+o You are familiar with the Linux-kernel concurrency primitives
+ that you need, and just want to get started with LKMM litmus
+ tests: litmus-tests.txt
+
+o You are familiar with Linux-kernel concurrency, and would
+ like a detailed intuitive understanding of LKMM, including
+ situations involving more than two threads: recipes.txt
+
+o You would like a detailed understanding of what your compiler can
+ and cannot do to control dependencies: control-dependencies.txt
+
+o You are familiar with Linux-kernel concurrency and the use of
+ LKMM, and would like a quick reference: cheatsheet.txt
+
+o You are familiar with Linux-kernel concurrency and the use
+ of LKMM, and would like to learn about LKMM's requirements,
+ rationale, and implementation: explanation.txt
+
+o You are interested in the publications related to LKMM, including
+ hardware manuals, academic literature, standards-committee
+ working papers, and LWN articles: references.txt
+
+
+====================
+DESCRIPTION OF FILES
+====================
+
+README
+ This file.
+
+cheatsheet.txt
+ Quick-reference guide to the Linux-kernel memory model.
+
+control-dependencies.txt
+ Guide to preventing compiler optimizations from destroying
+ your control dependencies.
+
+explanation.txt
+ Detailed description of the memory model.
+
+litmus-tests.txt
+ The format, features, capabilities, and limitations of the litmus
+ tests that LKMM can evaluate.
+
+ordering.txt
+ Overview of the Linux kernel's low-level memory-ordering
+ primitives by category.
+
+recipes.txt
+ Common memory-ordering patterns.
+
+references.txt
+ Background information.
+
+simple.txt
+ Starting point for someone new to Linux-kernel concurrency.
+ And also a reminder of the simpler approaches to concurrency!
diff --git a/tools/memory-model/Documentation/access-marking.txt b/tools/memory-model/Documentation/access-marking.txt
new file mode 100644
index 000000000000..65778222183e
--- /dev/null
+++ b/tools/memory-model/Documentation/access-marking.txt
@@ -0,0 +1,598 @@
+MARKING SHARED-MEMORY ACCESSES
+==============================
+
+This document provides guidelines for marking intentionally concurrent
+normal accesses to shared memory, that is, "normal" as in accesses that do
+not use read-modify-write atomic operations. It also describes how to
+document these accesses, both with comments and with special assertions
+processed by the Kernel Concurrency Sanitizer (KCSAN). This discussion
+builds on an earlier LWN article [1].
+
+
+ACCESS-MARKING OPTIONS
+======================
+
+The Linux kernel provides the following access-marking options:
+
+1. Plain C-language accesses (unmarked), for example, "a = b;"
+
+2. Data-race marking, for example, "data_race(a = b);"
+
+3. READ_ONCE(), for example, "a = READ_ONCE(b);"
+ The various forms of atomic_read() also fit in here.
+
+4. WRITE_ONCE(), for example, "WRITE_ONCE(a, b);"
+ The various forms of atomic_set() also fit in here.
+
+
+These may be used in combination, as shown in this admittedly improbable
+example:
+
+ WRITE_ONCE(a, b + data_race(c + d) + READ_ONCE(e));
+
+Neither plain C-language accesses nor data_race() (#1 and #2 above) place
+any sort of constraint on the compiler's choice of optimizations [2].
+In contrast, READ_ONCE() and WRITE_ONCE() (#3 and #4 above) restrict the
+compiler's use of code-motion and common-subexpression optimizations.
+Therefore, if a given access is involved in an intentional data race,
+using READ_ONCE() for loads and WRITE_ONCE() for stores is usually
+preferable to data_race(), which in turn is usually preferable to plain
+C-language accesses. It is permissible to combine #2 and #3, for example,
+data_race(READ_ONCE(a)), which will both restrict compiler optimizations
+and disable KCSAN diagnostics.
+
+KCSAN will complain about many types of data races involving plain
+C-language accesses, but marking all accesses involved in a given data
+race with one of data_race(), READ_ONCE(), or WRITE_ONCE(), will prevent
+KCSAN from complaining. Of course, lack of KCSAN complaints does not
+imply correct code. Therefore, please take a thoughtful approach
+when responding to KCSAN complaints. Churning the code base with
+ill-considered additions of data_race(), READ_ONCE(), and WRITE_ONCE()
+is unhelpful.
+
+In fact, the following sections describe situations where use of
+data_race() and even plain C-language accesses is preferable to
+READ_ONCE() and WRITE_ONCE().
+
+
+Use of the data_race() Macro
+----------------------------
+
+Here are some situations where data_race() should be used instead of
+READ_ONCE() and WRITE_ONCE():
+
+1. Data-racy loads from shared variables whose values are used only
+ for diagnostic purposes.
+
+2. Data-racy reads whose values are checked against marked reload.
+
+3. Reads whose values feed into error-tolerant heuristics.
+
+4. Writes setting values that feed into error-tolerant heuristics.
+
+
+Data-Racy Reads for Approximate Diagnostics
+
+Approximate diagnostics include lockdep reports, monitoring/statistics
+(including /proc and /sys output), WARN*()/BUG*() checks whose return
+values are ignored, and other situations where reads from shared variables
+are not an integral part of the core concurrency design.
+
+In fact, use of data_race() instead of READ_ONCE() for these diagnostic
+reads can enable better checking of the remaining accesses implementing
+the core concurrency design. For example, suppose that the core design
+prevents any non-diagnostic reads from shared variable x from running
+concurrently with updates to x. Then using plain C-language writes
+to x allows KCSAN to detect reads from x from within regions of code
+that fail to exclude the updates. In this case, it is important to use
+data_race() for the diagnostic reads because otherwise KCSAN would give
+false-positive warnings about these diagnostic reads.
+
+If it is necessary to both restrict compiler optimizations and disable
+KCSAN diagnostics, use both data_race() and READ_ONCE(), for example,
+data_race(READ_ONCE(a)).
+
+In theory, plain C-language loads can also be used for this use case.
+However, in practice this will have the disadvantage of causing KCSAN
+to generate false positives because KCSAN will have no way of knowing
+that the resulting data race was intentional.
+
+
+Data-Racy Reads That Are Checked Against Marked Reload
+
+The values from some reads are not implicitly trusted. They are instead
+fed into some operation that checks the full value against a later marked
+load from memory, which means that the occasional arbitrarily bogus value
+is not a problem. For example, if a bogus value is fed into cmpxchg(),
+all that happens is that this cmpxchg() fails, which normally results
+in a retry. Unless the race condition that resulted in the bogus value
+recurs, this retry will with high probability succeed, so no harm done.
+
+However, please keep in mind that a data_race() load feeding into
+a cmpxchg_relaxed() might still be subject to load fusing on some
+architectures. Therefore, it is best to capture the return value from
+the failing cmpxchg() for the next iteration of the loop, an approach
+that provides the compiler much less scope for mischievous optimizations.
+Capturing the return value from cmpxchg() also saves a memory reference
+in many cases.
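+
+For example, here is a minimal sketch of such a loop, in which a
+data_race() load merely primes a cmpxchg() whose return value is then
+reused on each failed attempt (the shared variable "foo" and the
+increment computation are purely illustrative):
+
+	int foo;	/* Shared variable. */
+
+	void increment_foo(void)
+	{
+		int old, new, newold;
+
+		newold = data_race(foo);	/* Bogus value OK: checked by cmpxchg(). */
+		do {
+			old = newold;
+			new = old + 1;
+			newold = cmpxchg(&foo, old, new); /* Reused if this attempt fails. */
+		} while (newold != old);
+	}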
+
+In theory, plain C-language loads can also be used for this use case.
+However, in practice this will have the disadvantage of causing KCSAN
+to generate false positives because KCSAN will have no way of knowing
+that the resulting data race was intentional.
+
+
+Reads Feeding Into Error-Tolerant Heuristics
+
+Values from some reads feed into heuristics that can tolerate occasional
+errors. Such reads can use data_race(), thus allowing KCSAN to focus on
+the other accesses to the relevant shared variables. But please note
+that data_race() loads are subject to load fusing, which can result in
+consistent errors, which in turn are quite capable of breaking heuristics.
+Therefore use of data_race() should be limited to cases where some other
+code (such as a barrier() call) will force the occasional reload.
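+
+For example, here is a minimal sketch in which the heuristic tolerates
+a stale or mangled value and an explicit barrier() forces the occasional
+reload (the variables "nmisses" and "shutting_down", the limit, and
+rebalance() are hypothetical):
+
+	while (!READ_ONCE(shutting_down)) {
+		if (data_race(nmisses) > NMISSES_LIMIT)	/* Heuristic only. */
+			rebalance();
+		barrier();	/* Force the occasional reload of nmisses. */
+	}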
+
+Note that this use case requires that the heuristic be able to handle
+any possible error. In contrast, if the heuristics might be fatally
+confused by one or more of the possible erroneous values, use READ_ONCE()
+instead of data_race().
+
+In theory, plain C-language loads can also be used for this use case.
+However, in practice this will have the disadvantage of causing KCSAN
+to generate false positives because KCSAN will have no way of knowing
+that the resulting data race was intentional.
+
+
+Writes Setting Values Feeding Into Error-Tolerant Heuristics
+
+The values read into error-tolerant heuristics come from somewhere,
+for example, from sysfs. This means that some code in sysfs writes
+to this same variable, and these writes can also use data_race().
+After all, if the heuristic can tolerate the occasional bogus value
+due to compiler-mangled reads, it can also tolerate the occasional
+compiler-mangled write, at least assuming that the proper value is in
+place once the write completes.
+
+Plain C-language stores can also be used for this use case. However,
+in kernels built with CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n, this
+will have the disadvantage of causing KCSAN to generate false positives
+because KCSAN will have no way of knowing that the resulting data race
+was intentional.
+
+
+Use of Plain C-Language Accesses
+--------------------------------
+
+Here are some example situations where plain C-language accesses should
+be used instead of READ_ONCE(), WRITE_ONCE(), and data_race():
+
+1. Accesses protected by mutual exclusion, including strict locking
+ and sequence locking.
+
+2. Initialization-time and cleanup-time accesses. This covers a
+ wide variety of situations, including the uniprocessor phase of
+ system boot, variables to be used by not-yet-spawned kthreads,
+ structures not yet published to reference-counted or RCU-protected
+ data structures, and the cleanup side of any of these situations.
+
+3. Per-CPU variables that are not accessed from other CPUs.
+
+4. Private per-task variables, including on-stack variables, some
+ fields in the task_struct structure, and task-private heap data.
+
+5. Any other loads for which there is not supposed to be a concurrent
+ store to that same variable.
+
+6. Any other stores for which there should be neither concurrent
+ loads nor concurrent stores to that same variable.
+
+ But note that KCSAN makes three explicit exceptions to this rule
+ by default, refraining from flagging plain C-language stores:
+
+ a. No matter what. You can override this default by building
+ with CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n.
+
+ b. When the store writes the value already contained in
+ that variable. You can override this default by building
+ with CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n.
+
+ c. When one of the stores is in an interrupt handler and
+ the other in the interrupted code. You can override this
+ default by building with CONFIG_KCSAN_INTERRUPT_WATCHER=y.
+
+Note that it is important to use plain C-language accesses in these cases,
+because doing otherwise prevents KCSAN from detecting violations of your
+code's synchronization rules.
+
+
+ACCESS-DOCUMENTATION OPTIONS
+============================
+
+It is important to comment marked accesses so that people reading your
+code, yourself included, are reminded of the synchronization design.
+However, it is even more important to comment plain C-language accesses
+that are intentionally involved in data races. Such comments are
+needed to remind people reading your code, again, yourself included,
+of how the compiler has been prevented from optimizing those accesses
+into concurrency bugs.
+
+It is also possible to tell KCSAN about your synchronization design.
+For example, ASSERT_EXCLUSIVE_ACCESS(foo) tells KCSAN that any
+concurrent access to variable foo by any other CPU is an error, even
+if that concurrent access is marked with READ_ONCE(). In addition,
+ASSERT_EXCLUSIVE_WRITER(foo) tells KCSAN that although it is OK for there
+to be concurrent reads from foo from other CPUs, it is an error for some
+other CPU to be concurrently writing to foo, even if that concurrent
+write is marked with data_race() or WRITE_ONCE().
+
+Note that although KCSAN will call out data races involving either
+ASSERT_EXCLUSIVE_ACCESS() or ASSERT_EXCLUSIVE_WRITER() on the one hand
+and data_race() writes on the other, KCSAN will not report the location
+of these data_race() writes.
+
+
+EXAMPLES
+========
+
+As noted earlier, the goal is to prevent the compiler from destroying
+your concurrent algorithm, to help the human reader, and to inform
+KCSAN of aspects of your concurrency design. This section looks at a
+few examples showing how this can be done.
+
+
+Lock Protection With Lockless Diagnostic Access
+-----------------------------------------------
+
+For example, suppose a shared variable "foo" is read only while a
+reader-writer spinlock is read-held, written only while that same
+spinlock is write-held, except that it is also read locklessly for
+diagnostic purposes. The code might look as follows:
+
+ int foo;
+ DEFINE_RWLOCK(foo_rwlock);
+
+ void update_foo(int newval)
+ {
+ write_lock(&foo_rwlock);
+ foo = newval;
+ do_something(newval);
+ write_unlock(&foo_rwlock);
+ }
+
+ int read_foo(void)
+ {
+ int ret;
+
+ read_lock(&foo_rwlock);
+ do_something_else();
+ ret = foo;
+ read_unlock(&foo_rwlock);
+ return ret;
+ }
+
+ void read_foo_diagnostic(void)
+ {
+ pr_info("Current value of foo: %d\n", data_race(foo));
+ }
+
+The reader-writer lock prevents the compiler from introducing concurrency
+bugs into any part of the main algorithm using foo, which means that
+the accesses to foo within both update_foo() and read_foo() can (and
+should) be plain C-language accesses. One benefit of making them be
+plain C-language accesses is that KCSAN can detect any erroneous lockless
+reads from or updates to foo. The data_race() in read_foo_diagnostic()
+tells KCSAN that data races are expected, and should be silently
+ignored. This data_race() also tells the human reading the code that
+read_foo_diagnostic() might sometimes return a bogus value.
+
+If it is necessary to suppress compiler optimization and also detect
+buggy lockless writes, read_foo_diagnostic() can be updated as follows:
+
+ void read_foo_diagnostic(void)
+ {
+ pr_info("Current value of foo: %d\n", data_race(READ_ONCE(foo)));
+ }
+
+Alternatively, given that KCSAN is to ignore all accesses in this function,
+this function can be marked __no_kcsan and the data_race() can be dropped:
+
+ void __no_kcsan read_foo_diagnostic(void)
+ {
+ pr_info("Current value of foo: %d\n", READ_ONCE(foo));
+ }
+
+However, in order for KCSAN to detect buggy lockless writes, your kernel
+must be built with CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n. If you
+need KCSAN to detect such a write even if that write did not change
+the value of foo, you also need CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n.
+If you need KCSAN to detect such a write happening in an interrupt handler
+running on the same CPU doing the legitimate lock-protected write, you
+also need CONFIG_KCSAN_INTERRUPT_WATCHER=y. With some or all of these
+Kconfig options set properly, KCSAN can be quite helpful, although
+it is not necessarily a full replacement for hardware watchpoints.
+On the other hand, neither are hardware watchpoints a full replacement
+for KCSAN because it is not always easy to tell a hardware watchpoint
+to conditionally trap on accesses.
+
+
+Lock-Protected Writes With Lockless Reads
+-----------------------------------------
+
+For another example, suppose a shared variable "foo" is updated only
+while holding a spinlock, but is read locklessly. The code might look
+as follows:
+
+ int foo;
+ DEFINE_SPINLOCK(foo_lock);
+
+ void update_foo(int newval)
+ {
+ spin_lock(&foo_lock);
+ WRITE_ONCE(foo, newval);
+ ASSERT_EXCLUSIVE_WRITER(foo);
+ do_something(newval);
+ spin_unlock(&foo_lock);
+ }
+
+ int read_foo(void)
+ {
+ do_something_else();
+ return READ_ONCE(foo);
+ }
+
+Because foo is read locklessly, all accesses are marked. The purpose
+of the ASSERT_EXCLUSIVE_WRITER() is to allow KCSAN to check for a buggy
+concurrent lockless write.
+
+
+Lock-Protected Writes With Heuristic Lockless Reads
+---------------------------------------------------
+
+For another example, suppose that the code can normally make use of
+a per-data-structure lock, but there are times when a global lock
+is required. These times are indicated via a global flag. The code
+might look as follows, and is based loosely on nf_conntrack_lock(),
+nf_conntrack_all_lock(), and nf_conntrack_all_unlock():
+
+ bool global_flag;
+ DEFINE_SPINLOCK(global_lock);
+ struct foo {
+ spinlock_t f_lock;
+ int f_data;
+ };
+
+ /* All foo structures are in the following array. */
+ int nfoo;
+ struct foo *foo_array;
+
+ void do_something_locked(struct foo *fp)
+ {
+ /* This works even if data_race() returns nonsense. */
+ if (!data_race(global_flag)) {
+ spin_lock(&fp->f_lock);
+ if (!smp_load_acquire(&global_flag)) {
+ do_something(fp);
+ spin_unlock(&fp->f_lock);
+ return;
+ }
+ spin_unlock(&fp->f_lock);
+ }
+ spin_lock(&global_lock);
+ /* global_lock held, thus global_flag cannot be set. */
+ spin_lock(&fp->f_lock);
+ spin_unlock(&global_lock);
+ /*
+ * global_flag might be set here, but begin_global()
+ * will wait for ->f_lock to be released.
+ */
+ do_something(fp);
+ spin_unlock(&fp->f_lock);
+ }
+
+ void begin_global(void)
+ {
+ int i;
+
+ spin_lock(&global_lock);
+ WRITE_ONCE(global_flag, true);
+ for (i = 0; i < nfoo; i++) {
+ /*
+ * Wait for pre-existing local locks. One at
+ * a time to avoid lockdep limitations.
+ */
+ spin_lock(&foo_array[i].f_lock);
+ spin_unlock(&foo_array[i].f_lock);
+ }
+ }
+
+ void end_global(void)
+ {
+ smp_store_release(&global_flag, false);
+ spin_unlock(&global_lock);
+ }
+
+All code paths leading from the do_something_locked() function's first
+read from global_flag acquire a lock, so endless load fusing cannot
+happen.
+
+If the value read from global_flag is false, then global_flag is
+rechecked while holding ->f_lock.  If this recheck also sees false,
+then holding ->f_lock prevents begin_global() from completing, so it
+is safe to invoke do_something().
+
+Otherwise, if either value read from global_flag is true, then after
+global_lock is acquired global_flag must be false. The acquisition of
+->f_lock will prevent any call to begin_global() from returning, which
+means that it is safe to release global_lock and invoke do_something().
+
+For this to work, only those foo structures in foo_array[] may be passed
+to do_something_locked(). The reason for this is that the synchronization
+with begin_global() relies on momentarily holding the lock of each and
+every foo structure.
+
+The smp_load_acquire() and smp_store_release() are required because
+changes to a foo structure between calls to begin_global() and
+end_global() are carried out without holding that structure's ->f_lock.
+The smp_load_acquire() and smp_store_release() ensure that the next
+invocation of do_something() from do_something_locked() will see those
+changes.
+
+
+Lockless Reads and Writes
+-------------------------
+
+For another example, suppose a shared variable "foo" is both read and
+updated locklessly. The code might look as follows:
+
+ int foo;
+
+ int update_foo(int newval)
+ {
+ int ret;
+
+ ret = xchg(&foo, newval);
+ do_something(newval);
+ return ret;
+ }
+
+ int read_foo(void)
+ {
+ do_something_else();
+ return READ_ONCE(foo);
+ }
+
+Because foo is accessed locklessly, all accesses are marked. It does
+not make sense to use ASSERT_EXCLUSIVE_WRITER() in this case because
+there really can be concurrent lockless writers. KCSAN would
+flag any concurrent plain C-language reads from foo, and given
+CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n, also any concurrent plain
+C-language writes to foo.
+
+
+Lockless Reads and Writes, But With Single-Threaded Initialization
+------------------------------------------------------------------
+
+For yet another example, suppose that foo is initialized in a
+single-threaded manner, but that a number of kthreads are then created
+that locklessly and concurrently access foo. Some snippets of this code
+might look as follows:
+
+ int foo;
+
+ void initialize_foo(int initval, int nkthreads)
+ {
+ int i;
+
+ foo = initval;
+ ASSERT_EXCLUSIVE_ACCESS(foo);
+ for (i = 0; i < nkthreads; i++)
+ kthread_run(access_foo_concurrently, ...);
+ }
+
+ /* Called from access_foo_concurrently(). */
+ int update_foo(int newval)
+ {
+ int ret;
+
+ ret = xchg(&foo, newval);
+ do_something(newval);
+ return ret;
+ }
+
+ /* Also called from access_foo_concurrently(). */
+ int read_foo(void)
+ {
+ do_something_else();
+ return READ_ONCE(foo);
+ }
+
+The initialize_foo() uses a plain C-language write to foo because there
+are not supposed to be concurrent accesses during initialization. The
+ASSERT_EXCLUSIVE_ACCESS() call allows KCSAN to flag buggy concurrent
+unmarked reads, and further allows it to flag buggy concurrent writes,
+even if: (1) Those writes are marked or
+(2) The kernel was built with CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y.
+
+
+Checking Stress-Test Race Coverage
+----------------------------------
+
+When designing stress tests it is important to ensure that race conditions
+of interest really do occur. For example, consider the following code
+fragment:
+
+ int foo;
+
+ int update_foo(int newval)
+ {
+ return xchg(&foo, newval);
+ }
+
+ int xor_shift_foo(int shift, int mask)
+ {
+ int old, new, newold;
+
+ newold = data_race(foo); /* Checked by cmpxchg(). */
+ do {
+ old = newold;
+ new = (old << shift) ^ mask;
+ newold = cmpxchg(&foo, old, new);
+ } while (newold != old);
+ return old;
+ }
+
+ int read_foo(void)
+ {
+ return READ_ONCE(foo);
+ }
+
+If it is possible for update_foo(), xor_shift_foo(), and read_foo() to be
+invoked concurrently, the stress test should force this concurrency to
+actually happen. KCSAN can evaluate the stress test when the above code
+is modified to read as follows:
+
+ int foo;
+
+ int update_foo(int newval)
+ {
+ ASSERT_EXCLUSIVE_ACCESS(foo);
+ return xchg(&foo, newval);
+ }
+
+ int xor_shift_foo(int shift, int mask)
+ {
+ int old, new, newold;
+
+ newold = data_race(foo); /* Checked by cmpxchg(). */
+ do {
+ old = newold;
+ new = (old << shift) ^ mask;
+ ASSERT_EXCLUSIVE_ACCESS(foo);
+ newold = cmpxchg(&foo, old, new);
+ } while (newold != old);
+ return old;
+ }
+
+ int read_foo(void)
+ {
+ ASSERT_EXCLUSIVE_ACCESS(foo);
+ return READ_ONCE(foo);
+ }
+
+If a given stress-test run does not result in KCSAN complaints from
+each possible pair of ASSERT_EXCLUSIVE_ACCESS() invocations, the
+stress test needs improvement. If the stress test were to be evaluated
+on a regular basis, it would be wise to place the above instances of
+ASSERT_EXCLUSIVE_ACCESS() under #ifdef so that they did not result in
+false positives when not evaluating the stress test.
+
+
+REFERENCES
+==========
+
+[1] "Concurrency bugs should fear the big bad data-race detector (part 2)"
+ https://lwn.net/Articles/816854/
+
+[2] "Who's afraid of a big bad optimizing compiler?"
+ https://lwn.net/Articles/793253/
diff --git a/tools/memory-model/Documentation/cheatsheet.txt b/tools/memory-model/Documentation/cheatsheet.txt
index 33ba98d72b16..99d00870b160 100644
--- a/tools/memory-model/Documentation/cheatsheet.txt
+++ b/tools/memory-model/Documentation/cheatsheet.txt
@@ -3,9 +3,9 @@
C Self R W RMW Self R W DR DW RMW SV
-- ---- - - --- ---- - - -- -- --- --
-Store, e.g., WRITE_ONCE() Y Y
-Load, e.g., READ_ONCE() Y Y Y Y
-Unsuccessful RMW operation Y Y Y Y
+Relaxed store Y Y
+Relaxed load Y Y Y Y
+Relaxed RMW operation Y Y Y Y
rcu_dereference() Y Y Y Y
Successful *_acquire() R Y Y Y Y Y Y
Successful *_release() C Y Y Y W Y
@@ -17,14 +17,19 @@ smp_mb__before_atomic() CP Y Y Y a a a a Y
smp_mb__after_atomic() CP a a Y Y Y Y Y Y
-Key: C: Ordering is cumulative
- P: Ordering propagates
- R: Read, for example, READ_ONCE(), or read portion of RMW
- W: Write, for example, WRITE_ONCE(), or write portion of RMW
- Y: Provides ordering
- a: Provides ordering given intervening RMW atomic operation
- DR: Dependent read (address dependency)
- DW: Dependent write (address, data, or control dependency)
- RMW: Atomic read-modify-write operation
- SELF: Orders self, as opposed to accesses before and/or after
- SV: Orders later accesses to the same variable
+Key: Relaxed: A relaxed operation is either READ_ONCE(), WRITE_ONCE(),
+ a *_relaxed() RMW operation, an unsuccessful RMW
+ operation, a non-value-returning RMW operation such
+ as atomic_inc(), or one of the atomic*_read() and
+ atomic*_set() family of operations.
+ C: Ordering is cumulative
+ P: Ordering propagates
+ R: Read, for example, READ_ONCE(), or read portion of RMW
+ W: Write, for example, WRITE_ONCE(), or write portion of RMW
+ Y: Provides ordering
+ a: Provides ordering given intervening RMW atomic operation
+ DR: Dependent read (address dependency)
+ DW: Dependent write (address, data, or control dependency)
+ RMW: Atomic read-modify-write operation
+ SELF: Orders self, as opposed to accesses before and/or after
+ SV: Orders later accesses to the same variable
diff --git a/tools/memory-model/Documentation/control-dependencies.txt b/tools/memory-model/Documentation/control-dependencies.txt
new file mode 100644
index 000000000000..8b743d20fe27
--- /dev/null
+++ b/tools/memory-model/Documentation/control-dependencies.txt
@@ -0,0 +1,258 @@
+CONTROL DEPENDENCIES
+====================
+
+A major difficulty with control dependencies is that current compilers
+do not support them. One purpose of this document is therefore to
+help you prevent your compiler from breaking your code. However,
+control dependencies also pose other challenges, which leads to the
+second purpose of this document, namely to help you to avoid breaking
+your own code, even in the absence of help from your compiler.
+
+One such challenge is that control dependencies order only later stores.
+Therefore, a load-load control dependency will not preserve ordering
+unless a read memory barrier is provided. Consider the following code:
+
+ q = READ_ONCE(a);
+ if (q)
+ p = READ_ONCE(b);
+
+This is not guaranteed to provide any ordering because some types of CPUs
+are permitted to predict the result of the load from "b". This prediction
+can cause other CPUs to see this load as having happened before the load
+from "a". This means that an explicit read barrier is required, for example
+as follows:
+
+ q = READ_ONCE(a);
+ if (q) {
+ smp_rmb();
+ p = READ_ONCE(b);
+ }
+
+However, stores are not speculated. This means that ordering is
+(usually) guaranteed for load-store control dependencies, as in the
+following example:
+
+ q = READ_ONCE(a);
+ if (q)
+ WRITE_ONCE(b, 1);
+
+Control dependencies can pair with each other and with other types
+of ordering. But please note that neither the READ_ONCE() nor the
+WRITE_ONCE() are optional. Without the READ_ONCE(), the compiler might
+fuse the load from "a" with other loads. Without the WRITE_ONCE(),
+the compiler might fuse the store to "b" with other stores. Worse yet,
+the compiler might convert the store into a load and a check followed
+by a store, and this compiler-generated load would not be ordered by
+the control dependency.
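+
+For example, with a plain store to "b", one transformation the compiler
+might (in effect) apply is to check the old value first, introducing a
+load from "b" that the control dependency does not order:
+
+	q = READ_ONCE(a);
+	if (q) {
+		if (b != 1)	/* Compiler-introduced load from "b"... */
+			b = 1;	/* ...so this store need not be ordered. */
+	}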
+
+Furthermore, if the compiler is able to prove that the value of variable
+"a" is always non-zero, it would be well within its rights to optimize
+the original example by eliminating the "if" statement as follows:
+
+ q = a;
+ b = 1; /* BUG: Compiler and CPU can both reorder!!! */
+
+So don't leave out either the READ_ONCE() or the WRITE_ONCE().
+In particular, although READ_ONCE() does force the compiler to emit a
+load, it does *not* force the compiler to actually use the loaded value.
+
+It is tempting to try to use control dependencies to enforce ordering on
+identical stores on both branches of the "if" statement as follows:
+
+ q = READ_ONCE(a);
+ if (q) {
+ barrier();
+ WRITE_ONCE(b, 1);
+ do_something();
+ } else {
+ barrier();
+ WRITE_ONCE(b, 1);
+ do_something_else();
+ }
+
+Unfortunately, current compilers will transform this as follows at high
+optimization levels:
+
+ q = READ_ONCE(a);
+ barrier();
+ WRITE_ONCE(b, 1); /* BUG: No ordering vs. load from a!!! */
+ if (q) {
+ /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
+ do_something();
+ } else {
+ /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
+ do_something_else();
+ }
+
+Now there is no conditional between the load from "a" and the store to
+"b", which means that the CPU is within its rights to reorder them: The
+conditional is absolutely required, and must be present in the final
+assembly code, after all of the compiler and link-time optimizations
+have been applied. Therefore, if you need ordering in this example,
+you must use explicit memory ordering, for example, smp_store_release():
+
+ q = READ_ONCE(a);
+ if (q) {
+ smp_store_release(&b, 1);
+ do_something();
+ } else {
+ smp_store_release(&b, 1);
+ do_something_else();
+ }
+
+Without explicit memory ordering, control-dependency-based ordering is
+guaranteed only when the stores differ, for example:
+
+ q = READ_ONCE(a);
+ if (q) {
+ WRITE_ONCE(b, 1);
+ do_something();
+ } else {
+ WRITE_ONCE(b, 2);
+ do_something_else();
+ }
+
+The initial READ_ONCE() is still required to prevent the compiler from
+knowing too much about the value of "a".
+
+But please note that you need to be careful what you do with the local
+variable "q", otherwise the compiler might be able to guess the value
+and again remove the conditional branch that is absolutely required to
+preserve ordering. For example:
+
+ q = READ_ONCE(a);
+ if (q % MAX) {
+ WRITE_ONCE(b, 1);
+ do_something();
+ } else {
+ WRITE_ONCE(b, 2);
+ do_something_else();
+ }
+
+If MAX is compile-time defined to be 1, then the compiler knows that
+(q % MAX) must be equal to zero, regardless of the value of "q".
+The compiler is therefore within its rights to transform the above code
+into the following:
+
+ q = READ_ONCE(a);
+ WRITE_ONCE(b, 2);
+ do_something_else();
+
+Given this transformation, the CPU is not required to respect the ordering
+between the load from variable "a" and the store to variable "b". It is
+tempting to add a barrier(), but this does not help. The conditional
+is gone, and the barrier won't bring it back. Therefore, if you need
+to rely on control dependencies to produce this ordering, you should
+make sure that MAX is greater than one, perhaps as follows:
+
+ q = READ_ONCE(a);
+ BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
+ if (q % MAX) {
+ WRITE_ONCE(b, 1);
+ do_something();
+ } else {
+ WRITE_ONCE(b, 2);
+ do_something_else();
+ }
+
+Please note once again that each leg of the "if" statement absolutely
+must store different values to "b". As in previous examples, if the two
+values were identical, the compiler could pull this store outside of the
+"if" statement, destroying the control dependency's ordering properties.
+
+You must also be careful to avoid relying too much on boolean short-circuit
+evaluation. Consider this example:
+
+ q = READ_ONCE(a);
+ if (q || 1 > 0)
+ WRITE_ONCE(b, 1);
+
+Because the first condition cannot fault and the second condition is
+always true, the compiler can transform this example as follows, again
+destroying the control dependency's ordering:
+
+ q = READ_ONCE(a);
+ WRITE_ONCE(b, 1);
+
+This is yet another example showing the importance of preventing the
+compiler from out-guessing your code. Again, although READ_ONCE() really
+does force the compiler to emit code for a given load, the compiler is
+within its rights to discard the loaded value.
+
+In addition, control dependencies apply only to the then-clause and
+else-clause of the "if" statement in question. In particular, they do
+not necessarily order the code following the entire "if" statement:
+
+ q = READ_ONCE(a);
+ if (q) {
+ WRITE_ONCE(b, 1);
+ } else {
+ WRITE_ONCE(b, 2);
+ }
+ WRITE_ONCE(c, 1); /* BUG: No ordering against the read from "a". */
+
+It is tempting to argue that there in fact is ordering because the
+compiler cannot reorder volatile accesses and also cannot reorder
+the writes to "b" with the condition. Unfortunately for this line
+of reasoning, the compiler might compile the two writes to "b" as
+conditional-move instructions, as in this fanciful pseudo-assembly
+language:
+
+ ld r1,a
+ cmp r1,$0
+ cmov,ne r4,$1
+ cmov,eq r4,$2
+ st r4,b
+ st $1,c
+
+The control dependencies would then extend only to the pair of cmov
+instructions and the store depending on them. This means that a weakly
+ordered CPU would have no dependency of any sort between the load from
+"a" and the store to "c". In short, control dependencies provide ordering
+only to the stores in the then-clause and else-clause of the "if" statement
+in question (including functions invoked by those two clauses), and not
+to code following that "if" statement.
+
+
+In summary:
+
+ (*) Control dependencies can order prior loads against later stores.
+ However, they do *not* guarantee any other sort of ordering:
+ Not prior loads against later loads, nor prior stores against
+ later anything. If you need these other forms of ordering, use
+ smp_load_acquire(), smp_store_release(), or, in the case of prior
+ stores and later loads, smp_mb().
+
+ (*) If both legs of the "if" statement contain identical stores to
+ the same variable, then you must explicitly order those stores,
+ either by preceding both of them with smp_mb() or by using
+ smp_store_release(). Please note that it is *not* sufficient to use
+ barrier() at beginning and end of each leg of the "if" statement
+ because, as shown by the example above, optimizing compilers can
+ destroy the control dependency while respecting the letter of the
+ barrier() law.
+
+ (*) Control dependencies require at least one run-time conditional
+ between the prior load and the subsequent store, and this
+ conditional must involve the prior load. If the compiler is able
+ to optimize the conditional away, it will have also optimized
+ away the ordering. Careful use of READ_ONCE() and WRITE_ONCE()
+ can help to preserve the needed conditional.
+
+ (*) Control dependencies require that the compiler avoid reordering the
+ dependency into nonexistence. Careful use of READ_ONCE() or
+ atomic{,64}_read() can help to preserve your control dependency.
+
+ (*) Control dependencies apply only to the then-clause and else-clause
+ of the "if" statement containing the control dependency, including
+ any functions that these two clauses call. Control dependencies
+ do *not* apply to code beyond the end of that "if" statement.
+
+ (*) Control dependencies pair normally with other types of barriers.
+
+ (*) Control dependencies do *not* provide multicopy atomicity. If you
+ need all the CPUs to agree on the ordering of a given store against
+ all other accesses, use smp_mb().
+
+ (*) Compilers do not understand control dependencies. It is therefore
+ your job to ensure that they do not break your code.
diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index e91a2eb19592..6dc8b3642458 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -28,9 +28,10 @@ Explanation of the Linux-Kernel Memory Consistency Model
20. THE HAPPENS-BEFORE RELATION: hb
21. THE PROPAGATES-BEFORE RELATION: pb
22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
- 23. LOCKING
- 24. PLAIN ACCESSES AND DATA RACES
- 25. ODDS AND ENDS
+ 23. SRCU READ-SIDE CRITICAL SECTIONS
+ 24. LOCKING
+ 25. PLAIN ACCESSES AND DATA RACES
+ 26. ODDS AND ENDS
@@ -464,9 +465,10 @@ to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.
-Finally, a read event and another memory access event are linked by a
-control dependency if the value obtained by the read affects whether
-the second event is executed at all. Simple example:
+Finally, a read event X and a write event Y are linked by a control
+dependency if Y syntactically lies within an arm of an if statement and
+X affects the evaluation of the if condition via a data or address
+dependency (or similarly for a switch statement). Simple example:
int x, y;
@@ -485,6 +487,57 @@ have R ->po X. It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!
+Here's a trick question: When is a dependency not a dependency? Answer:
+When it is purely syntactic rather than semantic. We say a dependency
+between two accesses is purely syntactic if the second access doesn't
+actually depend on the result of the first. Here is a trivial example:
+
+ r1 = READ_ONCE(x);
+ WRITE_ONCE(y, r1 * 0);
+
+There appears to be a data dependency from the load of x to the store
+of y, since the value to be stored is computed from the value that was
+loaded. But in fact, the value stored does not really depend on
+anything since it will always be 0. Thus the data dependency is only
+syntactic (it appears to exist in the code) but not semantic (the
+second access will always be the same, regardless of the value of the
+first access). Given code like this, a compiler could simply discard
+the value returned by the load from x, which would certainly destroy
+any dependency. (The compiler is not permitted to eliminate entirely
+the load generated for a READ_ONCE() -- that's one of the nice
+properties of READ_ONCE() -- but it is allowed to ignore the load's
+value.)
+
+It's natural to object that no one in their right mind would write
+code like the above. However, macro expansions can easily give rise
+to this sort of thing, in ways that often are not apparent to the
+programmer.
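+
+For example, consider a hypothetical configuration macro that happens
+to expand to zero:
+
+	#define SCALE	0	/* Scaling disabled in this configuration. */
+
+	r1 = READ_ONCE(x);
+	WRITE_ONCE(y, r1 * SCALE);
+
+After expansion this is exactly the r1 * 0 case above: the dependency
+looks real in the source but is purely syntactic.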
+
+Another mechanism that can lead to purely syntactic dependencies is
+related to the notion of "undefined behavior". Certain program
+behaviors are called "undefined" in the C language specification,
+which means that when they occur there are no guarantees at all about
+the outcome. Consider the following example:
+
+ int a[1];
+ int i;
+
+ r1 = READ_ONCE(i);
+ r2 = READ_ONCE(a[r1]);
+
+Access beyond the end or before the beginning of an array is one kind
+of undefined behavior. Therefore the compiler doesn't have to worry
+about what will happen if r1 is nonzero, and it can assume that r1
+will always be zero regardless of the value actually loaded from i.
+(If the assumption turns out to be wrong the resulting behavior will
+be undefined anyway, so the compiler doesn't care!) Thus the value
+from the load can be discarded, breaking the address dependency.
+
+The LKMM is unaware that purely syntactic dependencies are different
+from semantic dependencies and therefore mistakenly predicts that the
+accesses in the two examples above will be ordered. This is another
+example of how the compiler can undermine the memory model. Be warned.
+
THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------
@@ -955,6 +1008,36 @@ order. Equivalently,
where the rmw relation links the read and write events making up each
atomic update. This is what the LKMM's "atomic" axiom says.
+Atomic rmw updates play one more role in the LKMM: They can form "rmw
+sequences". An rmw sequence is simply a bunch of atomic updates where
+each update reads from the previous one. Written using events, it
+looks like this:
+
+ Z0 ->rf Y1 ->rmw Z1 ->rf ... ->rf Yn ->rmw Zn,
+
+where Z0 is some store event and n can be any number (even 0, in the
+degenerate case). We write this relation as: Z0 ->rmw-sequence Zn.
+Note that this implies Z0 and Zn are stores to the same variable.
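+
+For example (an illustrative sketch, assuming these are the only
+updates to x and that each atomic update reads from the store just
+before it):
+
+	atomic_t x = ATOMIC_INIT(0);
+
+	atomic_set(&x, 1);	/* Z0: the store heading the sequence. */
+	atomic_inc(&x);		/* Y1 reads from Z0; Y1 ->rmw Z1. */
+	atomic_add(2, &x);	/* Y2 reads from Z1; Y2 ->rmw Z2. */
+
+This gives Z0 ->rmw-sequence Z2 (and also Z0 ->rmw-sequence Z1, plus
+the degenerate Z0 ->rmw-sequence Z0).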
+
+Rmw sequences have a special property in the LKMM: They can extend the
+cumul-fence relation. That is, if we have:
+
+ U ->cumul-fence X ->rmw-sequence Y
+
+then also U ->cumul-fence Y. Thinking about this in terms of the
+operational model, U ->cumul-fence X says that the store U propagates
+to each CPU before the store X does. Then the fact that X and Y are
+linked by an rmw sequence means that U also propagates to each CPU
+before Y does. In an analogous way, rmw sequences can also extend
+the w-post-bounded relation defined below in the PLAIN ACCESSES AND
+DATA RACES section.
+
+(The notion of rmw sequences in the LKMM is similar to, but not quite
+the same as, that of release sequences in the C11 memory model. They
+were added to the LKMM to fix an obscure bug; without them, atomic
+updates with full-barrier semantics did not always guarantee ordering
+at least as strong as atomic updates with release-barrier semantics.)
+
THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------
@@ -1122,12 +1205,10 @@ maintain at least the appearance of FIFO order.
In practice, this difficulty is solved by inserting a special fence
between P1's two loads when the kernel is compiled for the Alpha
architecture. In fact, as of version 4.15, the kernel automatically
-adds this fence (called smp_read_barrier_depends() and defined as
-nothing at all on non-Alpha builds) after every READ_ONCE() and atomic
-load. The effect of the fence is to cause the CPU not to execute any
-po-later instructions until after the local cache has finished
-processing all the stores it has already received. Thus, if the code
-was changed to:
+adds this fence after every READ_ONCE() and atomic load on Alpha. The
+effect of the fence is to cause the CPU not to execute any po-later
+instructions until after the local cache has finished processing all
+the stores it has already received. Thus, if the code was changed to:
P1()
{
@@ -1146,14 +1227,14 @@ READ_ONCE() or another synchronization primitive rather than accessed
directly.
The LKMM requires that smp_rmb(), acquire fences, and strong fences
-share this property with smp_read_barrier_depends(): They do not allow
-the CPU to execute any po-later instructions (or po-later loads in the
-case of smp_rmb()) until all outstanding stores have been processed by
-the local cache. In the case of a strong fence, the CPU first has to
-wait for all of its po-earlier stores to propagate to every other CPU
-in the system; then it has to wait for the local cache to process all
-the stores received as of that time -- not just the stores received
-when the strong fence began.
+share this property: They do not allow the CPU to execute any po-later
+instructions (or po-later loads in the case of smp_rmb()) until all
+outstanding stores have been processed by the local cache. In the
+case of a strong fence, the CPU first has to wait for all of its
+po-earlier stores to propagate to every other CPU in the system; then
+it has to wait for the local cache to process all the stores received
+as of that time -- not just the stores received when the strong fence
+began.
And of course, none of this matters for any architecture other than
Alpha.
@@ -1768,14 +1849,169 @@ section in P0 both starts before P1's grace period does and ends
before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.
-Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
-addition to normal RCU. The ideas involved are much the same as
-above, with new relations srcu-gp and srcu-rscsi added to represent
-SRCU grace periods and read-side critical sections. There is a
-restriction on the srcu-gp and srcu-rscsi links that can appear in an
-rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
-links having the same SRCU domain with proper nesting); the details
-are relatively unimportant.
+The LKMM supports SRCU (Sleepable Read-Copy-Update) in addition to
+normal RCU. The ideas involved are much the same as above, with new
+relations srcu-gp and srcu-rscsi added to represent SRCU grace periods
+and read-side critical sections. However, there are some significant
+differences between RCU read-side critical sections and their SRCU
+counterparts, as described in the next section.
+
+
+SRCU READ-SIDE CRITICAL SECTIONS
+--------------------------------
+
+The LKMM uses the srcu-rscsi relation to model SRCU read-side critical
+sections. They differ from RCU read-side critical sections in the
+following respects:
+
+1. Unlike the analogous RCU primitives, synchronize_srcu(),
+ srcu_read_lock(), and srcu_read_unlock() take a pointer to a
+ struct srcu_struct as an argument. This structure is called
+ an SRCU domain, and calls linked by srcu-rscsi must have the
+ same domain. Read-side critical sections and grace periods
+ associated with different domains are independent of one
+ another; the SRCU version of the RCU Guarantee applies only
+ to pairs of critical sections and grace periods having the
+ same domain.
+
+2. srcu_read_lock() returns a value, called the index, which must
+ be passed to the matching srcu_read_unlock() call. Unlike
+ rcu_read_lock() and rcu_read_unlock(), an srcu_read_lock()
+ call does not always have to match the next unpaired
+ srcu_read_unlock(). In fact, it is possible for two SRCU
+ read-side critical sections to overlap partially, as in the
+ following example (where s is an srcu_struct and idx1 and idx2
+ are integer variables):
+
+ idx1 = srcu_read_lock(&s); // Start of first RSCS
+ idx2 = srcu_read_lock(&s); // Start of second RSCS
+ srcu_read_unlock(&s, idx1); // End of first RSCS
+ srcu_read_unlock(&s, idx2); // End of second RSCS
+
+ The matching is determined entirely by the domain pointer and
+ index value. By contrast, if the calls had been
+ rcu_read_lock() and rcu_read_unlock() then they would have
+ created two nested (fully overlapping) read-side critical
+ sections: an inner one and an outer one.
+
+3. The srcu_down_read() and srcu_up_read() primitives work
+ exactly like srcu_read_lock() and srcu_read_unlock(), except
+ that matching calls don't have to execute on the same CPU.
+ (The names are meant to be suggestive of operations on
+ semaphores.) Since the matching is determined by the domain
+ pointer and index value, these primitives make it possible for
+ an SRCU read-side critical section to start on one CPU and end
+ on another, so to speak.
+
+In order to account for these properties of SRCU, the LKMM models
+srcu_read_lock() as a special type of load event (which is
+appropriate, since it takes a memory location as argument and returns
+a value, just as a load does) and srcu_read_unlock() as a special type
+of store event (again appropriate, since it takes as arguments a
+memory location and a value). These loads and stores are annotated as
+belonging to the "srcu-lock" and "srcu-unlock" event classes
+respectively.
+
+This approach allows the LKMM to tell whether two events are
+associated with the same SRCU domain, simply by checking whether they
+access the same memory location (i.e., they are linked by the loc
+relation). It also gives a way to tell which unlock matches a
+particular lock, by checking for the presence of a data dependency
+from the load (srcu-lock) to the store (srcu-unlock). For example,
+given the situation outlined earlier (with statement labels added):
+
+ A: idx1 = srcu_read_lock(&s);
+ B: idx2 = srcu_read_lock(&s);
+ C: srcu_read_unlock(&s, idx1);
+ D: srcu_read_unlock(&s, idx2);
+
+the LKMM will treat A and B as loads from s yielding values saved in
+idx1 and idx2 respectively. Similarly, it will treat C and D as
+though they stored the values from idx1 and idx2 in s. The end result
+is much as if we had written:
+
+ A: idx1 = READ_ONCE(s);
+ B: idx2 = READ_ONCE(s);
+ C: WRITE_ONCE(s, idx1);
+ D: WRITE_ONCE(s, idx2);
+
+except for the presence of the special srcu-lock and srcu-unlock
+annotations. You can see at once that we have A ->data C and
+B ->data D. These dependencies tell the LKMM that C is the
+srcu-unlock event matching srcu-lock event A, and D is the
+srcu-unlock event matching srcu-lock event B.
+
+This approach is admittedly a hack, and it has the potential to lead
+to problems. For example, in:
+
+ idx1 = srcu_read_lock(&s);
+ srcu_read_unlock(&s, idx1);
+ idx2 = srcu_read_lock(&s);
+ srcu_read_unlock(&s, idx2);
+
+the LKMM will believe that idx2 must have the same value as idx1,
+since it reads from the immediately preceding store of idx1 in s.
+Fortunately this won't matter, assuming that litmus tests never do
+anything with SRCU index values other than pass them to
+srcu_read_unlock() or srcu_up_read() calls.
+
+However, sometimes it is necessary to store an index value in a
+shared variable temporarily. In fact, this is the only way for
+srcu_down_read() to pass the index it gets to an srcu_up_read() call
+on a different CPU. In more detail, we might have something like:
+
+ struct srcu_struct s;
+ int x;
+
+ P0()
+ {
+ int r0;
+
+ A: r0 = srcu_down_read(&s);
+ B: WRITE_ONCE(x, r0);
+ }
+
+ P1()
+ {
+ int r1;
+
+ C: r1 = READ_ONCE(x);
+ D: srcu_up_read(&s, r1);
+ }
+
+Assuming that P1 executes after P0 and does read the index value
+stored in x, we can write this (using brackets to represent event
+annotations) as:
+
+ A[srcu-lock] ->data B[once] ->rf C[once] ->data D[srcu-unlock].
+
+The LKMM defines a carry-srcu-data relation to express this pattern;
+it permits an arbitrarily long sequence of
+
+ data ; rf
+
+pairs (that is, a data link followed by an rf link) to occur between
+an srcu-lock event and the final data dependency leading to the
+matching srcu-unlock event. carry-srcu-data is complicated by the
+need to ensure that none of the intermediate store events in this
+sequence are instances of srcu-unlock. This is necessary because in a
+pattern like the one above:
+
+ A: idx1 = srcu_read_lock(&s);
+ B: srcu_read_unlock(&s, idx1);
+ C: idx2 = srcu_read_lock(&s);
+ D: srcu_read_unlock(&s, idx2);
+
+the LKMM treats B as a store to the variable s and C as a load from
+that variable, creating an undesirable rf link from B to C:
+
+ A ->data B ->rf C ->data D.
+
+This would cause carry-srcu-data to mistakenly extend a data
+dependency from A to D, giving the impression that D was the
+srcu-unlock event matching A's srcu-lock. To avoid such problems,
+carry-srcu-data does not accept sequences in which the end of any of
+the intermediate ->data links (B above) is an srcu-unlock event.
LOCKING
@@ -1815,15 +2051,16 @@ spin_trylock() -- we can call these things lock-releases and
lock-acquires -- have two properties beyond those of ordinary releases
and acquires.
-First, when a lock-acquire reads from a lock-release, the LKMM
-requires that every instruction po-before the lock-release must
-execute before any instruction po-after the lock-acquire. This would
-naturally hold if the release and acquire operations were on different
-CPUs, but the LKMM says it holds even when they are on the same CPU.
-For example:
+First, when a lock-acquire reads from or is po-after a lock-release,
+the LKMM requires that every instruction po-before the lock-release
+must execute before any instruction po-after the lock-acquire. This
+would naturally hold if the release and acquire operations were on
+different CPUs and accessed the same lock variable, but the LKMM says
+it also holds when they are on the same CPU, even if they access
+different lock variables. For example:
int x, y;
- spinlock_t s;
+ spinlock_t s, t;
P0()
{
@@ -1832,9 +2069,9 @@ For example:
spin_lock(&s);
r1 = READ_ONCE(x);
spin_unlock(&s);
- spin_lock(&s);
+ spin_lock(&t);
r2 = READ_ONCE(y);
- spin_unlock(&s);
+ spin_unlock(&t);
}
P1()
@@ -1844,10 +2081,10 @@ For example:
WRITE_ONCE(x, 1);
}
-Here the second spin_lock() reads from the first spin_unlock(), and
-therefore the load of x must execute before the load of y. Thus we
-cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
-MP pattern).
+Here the second spin_lock() is po-after the first spin_unlock(), and
+therefore the load of x must execute before the load of y, even though
+the two locking operations use different locks. Thus we cannot have
+r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).
This requirement does not apply to ordinary release and acquire
fences, only to lock-related operations. For instance, suppose P0()
@@ -1874,13 +2111,13 @@ instructions in the following order:
and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
-Second, when a lock-acquire reads from a lock-release, and some other
-stores W and W' occur po-before the lock-release and po-after the
-lock-acquire respectively, the LKMM requires that W must propagate to
-each CPU before W' does. For example, consider:
+Second, when a lock-acquire reads from or is po-after a lock-release,
+and some other stores W and W' occur po-before the lock-release and
+po-after the lock-acquire respectively, the LKMM requires that W must
+propagate to each CPU before W' does. For example, consider:
int x, y;
- spinlock_t x;
+ spinlock_t s;
P0()
{
@@ -1910,7 +2147,12 @@ each CPU before W' does. For example, consider:
If r1 = 1 at the end then the spin_lock() in P1 must have read from
the spin_unlock() in P0. Hence the store to x must propagate to P2
-before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+before the store to y does, so we cannot have r2 = 1 and r3 = 0. But
+if P1 had used a lock variable different from s, the writes could have
+propagated in either order. (On the other hand, if the code in P0 and
+P1 had all executed on a single CPU, as in the example before this
+one, then the writes would have propagated in order even if the two
+critical sections used different lock variables.)
These two special requirements for lock-release and lock-acquire do
not arise from the operational model. Nevertheless, kernel developers
@@ -1987,28 +2229,36 @@ outcome undefined.
In technical terms, the compiler is allowed to assume that when the
program executes, there will not be any data races. A "data race"
-occurs when two conflicting memory accesses execute concurrently;
-two memory accesses "conflict" if:
+occurs when there are two memory accesses such that:
+
+1. they access the same location,
- they access the same location,
+2. at least one of them is a store,
- they occur on different CPUs (or in different threads on the
- same CPU),
+3. at least one of them is plain,
- at least one of them is a plain access,
+4. they occur on different CPUs (or in different threads on the
+ same CPU), and
- and at least one of them is a store.
+5. they execute concurrently.
-The LKMM tries to determine whether a program contains two conflicting
-accesses which may execute concurrently; if it does then the LKMM says
-there is a potential data race and makes no predictions about the
-program's outcome.
+In the literature, two accesses are said to "conflict" if they satisfy
+1 and 2 above. We'll go a little farther and say that two accesses
+are "race candidates" if they satisfy 1 - 4. Thus, whether or not two
+race candidates actually do race in a given execution depends on
+whether they are concurrent.
-Determining whether two accesses conflict is easy; you can see that
-all the concepts involved in the definition above are already part of
-the memory model. The hard part is telling whether they may execute
-concurrently. The LKMM takes a conservative attitude, assuming that
-accesses may be concurrent unless it can prove they cannot.
+The LKMM tries to determine whether a program contains race candidates
+which may execute concurrently; if it does then the LKMM says there is
+a potential data race and makes no predictions about the program's
+outcome.
+
+Determining whether two accesses are race candidates is easy; you can
+see that all the concepts involved in the definition above are already
+part of the memory model. The hard part is telling whether they may
+execute concurrently. The LKMM takes a conservative attitude,
+assuming that accesses may be concurrent unless it can prove they
+are not.
If two memory accesses aren't concurrent then one must execute before
the other. Therefore the LKMM decides two accesses aren't concurrent
@@ -2171,8 +2421,8 @@ again, now using plain accesses for buf:
}
This program does not contain a data race. Although the U and V
-accesses conflict, the LKMM can prove they are not concurrent as
-follows:
+accesses are race candidates, the LKMM can prove they are not
+concurrent as follows:
The smp_wmb() fence in P0 is both a compiler barrier and a
cumul-fence. It guarantees that no matter what hash of
@@ -2326,12 +2576,11 @@ could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).
Finally, it turns out there is a situation in which a plain write does
-not need to be w-post-bounded: when it is separated from the
-conflicting access by a fence. At first glance this may seem
-impossible. After all, to be conflicting the second access has to be
-on a different CPU from the first, and fences don't link events on
-different CPUs. Well, normal fences don't -- but rcu-fence can!
-Here's an example:
+not need to be w-post-bounded: when it is separated from the other
+race-candidate access by a fence. At first glance this may seem
+impossible. After all, to be race candidates the two accesses must
+be on different CPUs, and fences don't link events on different CPUs.
+Well, normal fences don't -- but rcu-fence can! Here's an example:
int x, y;
@@ -2367,7 +2616,7 @@ concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.
Putting all this material together yields the following picture. For
-two conflicting stores W and W', where W ->co W', the LKMM says the
+race-candidate stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a
w-post-bounded ; vis ; w-pre-bounded
@@ -2380,8 +2629,8 @@ sequence, and if W' is plain then they also have to be linked by a
w-post-bounded ; vis ; r-pre-bounded
-sequence. For a conflicting load R and store W, the LKMM says the two
-accesses don't race if R can be linked to W by an
+sequence. For race-candidate load R and store W, the LKMM says the
+two accesses don't race if R can be linked to W by an
r-post-bounded ; xb* ; w-pre-bounded
@@ -2413,20 +2662,20 @@ is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):
- If R and W conflict and it is possible to link R to W by one
- of the xb* sequences listed above, then W ->rfe R is not
- allowed (i.e., a load cannot read from a store that it
+ If R and W are race candidates and it is possible to link R to
+ W by one of the xb* sequences listed above, then W ->rfe R is
+ not allowed (i.e., a load cannot read from a store that it
executes before, even if one or both is plain).
- If W and R conflict and it is possible to link W to R by one
- of the vis sequences listed above, then R ->fre W is not
- allowed (i.e., if a store is visible to a load then the load
- must read from that store or one coherence-after it).
+ If W and R are race candidates and it is possible to link W to
+ R by one of the vis sequences listed above, then R ->fre W is
+ not allowed (i.e., if a store is visible to a load then the
+ load must read from that store or one coherence-after it).
- If W and W' conflict and it is possible to link W to W' by one
- of the vis sequences listed above, then W' ->co W is not
- allowed (i.e., if one store is visible to a second then the
- second must come after the first in the coherence order).
+ If W and W' are race candidates and it is possible to link W
+ to W' by one of the vis sequences listed above, then W' ->co W
+ is not allowed (i.e., if one store is visible to a second then
+ the second must come after the first in the coherence order).
This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
@@ -2482,7 +2731,7 @@ smp_store_release() -- which is basically how the Linux kernel treats
them.
Although we said that plain accesses are not linked by the ppo
-relation, they do contribute to it indirectly. Namely, when there is
+relation, they do contribute to it indirectly. Firstly, when there is
an address dependency from a marked load R to a plain store W,
followed by smp_wmb() and then a marked store W', the LKMM creates a
ppo link from R to W'. The reasoning behind this is perhaps a little
@@ -2491,6 +2740,13 @@ for this source code in which W' could execute before R. Just as with
pre-bounding by address dependencies, it is possible for the compiler
to undermine this relation if sufficient care is not taken.
+Secondly, plain accesses can carry dependencies: If a data dependency
+links a marked load R to a store W, and the store is read by a load R'
+from the same thread, then the data loaded by R' depends on the data
+loaded originally by R. Thus, if R' is linked to any access X by a
+dependency, R is also linked to access X by the same dependency, even
+if W or R' (or both!) are plain.
+
There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock(). The LKMM uses fence events with special
@@ -2505,7 +2761,7 @@ they behave as follows:
smp_mb__after_atomic() orders po-earlier atomic updates and
the events preceding them against all po-later events;
- smp_mb_after_spinlock() orders po-earlier lock acquisition
+ smp_mb__after_spinlock() orders po-earlier lock acquisition
events and the events preceding them against all po-later
events.
diff --git a/tools/memory-model/Documentation/glossary.txt b/tools/memory-model/Documentation/glossary.txt
new file mode 100644
index 000000000000..6f3d16dbf467
--- /dev/null
+++ b/tools/memory-model/Documentation/glossary.txt
@@ -0,0 +1,178 @@
+This document contains brief definitions of LKMM-related terms. Like most
+glossaries, it is not intended to be read front to back (except perhaps
+as a way of confirming a diagnosis of OCD), but rather to be searched
+for specific terms.
+
+
+Address Dependency: When the address of a later memory access is computed
+ based on the value returned by an earlier load, an "address
+	dependency" extends from that load to the later access.
+ Address dependencies are quite common in RCU read-side critical
+ sections:
+
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 do_something(p->a);
+ 4 rcu_read_unlock();
+
+ In this case, because the address of "p->a" on line 3 is computed
+ from the value returned by the rcu_dereference() on line 2, the
+ address dependency extends from that rcu_dereference() to that
+ "p->a". In rare cases, optimizing compilers can destroy address
+ dependencies. Please see Documentation/RCU/rcu_dereference.rst
+ for more information.
+
+ See also "Control Dependency" and "Data Dependency".
+
+Acquire: With respect to a lock, acquiring that lock, for example,
+ using spin_lock(). With respect to a non-lock shared variable,
+ a special operation that includes a load and which orders that
+ load before later memory references running on that same CPU.
+ An example special acquire operation is smp_load_acquire(),
+ but atomic_read_acquire() and atomic_xchg_acquire() also include
+ acquire loads.
+
+ When an acquire load returns the value stored by a release store
+	to that same variable (in other words, the acquire load "reads
+ from" the release store), then all operations preceding that
+ store "happen before" any operations following that load acquire.
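+
+	For example, consider the following sketch, in which "x" and "y"
+	are assumed to both be initially zero:
+
+		CPU 0				CPU 1
+		WRITE_ONCE(x, 1);		r0 = smp_load_acquire(&y);
+		smp_store_release(&y, 1);	r1 = READ_ONCE(x);
+
+	If CPU 1's acquire load sees the value stored by CPU 0's release
+	store (that is, if r0 == 1), then CPU 1's later READ_ONCE() is
+	guaranteed to also see the value 1 (that is, r1 == 1).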
+
+ See also "Happens-Before", "Reads-From", "Relaxed", and "Release".
+
+Coherence (co): When one CPU's store to a given variable overwrites
+ either the value from another CPU's store or some later value,
+ there is said to be a coherence link from the second CPU to
+ the first.
+
+ It is also possible to have a coherence link within a CPU, which
+ is a "coherence internal" (coi) link. The term "coherence
+ external" (coe) link is used when it is necessary to exclude
+ the coi case.
+
+ See also "From-reads" and "Reads-from".
+
+Control Dependency: When a later store's execution depends on a test
+ of a value computed from a value returned by an earlier load,
+ a "control dependency" extends from that load to that store.
+ For example:
+
+ 1 if (READ_ONCE(x))
+ 2 WRITE_ONCE(y, 1);
+
+ Here, the control dependency extends from the READ_ONCE() on
+ line 1 to the WRITE_ONCE() on line 2. Control dependencies are
+ fragile, and can be easily destroyed by optimizing compilers.
+ Please see control-dependencies.txt for more information.
+
+ See also "Address Dependency" and "Data Dependency".
+
+Cycle: Memory-barrier pairing is restricted to a pair of CPUs, as the
+ name suggests. And in a great many cases, a pair of CPUs is all
+ that is required. In other cases, the notion of pairing must be
+ extended to additional CPUs, and the result is called a "cycle".
+ In a cycle, each CPU's ordering interacts with that of the next:
+
+ CPU 0 CPU 1 CPU 2
+ WRITE_ONCE(x, 1); WRITE_ONCE(y, 1); WRITE_ONCE(z, 1);
+ smp_mb(); smp_mb(); smp_mb();
+ r0 = READ_ONCE(y); r1 = READ_ONCE(z); r2 = READ_ONCE(x);
+
+ CPU 0's smp_mb() interacts with that of CPU 1, which interacts
+ with that of CPU 2, which in turn interacts with that of CPU 0
+ to complete the cycle. Because of the smp_mb() calls between
+ each pair of memory accesses, the outcome where r0, r1, and r2
+ are all equal to zero is forbidden by LKMM.
+
+ See also "Pairing".
+
+Data Dependency: When the data written by a later store is computed based
+ on the value returned by an earlier load, a "data dependency"
+ extends from that load to that later store. For example:
+
+ 1 r1 = READ_ONCE(x);
+ 2 WRITE_ONCE(y, r1 + 1);
+
+ In this case, the data dependency extends from the READ_ONCE()
+ on line 1 to the WRITE_ONCE() on line 2. Data dependencies are
+ fragile and can be easily destroyed by optimizing compilers.
+ Because optimizing compilers put a great deal of effort into
+ working out what values integer variables might have, this is
+ especially true in cases where the dependency is carried through
+ an integer.
+
+ See also "Address Dependency" and "Control Dependency".
+
+From-Reads (fr): When one CPU's store to a given variable happened
+ too late to affect the value returned by another CPU's
+ load from that same variable, there is said to be a from-reads
+ link from the load to the store.
+
+ It is also possible to have a from-reads link within a CPU, which
+ is a "from-reads internal" (fri) link. The term "from-reads
+ external" (fre) link is used when it is necessary to exclude
+ the fri case.
+
+ See also "Coherence" and "Reads-from".
+
+Fully Ordered: An operation such as smp_mb() that orders all of
+ its CPU's prior accesses with all of that CPU's subsequent
+ accesses, or a marked access such as atomic_add_return()
+ that orders all of its CPU's prior accesses, itself, and
+ all of its CPU's subsequent accesses.
+
+Happens-Before (hb): A relation between two accesses in which LKMM
+ guarantees the first access precedes the second. For more
+ detail, please see the "THE HAPPENS-BEFORE RELATION: hb"
+ section of explanation.txt.
+
+Marked Access: An access to a variable that uses a special function or
+ macro such as "r1 = READ_ONCE(x)" or "smp_store_release(&a, 1)".
+
+ See also "Unmarked Access".
+
+Pairing: "Memory-barrier pairing" reflects the fact that synchronizing
+	data between two CPUs requires that both CPUs order their accesses.
+ Memory barriers thus tend to come in pairs, one executed by
+ one of the CPUs and the other by the other CPU. Of course,
+	pairing also occurs with other types of operations, so that an
+ smp_store_release() pairs with an smp_load_acquire() that reads
+ the value stored.
+
+ See also "Cycle".
+
+Reads-From (rf): When one CPU's load returns the value stored by some other
+ CPU, there is said to be a reads-from link from the second
+ CPU's store to the first CPU's load. Reads-from links have the
+ nice property that time must advance from the store to the load,
+ which means that algorithms using reads-from links can use lighter
+ weight ordering and synchronization compared to algorithms using
+ coherence and from-reads links.
+
+ It is also possible to have a reads-from link within a CPU, which
+ is a "reads-from internal" (rfi) link. The term "reads-from
+ external" (rfe) link is used when it is necessary to exclude
+ the rfi case.
+
+	See also "Coherence" and "From-reads".
+
+Relaxed: A marked access that does not imply ordering, for example, a
+ READ_ONCE(), WRITE_ONCE(), a non-value-returning read-modify-write
+ operation, or a value-returning read-modify-write operation whose
+ name ends in "_relaxed".
+
+ See also "Acquire" and "Release".
+
+Release: With respect to a lock, releasing that lock, for example,
+ using spin_unlock(). With respect to a non-lock shared variable,
+ a special operation that includes a store and which orders that
+ store after earlier memory references that ran on that same CPU.
+ An example special release store is smp_store_release(), but
+ atomic_set_release() and atomic_cmpxchg_release() also include
+ release stores.
+
+ See also "Acquire" and "Relaxed".
+
+Unmarked Access: An access to a variable that uses normal C-language
+	syntax, for example, "a = b[2];".
+
+ See also "Marked Access".
diff --git a/tools/memory-model/Documentation/litmus-tests.txt b/tools/memory-model/Documentation/litmus-tests.txt
new file mode 100644
index 000000000000..acac527328a1
--- /dev/null
+++ b/tools/memory-model/Documentation/litmus-tests.txt
@@ -0,0 +1,1083 @@
+Linux-Kernel Memory Model Litmus Tests
+======================================
+
+This file describes the LKMM litmus-test format by example, describes
+some tricks and traps, and finally outlines LKMM's limitations. Earlier
+versions of this material appeared in a number of LWN articles, including:
+
+https://lwn.net/Articles/720550/
+ A formal kernel memory-ordering model (part 2)
+https://lwn.net/Articles/608550/
+ Axiomatic validation of memory barriers and atomic instructions
+https://lwn.net/Articles/470681/
+ Validating Memory Barriers and Atomic Instructions
+
+This document presents information in decreasing order of applicability,
+so that, where possible, the information that has proven more commonly
+useful is shown near the beginning.
+
+For information on installing LKMM, including the underlying "herd7"
+tool, please see tools/memory-model/README.
+
+
+Copy-Pasta
+==========
+
+As with other software, it is often better (if less macho) to adapt an
+existing litmus test than it is to create one from scratch. A number
+of litmus tests may be found in the kernel source tree:
+
+ tools/memory-model/litmus-tests/
+ Documentation/litmus-tests/
+
+Several thousand more example litmus tests are available on github
+and kernel.org:
+
+ https://github.com/paulmckrcu/litmus
+ https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/CodeSamples/formal/herd
+ https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/CodeSamples/formal/litmus
+
+The -l and -L arguments to "git grep" can be quite helpful in identifying
+existing litmus tests that are similar to the one you need. But even if
+you start with an existing litmus test, it is still helpful to have a
+good understanding of the litmus-test format.
+
+
+Examples and Format
+===================
+
+This section describes the overall format of litmus tests, starting
+with a small example of the message-passing pattern and moving on to
+more complex examples that illustrate explicit initialization and LKMM's
+minimalistic set of flow-control statements.
+
+
+Message-Passing Example
+-----------------------
+
+This section gives an overview of the format of a litmus test using an
+example based on the common message-passing use case. This use case
+appears often in the Linux kernel. For example, a flag (modeled by "y"
+below) indicates that a buffer (modeled by "x" below) is now completely
+filled in and ready for use. It would be very bad if the consumer saw the
+flag set, but, due to memory misordering, saw old values in the buffer.
+
+This example asks whether smp_store_release() and smp_load_acquire()
+suffice to avoid this bad outcome:
+
+ 1 C MP+pooncerelease+poacquireonce
+ 2
+ 3 {}
+ 4
+ 5 P0(int *x, int *y)
+ 6 {
+ 7 WRITE_ONCE(*x, 1);
+ 8 smp_store_release(y, 1);
+ 9 }
+10
+11 P1(int *x, int *y)
+12 {
+13 int r0;
+14 int r1;
+15
+16 r0 = smp_load_acquire(y);
+17 r1 = READ_ONCE(*x);
+18 }
+19
+20 exists (1:r0=1 /\ 1:r1=0)
+
+Line 1 starts with "C", which identifies this file as being in the
+LKMM C-language format (which, as we will see, is a small fragment
+of the full C language). The remainder of line 1 is the name of
+the test, which by convention is the filename with the ".litmus"
+suffix stripped. In this case, the actual test may be found in
+tools/memory-model/litmus-tests/MP+pooncerelease+poacquireonce.litmus
+in the Linux-kernel source tree.
+
+Mechanically generated litmus tests will often have an optional
+double-quoted comment string on the second line. Such strings are ignored
+when running the test. Yes, you can add your own comments to litmus
+tests, but this is a bit involved due to the use of multiple parsers.
+For now, you can use C-language comments in the C code, and these comments
+may be in either the "/* */" or the "//" style. A later section will
+cover the full litmus-test commenting story.
+
+Line 3 is the initialization section. Because the default initialization
+to zero suffices for this test, the "{}" syntax is used, which means the
+initialization section is empty. Litmus tests requiring non-default
+initialization must have non-empty initialization sections, as in the
+example that will be presented later in this document.
+
+Lines 5-9 show the first process and lines 11-18 the second process. Each
+process corresponds to a Linux-kernel task (or kthread, workqueue, thread,
+and so on; LKMM discussions often use these terms interchangeably).
+The name of the first process is "P0" and that of the second "P1".
+You can name your processes anything you like as long as the names consist
+of a single "P" followed by a number, and as long as the numbers are
+consecutive starting with zero.  This can actually be quite helpful:
+for example, a .litmus file matching "^P1(" but not matching "^P2("
+must contain a two-process litmus test.
+
+The arguments for each function are pointers to the global variables
+used by that function. Unlike normal C-language function parameters, the
+names are significant. The fact that both P0() and P1() have a formal
+parameter named "x" means that these two processes are working with the
+same global variable, also named "x". So the "int *x, int *y" on P0()
+and P1() mean that both processes are working with two shared global
+variables, "x" and "y". Global variables are always passed to processes
+by reference, hence "P0(int *x, int *y)", but *never* "P0(int x, int y)".
+
+P0() has no local variables, but P1() has two of them named "r0" and "r1".
+These names may be freely chosen, but for historical reasons stemming from
+other litmus-test formats, it is conventional to use names consisting of
+"r" followed by a number as shown here. A common bug in litmus tests
+is forgetting to add a global variable to a process's parameter list.
+This will sometimes result in an error message, but can also cause the
+intended global to instead be silently treated as an undeclared local
+variable.
+
+Each process's code is similar to Linux-kernel C, as can be seen on lines
+7-8 and 13-17. This code may use many of the Linux kernel's atomic
+operations, some of its exclusive-lock functions, and some of its RCU
+and SRCU functions. An approximate list of the currently supported
+functions may be found in the linux-kernel.def file.
+
+The P0() process does "WRITE_ONCE(*x, 1)" on line 7. Because "x" is a
+pointer in P0()'s parameter list, this does an unordered store to global
+variable "x". Line 8 does "smp_store_release(y, 1)", and because "y"
+is also in P0()'s parameter list, this does a release store to global
+variable "y".
+
+The P1() process declares two local variables on lines 13 and 14.
+Line 16 does "r0 = smp_load_acquire(y)" which does an acquire load
+from global variable "y" into local variable "r0". Line 17 does a
+"r1 = READ_ONCE(*x)", which does an unordered load from "*x" into local
+variable "r1". Both "x" and "y" are in P1()'s parameter list, so both
+reference the same global variables that are used by P0().
+
+Line 20 is the "exists" assertion expression to evaluate the final state.
+This final state is evaluated after the dust has settled: both processes
+have completed and all of their memory references and memory barriers
+have propagated to all parts of the system. The references to the local
+variables "r0" and "r1" in line 20 must be prefixed with "1:" to specify
+which process they are local to.
+
+Note that the assertion expression is written in the litmus-test
+language rather than in C. For example, single "=" is an equality
+operator rather than an assignment. The "/\" character combination means
+"and". Similarly, "\/" stands for "or". Both of these are ASCII-art
+representations of the corresponding mathematical symbols. Finally,
+"~" stands for "logical not", which is "!" in C, and not to be confused
+with the C-language "~" operator which instead stands for "bitwise not".
+Parentheses may be used to override precedence.
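+
+For example, the following purely illustrative assertion (not taken
+from any in-tree test) is satisfied whenever "0:r0" is non-zero or
+"1:r1" is equal to 2, or both:
+
+	exists (~(0:r0=0) \/ 1:r1=2)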
+
+The "exists" assertion on line 20 is satisfied if the consumer sees the
+flag ("y") set but the buffer ("x") as not yet filled in, that is, if P1()
+loaded a value from "y" that was equal to 1 but loaded a value from "x"
+that was still equal to zero.
+
+This example can be checked by running the following command, which
+absolutely must be run from the tools/memory-model directory and from
+this directory only:
+
+herd7 -conf linux-kernel.cfg litmus-tests/MP+pooncerelease+poacquireonce.litmus
+
+The output is the result of something similar to a full state-space
+search, and is as follows:
+
+ 1 Test MP+pooncerelease+poacquireonce Allowed
+ 2 States 3
+ 3 1:r0=0; 1:r1=0;
+ 4 1:r0=0; 1:r1=1;
+ 5 1:r0=1; 1:r1=1;
+ 6 No
+ 7 Witnesses
+ 8 Positive: 0 Negative: 3
+ 9 Condition exists (1:r0=1 /\ 1:r1=0)
+10 Observation MP+pooncerelease+poacquireonce Never 0 3
+11 Time MP+pooncerelease+poacquireonce 0.00
+12 Hash=579aaa14d8c35a39429b02e698241d09
+
+The most pertinent line is line 10, which contains "Never 0 3", which
+indicates that the bad result flagged by the "exists" clause never
+happens. This line might instead say "Sometimes" to indicate that the
+bad result happened in some but not all executions, or it might say
+"Always" to indicate that the bad result happened in all executions.
+(The herd7 tool doesn't judge, so it is only an LKMM convention that the
+"exists" clause indicates a bad result. To see this, invert the "exists"
+clause's condition and run the test.) The numbers ("0 3") at the end
+of this line indicate the number of end states satisfying the "exists"
+clause (0) and the number not satisfying that clause (3).
+
+Another important part of this output is shown in lines 2-5, repeated here:
+
+ 2 States 3
+ 3 1:r0=0; 1:r1=0;
+ 4 1:r0=0; 1:r1=1;
+ 5 1:r0=1; 1:r1=1;
+
+Line 2 gives the total number of end states, and each of lines 3-5 lists
+one of these states, with the first ("1:r0=0; 1:r1=0;") indicating that
+both of P1()'s loads returned the value "0". As expected, given the
+"Never" on line 10, the state flagged by the "exists" clause is not
+listed. This full list of states can be helpful when debugging a new
+litmus test.
+
+The rest of the output is not normally needed, either due to irrelevance
+or due to being redundant with the lines discussed above. However, the
+following paragraph lists them for the benefit of readers possessed of
+an insatiable curiosity. Other readers should feel free to skip ahead.
+
+Line 1 echoes the test name, along with the "Test" and "Allowed".  Line 6's
+"No" says that the "exists" clause was not satisfied by any execution,
+and as such it has the same meaning as line 10's "Never". Line 7 is a
+lead-in to line 8's "Positive: 0 Negative: 3", which lists the number
+of end states satisfying and not satisfying the "exists" clause, just
+like the two numbers at the end of line 10. Line 9 repeats the "exists"
+clause so that you don't have to look it up in the litmus-test file.
+The number at the end of line 11 (which begins with "Time") gives the
+time in seconds required to analyze the litmus test. Small tests such
+as this one complete in a few milliseconds, so "0.00" is quite common.
+Line 12 gives a hash of the contents for the litmus-test file, and is used
+by tooling that manages litmus tests and their output. This tooling is
+used by people modifying LKMM itself, and among other things lets such
+people know which of the several thousand relevant litmus tests were
+affected by a given change to LKMM.
+
+
+Initialization
+--------------
+
+The previous example relied on the default zero initialization for
+"x" and "y", but a similar litmus test could instead initialize them
+to some other value:
+
+ 1 C MP+pooncerelease+poacquireonce
+ 2
+ 3 {
+ 4 x=42;
+ 5 y=42;
+ 6 }
+ 7
+ 8 P0(int *x, int *y)
+ 9 {
+10 WRITE_ONCE(*x, 1);
+11 smp_store_release(y, 1);
+12 }
+13
+14 P1(int *x, int *y)
+15 {
+16 int r0;
+17 int r1;
+18
+19 r0 = smp_load_acquire(y);
+20 r1 = READ_ONCE(*x);
+21 }
+22
+23 exists (1:r0=1 /\ 1:r1=42)
+
+Lines 3-6 now initialize both "x" and "y" to the value 42. This also
+means that the "exists" clause on line 23 must change "1:r1=0" to
+"1:r1=42".
+
+Running the test gives the same overall result as before, but with the
+value 42 appearing in place of the value zero:
+
+ 1 Test MP+pooncerelease+poacquireonce Allowed
+ 2 States 3
+ 3 1:r0=1; 1:r1=1;
+ 4 1:r0=42; 1:r1=1;
+ 5 1:r0=42; 1:r1=42;
+ 6 No
+ 7 Witnesses
+ 8 Positive: 0 Negative: 3
+ 9 Condition exists (1:r0=1 /\ 1:r1=42)
+10 Observation MP+pooncerelease+poacquireonce Never 0 3
+11 Time MP+pooncerelease+poacquireonce 0.02
+12 Hash=ab9a9b7940a75a792266be279a980156
+
+It is tempting to avoid the open-coded repetitions of the value "42"
+by defining another global variable "initval=42" and replacing all
+occurrences of "42" with "initval". This will not, repeat *not*,
+initialize "x" and "y" to 42, but instead to the address of "initval"
+(try it!). See the section below on linked lists to learn more about
+why this approach to initialization can be useful.
+
+
+Control Structures
+------------------
+
+LKMM supports the C-language "if" statement, which allows modeling of
+conditional branches. In LKMM, conditional branches can affect ordering,
+but only if you are *very* careful (compilers are surprisingly able
+to optimize away conditional branches). The following example shows
+the "load buffering" (LB) use case that is used in the Linux kernel to
+synchronize between ring-buffer producers and consumers. In the example
+below, P0() is one side checking to see if an operation may proceed and
+P1() is the other side completing its update.
+
+ 1 C LB+fencembonceonce+ctrlonceonce
+ 2
+ 3 {}
+ 4
+ 5 P0(int *x, int *y)
+ 6 {
+ 7 int r0;
+ 8
+ 9 r0 = READ_ONCE(*x);
+10 if (r0)
+11 WRITE_ONCE(*y, 1);
+12 }
+13
+14 P1(int *x, int *y)
+15 {
+16 int r0;
+17
+18 r0 = READ_ONCE(*y);
+19 smp_mb();
+20 WRITE_ONCE(*x, 1);
+21 }
+22
+23 exists (0:r0=1 /\ 1:r0=1)
+
+P0()'s "if" statement on line 10 works as expected, so that line 11 is
+executed only if line 9 loads a non-zero value from "x". Because P1()'s
+write of "1" to "x" happens only after P1()'s read from "y", one would
+hope that the "exists" clause cannot be satisfied. LKMM agrees:
+
+ 1 Test LB+fencembonceonce+ctrlonceonce Allowed
+ 2 States 2
+ 3 0:r0=0; 1:r0=0;
+ 4 0:r0=1; 1:r0=0;
+ 5 No
+ 6 Witnesses
+ 7 Positive: 0 Negative: 2
+ 8 Condition exists (0:r0=1 /\ 1:r0=1)
+ 9 Observation LB+fencembonceonce+ctrlonceonce Never 0 2
+10 Time LB+fencembonceonce+ctrlonceonce 0.00
+11 Hash=e5260556f6de495fd39b556d1b831c3b
+
+Note that there is no "while" statement, because full state-space
+search has some difficulty with iteration.  However, there
+are tricks that may be used to handle some special cases, which are
+discussed below. In addition, loop-unrolling tricks may be applied,
+albeit sparingly.
+
+
+Tricks and Traps
+================
+
+This section covers extracting debug output from herd7, emulating
+spin loops, handling trivial linked lists, adding comments to litmus tests,
+emulating call_rcu(), and finally tricks to improve herd7 performance
+in order to better handle large litmus tests.
+
+
+Debug Output
+------------
+
+By default, the herd7 state output includes all variables mentioned
+in the "exists" clause. But sometimes debugging efforts are greatly
+aided by the values of other variables. Consider this litmus test
+(tools/memory-model/litmus-tests/SB+rfionceonce-poonceonces.litmus but
+slightly modified), which probes an obscure corner of hardware memory
+ordering:
+
+ 1 C SB+rfionceonce-poonceonces
+ 2
+ 3 {}
+ 4
+ 5 P0(int *x, int *y)
+ 6 {
+ 7 int r1;
+ 8 int r2;
+ 9
+10 WRITE_ONCE(*x, 1);
+11 r1 = READ_ONCE(*x);
+12 r2 = READ_ONCE(*y);
+13 }
+14
+15 P1(int *x, int *y)
+16 {
+17 int r3;
+18 int r4;
+19
+20 WRITE_ONCE(*y, 1);
+21 r3 = READ_ONCE(*y);
+22 r4 = READ_ONCE(*x);
+23 }
+24
+25 exists (0:r2=0 /\ 1:r4=0)
+
+The herd7 output is as follows:
+
+ 1 Test SB+rfionceonce-poonceonces Allowed
+ 2 States 4
+ 3 0:r2=0; 1:r4=0;
+ 4 0:r2=0; 1:r4=1;
+ 5 0:r2=1; 1:r4=0;
+ 6 0:r2=1; 1:r4=1;
+ 7 Ok
+ 8 Witnesses
+ 9 Positive: 1 Negative: 3
+10 Condition exists (0:r2=0 /\ 1:r4=0)
+11 Observation SB+rfionceonce-poonceonces Sometimes 1 3
+12 Time SB+rfionceonce-poonceonces 0.01
+13 Hash=c7f30fe0faebb7d565405d55b7318ada
+
+(This output indicates that CPUs are permitted to "snoop their own
+store buffers", which all of Linux's CPU families other than s390 will
+happily do. Such snooping results in disagreement among CPUs on the
+order of stores from different CPUs, which is rarely an issue.)
+
+But the herd7 output shows only the two variables mentioned in the
+"exists" clause. Someone modifying this test might wish to know the
+values of "x", "y", "0:r1", and "0:r3" as well. The "locations"
+statement on line 25 shows how to cause herd7 to display additional
+variables:
+
+ 1 C SB+rfionceonce-poonceonces
+ 2
+ 3 {}
+ 4
+ 5 P0(int *x, int *y)
+ 6 {
+ 7 int r1;
+ 8 int r2;
+ 9
+10 WRITE_ONCE(*x, 1);
+11 r1 = READ_ONCE(*x);
+12 r2 = READ_ONCE(*y);
+13 }
+14
+15 P1(int *x, int *y)
+16 {
+17 int r3;
+18 int r4;
+19
+20 WRITE_ONCE(*y, 1);
+21 r3 = READ_ONCE(*y);
+22 r4 = READ_ONCE(*x);
+23 }
+24
+25 locations [0:r1; 1:r3; x; y]
+26 exists (0:r2=0 /\ 1:r4=0)
+
+The herd7 output then displays the values of all the variables:
+
+ 1 Test SB+rfionceonce-poonceonces Allowed
+ 2 States 4
+ 3 0:r1=1; 0:r2=0; 1:r3=1; 1:r4=0; x=1; y=1;
+ 4 0:r1=1; 0:r2=0; 1:r3=1; 1:r4=1; x=1; y=1;
+ 5 0:r1=1; 0:r2=1; 1:r3=1; 1:r4=0; x=1; y=1;
+ 6 0:r1=1; 0:r2=1; 1:r3=1; 1:r4=1; x=1; y=1;
+ 7 Ok
+ 8 Witnesses
+ 9 Positive: 1 Negative: 3
+10 Condition exists (0:r2=0 /\ 1:r4=0)
+11 Observation SB+rfionceonce-poonceonces Sometimes 1 3
+12 Time SB+rfionceonce-poonceonces 0.01
+13 Hash=40de8418c4b395388f6501cafd1ed38d
+
+What if you would like to know the value of a particular global variable
+at some particular point in a given process's execution? One approach
+is to use a READ_ONCE() to load that global variable into a new local
+variable, then add that local variable to the "locations" clause.
+But be careful: In some litmus tests, adding a READ_ONCE() will change
+the outcome! For one example, please see the C-READ_ONCE.litmus and
+C-READ_ONCE-omitted.litmus tests located here:
+
+ https://github.com/paulmckrcu/litmus/blob/master/manual/kernel/
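+
+As a sketch of this approach (a purely hypothetical modification, and
+one that might itself change the test's outcome, per the warning above),
+P1() of the test above could capture the value of "x" that it sees on
+entry into a new local variable "r9", which could then be added to the
+"locations" clause:
+
+	P1(int *x, int *y)
+	{
+		int r3;
+		int r4;
+		int r9;
+
+		r9 = READ_ONCE(*x); /* Debug only: snapshot "x" here. */
+		WRITE_ONCE(*y, 1);
+		r3 = READ_ONCE(*y);
+		r4 = READ_ONCE(*x);
+	}
+
+	locations [0:r1; 1:r3; 1:r9; x; y]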
+
+
+Spin Loops
+----------
+
+The analysis carried out by herd7 explores the full state space, which
+is at best of exponential time complexity.  Adding processes and
+increasing the amount of code in a given process can greatly increase
+execution time.
+Potentially infinite loops, such as those used to wait for locks to
+become available, are clearly problematic.
+
+Fortunately, it is possible to avoid state-space explosion by specially
+modeling such loops.  For example, the following litmus test emulates
+locking using xchg_acquire(), but instead of enclosing xchg_acquire()
+in a spin loop, it instead excludes executions that fail to acquire the
+lock using a herd7 "filter" clause. Note that for exclusive locking, you
+are better off using the spin_lock() and spin_unlock() that LKMM directly
+models, if for no other reason than that these are much faster.  However, the
+techniques illustrated in this section can be used for other purposes,
+such as emulating reader-writer locking, which LKMM does not yet model.
+
+ 1 C C-SB+l-o-o-u+l-o-o-u-X
+ 2
+ 3 {
+ 4 }
+ 5
+ 6 P0(int *sl, int *x0, int *x1)
+ 7 {
+ 8 int r2;
+ 9 int r1;
+10
+11 r2 = xchg_acquire(sl, 1);
+12 WRITE_ONCE(*x0, 1);
+13 r1 = READ_ONCE(*x1);
+14 smp_store_release(sl, 0);
+15 }
+16
+17 P1(int *sl, int *x0, int *x1)
+18 {
+19 int r2;
+20 int r1;
+21
+22 r2 = xchg_acquire(sl, 1);
+23 WRITE_ONCE(*x1, 1);
+24 r1 = READ_ONCE(*x0);
+25 smp_store_release(sl, 0);
+26 }
+27
+28 filter (0:r2=0 /\ 1:r2=0)
+29 exists (0:r1=0 /\ 1:r1=0)
+
+This litmus test may be found here:
+
+https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/CodeSamples/formal/herd/C-SB+l-o-o-u+l-o-o-u-X.litmus
+
+This test uses two global variables, "x0" and "x1", and also emulates a
+single global spinlock named "sl". This spinlock is held by whichever
+process changes the value of "sl" from "0" to "1", and is released when
+that process sets "sl" back to "0". P0()'s lock acquisition is emulated
+on line 11 using xchg_acquire(), which unconditionally stores the value
+"1" to "sl" and stores either "0" or "1" to "r2", depending on whether
+the lock acquisition was successful or unsuccessful (due to "sl" already
+having the value "1"), respectively. P1() operates in a similar manner.
+
+Rather unconventionally, execution appears to proceed to the critical
+section on lines 12 and 13 in either case. Line 14 then uses an
+smp_store_release() to store zero to "sl", thus emulating lock release.
+
+The case where xchg_acquire() fails to acquire the lock is handled by
+the "filter" clause on line 28, which tells herd7 to keep only those
+executions in which both "0:r2" and "1:r2" are zero, that is, to pay
+attention only to those executions in which both lock acquisitions
+actually succeed.  Thus, the bogus executions that would have executed
+the critical sections without actually holding the lock are discarded
+and any effects that they might have had are
+ignored. Note well that the "filter" clause keeps those executions
+for which its expression is satisfied, that is, for which the expression
+evaluates to true. In other words, the "filter" clause says what to
+keep, not what to discard.
+
+The result of running this test is as follows:
+
+ 1 Test C-SB+l-o-o-u+l-o-o-u-X Allowed
+ 2 States 2
+ 3 0:r1=0; 1:r1=1;
+ 4 0:r1=1; 1:r1=0;
+ 5 No
+ 6 Witnesses
+ 7 Positive: 0 Negative: 2
+ 8 Condition exists (0:r1=0 /\ 1:r1=0)
+ 9 Observation C-SB+l-o-o-u+l-o-o-u-X Never 0 2
+10 Time C-SB+l-o-o-u+l-o-o-u-X 0.03
+
+The "Never" on line 9 indicates that this use of xchg_acquire() and
+smp_store_release() really does correctly emulate locking.
+
+Why doesn't the litmus test take the simpler approach of using a spin loop
+to handle failed spinlock acquisitions, like the kernel does? The key
+insight behind this litmus test is that spin loops have no effect on the
+possible "exists"-clause outcomes of program execution in the absence
+of deadlock. In other words, given a high-quality lock-acquisition
+primitive in a deadlock-free program running on high-quality hardware,
+each lock acquisition will eventually succeed. Because herd7 already
+explores the full state space, the length of time required to actually
+acquire the lock does not matter. After all, herd7 already models all
+possible durations of the xchg_acquire() statements.
+
+Why not just fold the "filter" condition into the "exists" clause, thus
+avoiding the "filter" clause entirely? This does work, but is slower.
+The reason that the "filter" clause is faster is that (in the common case)
+herd7 knows to abandon an execution as soon as the "filter" expression
+fails to be satisfied. In contrast, the "exists" clause is evaluated
+only at the end of time, thus requiring herd7 to waste time on bogus
+executions in which both critical sections proceed concurrently. In
+addition, some LKMM users like the separation of concerns provided by
+using both the "filter" and "exists" clauses.
+
+Readers lacking a pathological interest in odd corner cases should feel
+free to skip the remainder of this section.
+
+But what if the litmus test were to temporarily set "1:r2" to a non-zero
+value? Wouldn't that cause herd7 to abandon the execution prematurely
+due to an early mismatch of the "filter" clause?
+
+Why not just try it? Line 4 of the following modified litmus test
+introduces a new global variable "x2" that is initialized to "1". Line 23
+of P1() reads that variable into "1:r2" to force an early mismatch with
+the "filter" clause. Line 24 does a known-true "if" condition to avoid
+and static analysis that herd7 might do. Finally the "exists" clause
+on line 32 is updated to a condition that is alway satisfied at the end
+of the test.
+
+ 1 C C-SB+l-o-o-u+l-o-o-u-X
+ 2
+ 3 {
+ 4 x2=1;
+ 5 }
+ 6
+ 7 P0(int *sl, int *x0, int *x1)
+ 8 {
+ 9 int r2;
+10 int r1;
+11
+12 r2 = xchg_acquire(sl, 1);
+13 WRITE_ONCE(*x0, 1);
+14 r1 = READ_ONCE(*x1);
+15 smp_store_release(sl, 0);
+16 }
+17
+18 P1(int *sl, int *x0, int *x1, int *x2)
+19 {
+20 int r2;
+21 int r1;
+22
+23 r2 = READ_ONCE(*x2);
+24 if (r2)
+25 r2 = xchg_acquire(sl, 1);
+26 WRITE_ONCE(*x1, 1);
+27 r1 = READ_ONCE(*x0);
+28 smp_store_release(sl, 0);
+29 }
+30
+31 filter (0:r2=0 /\ 1:r2=0)
+32 exists (x1=1)
+
+If the "filter" clause were to check each variable at each point in the
+execution, running this litmus test would display no executions because
+all executions would be filtered out at line 23. However, the output
+is instead as follows:
+
+ 1 Test C-SB+l-o-o-u+l-o-o-u-X Allowed
+ 2 States 1
+ 3 x1=1;
+ 4 Ok
+ 5 Witnesses
+ 6 Positive: 2 Negative: 0
+ 7 Condition exists (x1=1)
+ 8 Observation C-SB+l-o-o-u+l-o-o-u-X Always 2 0
+ 9 Time C-SB+l-o-o-u+l-o-o-u-X 0.04
+10 Hash=080bc508da7f291e122c6de76c0088e3
+
+Line 3 shows that there is one execution that did not get filtered out,
+so the "filter" clause is evaluated only on the last assignment to
+the variables that it checks. In this case, the "filter" clause is a
+conjunction, so it might be evaluated twice, once at the final (and only)
+assignment to "0:r2" and once at the final assignment to "1:r2".
+
+
+Linked Lists
+------------
+
+LKMM can handle linked lists, but only linked lists in which each node
+contains nothing except a pointer to the next node in the list. This is
+of course quite restrictive, but there is nevertheless quite a bit that
+can be done within these confines, as can be seen in the litmus test
+at tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus:
+
+ 1 C MP+onceassign+derefonce
+ 2
+ 3 {
+ 4 y=z;
+ 5 z=0;
+ 6 }
+ 7
+ 8 P0(int *x, int **y)
+ 9 {
+10 WRITE_ONCE(*x, 1);
+11 rcu_assign_pointer(*y, x);
+12 }
+13
+14 P1(int *x, int **y)
+15 {
+16 int *r0;
+17 int r1;
+18
+19 rcu_read_lock();
+20 r0 = rcu_dereference(*y);
+21 r1 = READ_ONCE(*r0);
+22 rcu_read_unlock();
+23 }
+24
+25 exists (1:r0=x /\ 1:r1=0)
+
+Line 4's "y=z" may seem odd, given that "z" has not yet been initialized.
+But "y=z" does not set the value of "y" to that of "z", but instead
+sets the value of "y" to the *address* of "z". Lines 4 and 5 therefore
+create a simple linked list, with "y" pointing to "z" and "z" having a
+NULL pointer. A much longer linked list could be created if desired,
+and circular singly linked lists can also be created and manipulated.
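+
+For example, an initialization section along the following lines (a
+sketch using hypothetical variables "a", "b", and "c") would create a
+three-node list with "a" pointing to "b", "b" pointing to "c", and "c"
+holding a NULL pointer:
+
+	{
+	a=b;
+	b=c;
+	c=0;
+	}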
+
+The "exists" clause works the same way, with the "1:r0=x" comparing P1()'s
+"r0" not to the value of "x", but again to its address. This term of the
+"exists" clause therefore tests whether line 20's load from "y" saw the
+value stored by line 11, which is in fact what is required in this case.
+
+P0()'s line 10 initializes "x" to the value 1, then line 11 links to "x"
+from "y", replacing "z".
+
+P1()'s line 20 loads a pointer from "y", and line 21 dereferences that
+pointer. The RCU read-side critical section spanning lines 19-22 is just
+for show in this example. Note that the address used for line 21's load
+depends on (in this case, "is exactly the same as") the value loaded by
+line 20. This is an example of what is called an "address dependency".
+This particular address dependency extends from the load on line 20 to the
+load on line 21. Address dependencies provide a weak form of ordering.
+
+Running this test results in the following:
+
+ 1 Test MP+onceassign+derefonce Allowed
+ 2 States 2
+ 3 1:r0=x; 1:r1=1;
+ 4 1:r0=z; 1:r1=0;
+ 5 No
+ 6 Witnesses
+ 7 Positive: 0 Negative: 2
+ 8 Condition exists (1:r0=x /\ 1:r1=0)
+ 9 Observation MP+onceassign+derefonce Never 0 2
+10 Time MP+onceassign+derefonce 0.00
+11 Hash=49ef7a741563570102448a256a0c8568
+
+The only possible outcomes feature P1() loading a pointer to "z"
+(which contains zero) on the one hand and P1() loading a pointer to "x"
+(which contains the value one) on the other. This should be reassuring
+because it says that RCU readers cannot see the old preinitialization
+values when accessing a newly inserted list node. This undesirable
+scenario is flagged by the "exists" clause, and would occur if P1()
+loaded a pointer to "x", but obtained the pre-initialization value of
+zero after dereferencing that pointer.
+
+
+Comments
+--------
+
+Different portions of a litmus test are processed by different parsers,
+which has the charming effect of requiring different comment syntax in
+different portions of the litmus test. The C-syntax portions use
+C-language comments (either "/* */" or "//"), while the other portions
+use Ocaml comments "(* *)".
+
+The following litmus test illustrates the comment style corresponding
+to each syntactic unit of the test:
+
+ 1 C MP+onceassign+derefonce (* A *)
+ 2
+ 3 (* B *)
+ 4
+ 5 {
+ 6 y=z; (* C *)
+ 7 z=0;
+ 8 } // D
+ 9
+10 // E
+11
+12 P0(int *x, int **y) // F
+13 {
+14 WRITE_ONCE(*x, 1); // G
+15 rcu_assign_pointer(*y, x);
+16 }
+17
+18 // H
+19
+20 P1(int *x, int **y)
+21 {
+22 int *r0;
+23 int r1;
+24
+25 rcu_read_lock();
+26 r0 = rcu_dereference(*y);
+27 r1 = READ_ONCE(*r0);
+28 rcu_read_unlock();
+29 }
+30
+31 // I
+32
+33 exists (* J *) (1:r0=x /\ (* K *) 1:r1=0) (* L *)
+
+In short, use C-language comments in the C code and Ocaml comments in
+the rest of the litmus test.
+
+On the other hand, if you prefer C-style comments everywhere, the
+C preprocessor is your friend.
+
+
+Asynchronous RCU Grace Periods
+------------------------------
+
+The following litmus test is derived from the example shown in
+Documentation/litmus-tests/rcu/RCU+sync+free.litmus, but converted to
+emulate call_rcu():
+
+ 1 C RCU+sync+free
+ 2
+ 3 {
+ 4 int x = 1;
+ 5 int *y = &x;
+ 6 int z = 1;
+ 7 }
+ 8
+ 9 P0(int *x, int *z, int **y)
+10 {
+11 int *r0;
+12 int r1;
+13
+14 rcu_read_lock();
+15 r0 = rcu_dereference(*y);
+16 r1 = READ_ONCE(*r0);
+17 rcu_read_unlock();
+18 }
+19
+20 P1(int *z, int **y, int *c)
+21 {
+22 rcu_assign_pointer(*y, z);
+23 smp_store_release(*c, 1); // Emulate call_rcu().
+24 }
+25
+26 P2(int *x, int *z, int **y, int *c)
+27 {
+28 int r0;
+29
+30 r0 = smp_load_acquire(*c); // Note call_rcu() request.
+31 synchronize_rcu(); // Wait one grace period.
+32 WRITE_ONCE(*x, 0); // Emulate the RCU callback.
+33 }
+34
+35 filter (2:r0=1) (* Reject too-early starts. *)
+36 exists (0:r0=x /\ 0:r1=0)
+
+Lines 4-6 initialize a linked list headed by "y" that initially contains
+"x". In addition, "z" is pre-initialized to prepare for P1(), which
+will replace "x" with "z" in this list.
+
+P0() on lines 9-18 enters an RCU read-side critical section, loads the
+list header "y" and dereferences it, leaving the node in "0:r0" and
+the node's value in "0:r1".
+
+P1() on lines 20-24 updates the list header to instead reference "z",
+then emulates call_rcu() by doing a release store into "c".
+
+P2() on lines 26-33 emulates the behind-the-scenes effect of doing a
+call_rcu(). Line 30 first does an acquire load from "c", then line 31
+waits for an RCU grace period to elapse, and finally line 32 emulates
+the RCU callback, which in turn emulates a call to kfree().
+
+Of course, it is possible for P2() to start too soon, so that the
+value of "2:r0" is zero rather than the required value of "1".
+The "filter" clause on line 35 handles this possibility, rejecting
+all executions in which "2:r0" is not equal to the value "1".
+
+
+Performance
+-----------
+
+LKMM's exploration of the full state-space can be extremely helpful,
+but it does not come for free. The price is exponential computational
+complexity in terms of the number of processes, the average number
+of statements in each process, and the total number of stores in the
+litmus test.
+
+So it is best to start small and then work up. Where possible, break
+your code down into small pieces each representing a core concurrency
+requirement.
+
+That said, herd7 is quite fast. On an unprepossessing x86 laptop, it
+was able to analyze the following 10-process RCU litmus test in about
+six seconds.
+
+https://github.com/paulmckrcu/litmus/blob/master/auto/C-RW-R+RW-R+RW-G+RW-G+RW-G+RW-G+RW-R+RW-R+RW-R+RW-R.litmus
+
+One way to make herd7 run faster is to use the "-speedcheck true" option.
+This option prevents herd7 from generating all possible end states,
+instead causing it to focus solely on whether or not the "exists"
+clause can be satisfied. With this option, herd7 evaluates the above
+litmus test in about 300 milliseconds, for more than an order of magnitude
+improvement in performance.
+
+Larger 16-process litmus tests that would normally consume 15 minutes
+of time complete in about 40 seconds with this option. To be fair,
+you do get an extra 65,535 states when you leave off the "-speedcheck
+true" option.
+
+https://github.com/paulmckrcu/litmus/blob/master/auto/C-RW-R+RW-R+RW-G+RW-G+RW-G+RW-G+RW-R+RW-R+RW-R+RW-R+RW-G+RW-G+RW-G+RW-G+RW-R+RW-R.litmus
+
+Nevertheless, litmus-test analysis really is of exponential complexity,
+whether with or without "-speedcheck true". Increasing by just three
+processes to a 19-process litmus test requires 2 hours and 40 minutes
+without, and about 8 minutes with "-speedcheck true". Each of these
+results represents roughly an order of magnitude slowdown compared to the
+16-process litmus test. Again, to be fair, the multi-hour run explores
+no fewer than 524,287 additional states compared to the shorter one.
+
+https://github.com/paulmckrcu/litmus/blob/master/auto/C-RW-R+RW-R+RW-G+RW-G+RW-G+RW-G+RW-R+RW-R+RW-R+RW-R+RW-R+RW-R+RW-G+RW-G+RW-G+RW-G+RW-R+RW-R+RW-R.litmus
+
+If you don't like command-line arguments, you can obtain a similar speedup
+by adding a "filter" clause with exactly the same expression as your
+"exists" clause.
+
+However, please note that seeing the full set of states can be extremely
+helpful when developing and debugging litmus tests.
+
+
+LIMITATIONS
+===========
+
+Limitations of the Linux-kernel memory model (LKMM) include:
+
+1. Compiler optimizations are not accurately modeled. Of course,
+ the use of READ_ONCE() and WRITE_ONCE() limits the compiler's
+ ability to optimize, but under some circumstances it is possible
+ for the compiler to undermine the memory model. For more
+ information, see Documentation/explanation.txt (in particular,
+ the "THE PROGRAM ORDER RELATION: po AND po-loc" and "A WARNING"
+ sections).
+
+ Note that this limitation in turn limits LKMM's ability to
+ accurately model address, control, and data dependencies.
+ For example, if the compiler can deduce the value of some variable
+ carrying a dependency, then the compiler can break that dependency
+ by substituting a constant of that value.
+
+ Conversely, LKMM will sometimes overestimate the amount of
+ reordering compilers and CPUs can carry out, leading it to miss
+ some pretty obvious cases of ordering. A simple example is:
+
+ r1 = READ_ONCE(x);
+ if (r1 == 0)
+ smp_mb();
+ WRITE_ONCE(y, 1);
+
+ The WRITE_ONCE() does not depend on the READ_ONCE(), and as a
+ result, LKMM does not claim ordering. However, even though no
+ dependency is present, the WRITE_ONCE() will not be executed before
+ the READ_ONCE(). There are two reasons for this:
+
+ The presence of the smp_mb() in one of the branches
+ prevents the compiler from moving the WRITE_ONCE()
+ up before the "if" statement, since the compiler has
+ to assume that r1 will sometimes be 0 (but see the
+ comment below);
+
+ CPUs do not execute stores before po-earlier conditional
+ branches, even in cases where the store occurs after the
+ two arms of the branch have recombined.
+
+ It is clear that it is not dangerous in the slightest for LKMM to
+ make weaker guarantees than architectures. In fact, it is
+ desirable, as it gives compilers room for making optimizations.
+ For instance, suppose that a 0 value in r1 would trigger undefined
+ behavior elsewhere. Then a clever compiler might deduce that r1
+ can never be 0 in the if condition. As a result, said clever
+ compiler might deem it safe to optimize away the smp_mb(),
+ eliminating the branch and any ordering an architecture would
+ guarantee otherwise.
+
+2. Multiple access sizes for a single variable are not supported,
+ and neither are misaligned or partially overlapping accesses.
+
+3. Exceptions and interrupts are not modeled. In some cases,
+ this limitation can be overcome by modeling the interrupt or
+ exception with an additional process.
+
+4. I/O such as MMIO or DMA is not supported.
+
+5. Self-modifying code (such as that found in the kernel's
+ alternatives mechanism, function tracer, Berkeley Packet Filter
+ JIT compiler, and module loader) is not supported.
+
+6. Complete modeling of all variants of atomic read-modify-write
+ operations, locking primitives, and RCU is not provided.
+ For example, call_rcu() and rcu_barrier() are not supported.
+ However, a substantial amount of support is provided for these
+ operations, as shown in the linux-kernel.def file.
+
+ Here are specific limitations:
+
+ a. When rcu_assign_pointer() is passed NULL, the Linux
+ kernel provides no ordering, but LKMM models this
+ case as a store release.
+
+ b. The "unless" RMW operations are not currently modeled:
+ atomic_long_add_unless(), atomic_inc_unless_negative(),
+ and atomic_dec_unless_positive(). These can be emulated
+ in litmus tests, for example, by using atomic_cmpxchg().
+
+		One exception to this limitation is atomic_add_unless(),
+		which is provided directly by herd7 (so it has no
+		corresponding definition in linux-kernel.def).  Because
+		atomic_add_unless() is modeled directly by herd7, it can
+		be used in litmus tests.
+
+ c. The call_rcu() function is not modeled. As was shown above,
+ it can be emulated in litmus tests by adding another
+ process that invokes synchronize_rcu() and the body of the
+ callback function, with (for example) a release-acquire
+ from the site of the emulated call_rcu() to the beginning
+ of the additional process.
+
+ d. The rcu_barrier() function is not modeled. It can be
+		emulated in litmus tests that emulate call_rcu() via
+		(for example) a release-acquire from the end of each
+		additional call_rcu() process to the site of the
+		emulated rcu_barrier().
+
+ e. Reader-writer locking is not modeled. It can be
+ emulated in litmus tests using atomic read-modify-write
+ operations.
+
+The fragment of the C language supported by these litmus tests is quite
+limited and in some ways non-standard:
+
+1. There is no automatic C-preprocessor pass. You can of course
+ run it manually, if you choose.
+
+2. There is no way to create functions other than the Pn() functions
+ that model the concurrent processes.
+
+3. The Pn() functions' formal parameters must be pointers to the
+ global shared variables. Nothing can be passed by value into
+ these functions.
+
+4. The only functions that can be invoked are those built directly
+ into herd7 or that are defined in the linux-kernel.def file.
+
+5. The "switch", "do", "for", "while", and "goto" C statements are
+ not supported. The "switch" statement can be emulated by the
+ "if" statement. The "do", "for", and "while" statements can
+	often be emulated by manually unrolling the loop (see the sketch
+	following this list), or perhaps by
+ enlisting the aid of the C preprocessor to minimize the resulting
+ code duplication. Some uses of "goto" can be emulated by "if",
+ and some others by unrolling.
+
+6. Although you can use a wide variety of types in litmus-test
+ variable declarations, and especially in global-variable
+ declarations, the "herd7" tool understands only int and
+ pointer types. There is no support for floating-point types,
+ enumerations, characters, strings, arrays, or structures.
+
+7. Parsing of variable declarations is very loose, with almost no
+ type checking.
+
+8. Initializers differ from their C-language counterparts.
+ For example, when an initializer contains the name of a shared
+ variable, that name denotes a pointer to that variable, not
+ the current value of that variable. For example, "int x = y"
+ is interpreted the way "int x = &y" would be in C.
+
+9. Dynamic memory allocation is not supported, although this can
+ be worked around in some cases by supplying multiple statically
+ allocated variables.
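+
+As a sketch of the loop-unrolling approach mentioned in item 5 above,
+a hypothetical two-iteration loop that reads from "x" might simply be
+written out longhand in a process body:
+
+	/* Manually unrolled equivalent of a two-iteration read loop. */
+	r0 = READ_ONCE(*x);
+	r1 = READ_ONCE(*x);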
+
+Some of these limitations may be overcome in the future, but others are
+more likely to be addressed by incorporating the Linux-kernel memory model
+into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.
diff --git a/tools/memory-model/Documentation/locking.txt b/tools/memory-model/Documentation/locking.txt
new file mode 100644
index 000000000000..65c898c64a93
--- /dev/null
+++ b/tools/memory-model/Documentation/locking.txt
@@ -0,0 +1,298 @@
+Locking
+=======
+
+Locking is well-known and the common use cases are straightforward: Any
+CPU holding a given lock sees any changes previously seen or made by any
+CPU before it previously released that same lock. This last sentence
+is the only part of this document that most developers will need to read.
+
+However, developers who would like to also access lock-protected shared
+variables outside of their corresponding locks should continue reading.
+
+
+Locking and Prior Accesses
+--------------------------
+
+The basic rule of locking is worth repeating:
+
+ Any CPU holding a given lock sees any changes previously seen
+ or made by any CPU before it previously released that same lock.
+
+Note that this statement is a bit stronger than "Any CPU holding a
+given lock sees all changes made by any CPU during the time that CPU was
+previously holding this same lock". For example, consider the following
+pair of code fragments:
+
+ /* See MP+polocks.litmus. */
+ void CPU0(void)
+ {
+ WRITE_ONCE(x, 1);
+ spin_lock(&mylock);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&mylock);
+ }
+
+ void CPU1(void)
+ {
+ spin_lock(&mylock);
+ r0 = READ_ONCE(y);
+ spin_unlock(&mylock);
+ r1 = READ_ONCE(x);
+ }
+
+The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
+then both r0 and r1 must be set to the value 1. This also has the
+consequence that if the final value of r0 is equal to 1, then the final
+value of r1 must also be equal to 1. In contrast, the weaker rule would
+say nothing about the final value of r1.
+
+
+Locking and Subsequent Accesses
+-------------------------------
+
+The converse to the basic rule also holds: Any CPU holding a given
+lock will not see any changes that will be made by any CPU after it
+subsequently acquires this same lock. This converse statement is
+illustrated by the following litmus test:
+
+ /* See MP+porevlocks.litmus. */
+ void CPU0(void)
+ {
+ r0 = READ_ONCE(y);
+ spin_lock(&mylock);
+ r1 = READ_ONCE(x);
+ spin_unlock(&mylock);
+ }
+
+ void CPU1(void)
+ {
+ spin_lock(&mylock);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&mylock);
+ WRITE_ONCE(y, 1);
+ }
+
+This converse to the basic rule guarantees that if CPU0() acquires
+mylock before CPU1(), then both r0 and r1 must be set to the value 0.
+This also has the consequence that if the final value of r1 is equal
+to 0, then the final value of r0 must also be equal to 0. In contrast,
+the weaker rule would say nothing about the final value of r0.
+
+These examples show only a single pair of CPUs, but the effects of the
+basic rule of locking extend across multiple acquisitions of a given lock
+across multiple CPUs.
+
+
+Double-Checked Locking
+----------------------
+
+It is well known that more than just a lock is required to make
+double-checked locking work correctly.  This litmus test illustrates
+one incorrect approach:
+
+ /* See Documentation/litmus-tests/locking/DCL-broken.litmus. */
+ void CPU0(void)
+ {
+ r0 = READ_ONCE(flag);
+ if (r0 == 0) {
+ spin_lock(&lck);
+ r1 = READ_ONCE(flag);
+ if (r1 == 0) {
+ WRITE_ONCE(data, 1);
+ WRITE_ONCE(flag, 1);
+ }
+ spin_unlock(&lck);
+ }
+ r2 = READ_ONCE(data);
+ }
+	/* CPU1() is exactly the same as CPU0(). */
+
+There are two problems. First, there is no ordering between the first
+READ_ONCE() of "flag" and the READ_ONCE() of "data". Second, there is
+no ordering between the two WRITE_ONCE() calls. It should therefore be
+no surprise that "r2" can be zero, and a quick herd7 run confirms this.
+
+One way to fix this is to use smp_load_acquire() and smp_store_release()
+as shown in this corrected version:
+
+ /* See Documentation/litmus-tests/locking/DCL-fixed.litmus. */
+ void CPU0(void)
+ {
+ r0 = smp_load_acquire(&flag);
+ if (r0 == 0) {
+ spin_lock(&lck);
+ r1 = READ_ONCE(flag);
+ if (r1 == 0) {
+ WRITE_ONCE(data, 1);
+ smp_store_release(&flag, 1);
+ }
+ spin_unlock(&lck);
+ }
+ r2 = READ_ONCE(data);
+ }
+	/* CPU1() is exactly the same as CPU0(). */
+
+The smp_load_acquire() guarantees that its load from "flag" will
+be ordered before the READ_ONCE() from "data", thus solving the first
+problem. The smp_store_release() guarantees that its store will be
+ordered after the WRITE_ONCE() to "data", solving the second problem.
+The smp_store_release() pairs with the smp_load_acquire(), thus ensuring
+that the ordering provided by each actually takes effect. Again, a
+quick herd7 run confirms this.
+
+In short, if you access a lock-protected variable without holding the
+corresponding lock, you will need to provide additional ordering, in
+this case, via the smp_load_acquire() and the smp_store_release().
+
+
+Ordering Provided by a Lock to CPUs Not Holding That Lock
+---------------------------------------------------------
+
+It is not necessarily the case that accesses ordered by locking will be
+seen as ordered by CPUs not holding that lock. Consider this example:
+
+ /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
+ void CPU0(void)
+ {
+ spin_lock(&mylock);
+ WRITE_ONCE(x, 1);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&mylock);
+ }
+
+ void CPU1(void)
+ {
+ spin_lock(&mylock);
+ r0 = READ_ONCE(y);
+ WRITE_ONCE(z, 1);
+ spin_unlock(&mylock);
+ }
+
+ void CPU2(void)
+ {
+ WRITE_ONCE(z, 2);
+ smp_mb();
+ r1 = READ_ONCE(x);
+ }
+
+Counter-intuitive though it might be, it is quite possible to have
+the final value of r0 be 1, the final value of z be 2, and the final
+value of r1 be 0. The reason for this surprising outcome is that CPU2()
+never acquired the lock, and thus did not fully benefit from the lock's
+ordering properties.
+
+Ordering can be extended to CPUs not holding the lock by careful use
+of smp_mb__after_spinlock():
+
+ /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
+ void CPU0(void)
+ {
+ spin_lock(&mylock);
+ WRITE_ONCE(x, 1);
+ WRITE_ONCE(y, 1);
+ spin_unlock(&mylock);
+ }
+
+ void CPU1(void)
+ {
+ spin_lock(&mylock);
+ smp_mb__after_spinlock();
+ r0 = READ_ONCE(y);
+ WRITE_ONCE(z, 1);
+ spin_unlock(&mylock);
+ }
+
+ void CPU2(void)
+ {
+ WRITE_ONCE(z, 2);
+ smp_mb();
+ r1 = READ_ONCE(x);
+ }
+
+This addition of smp_mb__after_spinlock() strengthens the lock
+acquisition sufficiently to rule out the counter-intuitive outcome.
+In other words, the addition of the smp_mb__after_spinlock() prohibits
+the counter-intuitive result where the final value of r0 is 1, the final
+value of z is 2, and the final value of r1 is 0.
+
+
+No Roach-Motel Locking!
+-----------------------
+
+This example requires familiarity with the herd7 "filter" clause, so
+please read up on that topic in litmus-tests.txt.
+
+It is tempting to allow memory-reference instructions to be pulled
+into a critical section, but this cannot be allowed in the general case.
+For example, consider a spin loop preceding a lock-based critical section.
+Now, herd7 does not model spin loops, but we can emulate one with two
+loads, with a "filter" clause to constrain the first to return the
+initial value and the second to return the updated value, as shown below:
+
+ /* See Documentation/litmus-tests/locking/RM-fixed.litmus. */
+ void CPU0(void)
+ {
+ spin_lock(&lck);
+ r2 = atomic_inc_return(&y);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&lck);
+ }
+
+ void CPU1(void)
+ {
+ r0 = READ_ONCE(x);
+ r1 = READ_ONCE(x);
+ spin_lock(&lck);
+ r2 = atomic_inc_return(&y);
+ spin_unlock(&lck);
+ }
+
+ filter (1:r0=0 /\ 1:r1=1)
+ exists (1:r2=1)
+
+The variable "x" is the control variable for the emulated spin loop.
+CPU0() sets it to "1" while holding the lock, and CPU1() emulates the
+spin loop by reading it twice, first into "1:r0" (which should get the
+initial value "0") and then into "1:r1" (which should get the updated
+value "1").
+
+The "filter" clause takes this into account, constraining "1:r0" to
+equal "0" and "1:r1" to equal 1.
+
+Then the "exists" clause checks to see if CPU1() acquired its lock first,
+which should not happen given the filter clause because CPU0() updates
+"x" while holding the lock. And herd7 confirms this.
+
+But suppose that the compiler was permitted to reorder the spin loop
+into CPU1()'s critical section, like this:
+
+ /* See Documentation/litmus-tests/locking/RM-broken.litmus. */
+ void CPU0(void)
+ {
+ int r2;
+
+ spin_lock(&lck);
+ r2 = atomic_inc_return(&y);
+ WRITE_ONCE(x, 1);
+ spin_unlock(&lck);
+ }
+
+ void CPU1(void)
+ {
+ spin_lock(&lck);
+ r0 = READ_ONCE(x);
+ r1 = READ_ONCE(x);
+ r2 = atomic_inc_return(&y);
+ spin_unlock(&lck);
+ }
+
+ filter (1:r0=0 /\ 1:r1=1)
+ exists (1:r2=1)
+
+If "1:r0" is equal to "0", "1:r1" can never equal "1" because CPU0()
+cannot update "x" while CPU1() holds the lock. And herd7 confirms this,
+showing zero executions matching the "filter" criteria.
+
+And this is why Linux-kernel lock and unlock primitives must prevent
+code from entering critical sections. It is not sufficient to only
+prevent code from leaving them.
diff --git a/tools/memory-model/Documentation/ordering.txt b/tools/memory-model/Documentation/ordering.txt
new file mode 100644
index 000000000000..9b0949d3f5ec
--- /dev/null
+++ b/tools/memory-model/Documentation/ordering.txt
@@ -0,0 +1,556 @@
+This document gives an overview of the categories of memory-ordering
+operations provided by the Linux-kernel memory model (LKMM).
+
+
+Categories of Ordering
+======================
+
+This section lists LKMM's three top-level categories of memory-ordering
+operations in decreasing order of strength:
+
+1. Barriers (also known as "fences"). A barrier orders some or
+ all of the CPU's prior operations against some or all of its
+ subsequent operations.
+
+2. Ordered memory accesses. These operations order themselves
+ against some or all of the CPU's prior accesses or some or all
+ of the CPU's subsequent accesses, depending on the subcategory
+ of the operation.
+
+3. Unordered accesses, as the name indicates, have no ordering
+ properties except to the extent that they interact with an
+ operation in the previous categories. This being the real world,
+ some of these "unordered" operations provide limited ordering
+ in some special situations.
+
+Each of the above categories is described in more detail by one of the
+following sections.
+
+
+Barriers
+========
+
+Each of the following categories of barriers is described in its own
+subsection below:
+
+a. Full memory barriers.
+
+b. Read-modify-write (RMW) ordering augmentation barriers.
+
+c. Write memory barrier.
+
+d. Read memory barrier.
+
+e. Compiler barrier.
+
+Note well that many of these primitives generate absolutely no code
+in kernels built with CONFIG_SMP=n. Therefore, if you are writing
+a device driver, which must correctly order accesses to a physical
+device even in kernels built with CONFIG_SMP=n, please use the
+ordering primitives provided for that purpose. For example, instead of
+smp_mb(), use mb(). See the "Linux Kernel Device Drivers" book or the
+https://lwn.net/Articles/698014/ article for more information.
+
+
+Full Memory Barriers
+--------------------
+
+The Linux-kernel primitives that provide full ordering include:
+
+o The smp_mb() full memory barrier.
+
+o Value-returning RMW atomic operations whose names do not end in
+ _acquire, _release, or _relaxed.
+
+o RCU's grace-period primitives.
+
+First, the smp_mb() full memory barrier orders all of the CPU's prior
+accesses against all subsequent accesses from the viewpoint of all CPUs.
+In other words, all CPUs will agree that any earlier action taken
+by that CPU happened before any later action taken by that same CPU.
+For example, consider the following:
+
+ WRITE_ONCE(x, 1);
+ smp_mb(); // Order store to x before load from y.
+ r1 = READ_ONCE(y);
+
+All CPUs will agree that the store to "x" happened before the load
+from "y", as indicated by the comment. And yes, please comment your
+memory-ordering primitives. It is surprisingly hard to remember their
+purpose after even a few months.
+
+Second, some RMW atomic operations provide full ordering. These
+operations include value-returning RMW atomic operations (that is, those
+with non-void return types) whose names do not end in _acquire, _release,
+or _relaxed. Examples include atomic_add_return(), atomic_dec_and_test(),
+cmpxchg(), and xchg(). Note that conditional RMW atomic operations such
+as cmpxchg() are only guaranteed to provide ordering when they succeed.
+When RMW atomic operations provide full ordering, they partition the
+CPU's accesses into three groups:
+
+1. All code that executed prior to the RMW atomic operation.
+
+2. The RMW atomic operation itself.
+
+3. All code that executed after the RMW atomic operation.
+
+All CPUs will agree that any operation in a given partition happened
+before any operation in a higher-numbered partition.
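+
+For example, in the following sketch (with variable names chosen purely
+for illustration), all CPUs will agree that the store to "x" happened
+before the atomic_inc_return(), which in turn happened before the store
+to "y":
+
+	WRITE_ONCE(x, 1);			// Group 1.
+	r0 = atomic_inc_return(&my_counter);	// Group 2: fully ordered RMW.
+	WRITE_ONCE(y, 1);			// Group 3.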
+
+In contrast, non-value-returning RMW atomic operations (that is, those
+with void return types) do not guarantee any ordering whatsoever. Nor do
+value-returning RMW atomic operations whose names end in _relaxed.
+Examples of the former include atomic_inc() and atomic_dec(),
+while examples of the latter include atomic_cmpxchg_relaxed() and
+atomic_xchg_relaxed(). Similarly, value-returning non-RMW atomic
+operations such as atomic_read() do not guarantee full ordering, and
+are covered in the later section on unordered operations.
+
+Value-returning RMW atomic operations whose names end in _acquire or
+_release provide limited ordering, and will be described later in this
+document.
+
+Finally, RCU's grace-period primitives provide full ordering. These
+primitives include synchronize_rcu(), synchronize_rcu_expedited(),
+synchronize_srcu() and so on. However, these primitives have orders
+of magnitude greater overhead than smp_mb(), atomic_xchg(), and so on.
+Furthermore, RCU's grace-period primitives can only be invoked in
+sleepable contexts. Therefore, RCU's grace-period primitives are
+typically instead used to provide ordering against RCU read-side critical
+sections, as documented in their comment headers. But of course if you
+need a synchronize_rcu() to interact with readers, it costs you nothing
+to also rely on its additional full-memory-barrier semantics. Just please
+carefully comment this, otherwise your future self will hate you.
+
+
+RMW Ordering Augmentation Barriers
+----------------------------------
+
+As noted in the previous section, non-value-returning RMW operations
+such as atomic_inc() and atomic_dec() guarantee no ordering whatsoever.
+Nevertheless, a number of popular CPU families, including x86, provide
+full ordering for these primitives. One way to obtain full ordering on
+all architectures is to add a call to smp_mb():
+
+ WRITE_ONCE(x, 1);
+ atomic_inc(&my_counter);
+ smp_mb(); // Inefficient on x86!!!
+ r1 = READ_ONCE(y);
+
+This works, but the added smp_mb() adds needless overhead for
+x86, on which atomic_inc() provides full ordering all by itself.
+The smp_mb__after_atomic() primitive can be used instead:
+
+ WRITE_ONCE(x, 1);
+ atomic_inc(&my_counter);
+ smp_mb__after_atomic(); // Order store to x before load from y.
+ r1 = READ_ONCE(y);
+
+The smp_mb__after_atomic() primitive emits code only on CPUs whose
+atomic_inc() implementations do not guarantee full ordering, thus
+incurring no unnecessary overhead on x86. There are a number of
+variations on the smp_mb__*() theme:
+
+o smp_mb__before_atomic(), which provides full ordering prior
+ to an unordered RMW atomic operation.
+
+o smp_mb__after_atomic(), which, as shown above, provides full
+ ordering subsequent to an unordered RMW atomic operation.
+
+o smp_mb__after_spinlock(), which provides full ordering subsequent
+ to a successful spinlock acquisition. Note that spin_lock() is
+ always successful but spin_trylock() might not be.
+
+o smp_mb__after_srcu_read_unlock(), which provides full ordering
+ subsequent to an srcu_read_unlock().
+
+It is bad practice to place code between the smp_mb__*() primitive and
+the operation whose ordering it is augmenting.  The reason is that the
+ordering of this intervening code will differ from one CPU architecture
+to another.
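+
+For example, in the following sketch, do_something() is a hypothetical
+function whose memory accesses are unrelated to the atomic_inc().
+The ordering of those accesses relative to the atomic_inc() differs
+from one architecture to another, so please keep the barrier adjacent
+to the operation whose ordering it augments:
+
+	WRITE_ONCE(x, 1);
+	smp_mb__before_atomic();
+	do_something();		// Bad: ordering here is architecture-dependent.
+	atomic_inc(&my_counter);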
+
+
+Write Memory Barrier
+--------------------
+
+The Linux kernel's write memory barrier is smp_wmb(). If a CPU executes
+the following code:
+
+ WRITE_ONCE(x, 1);
+ smp_wmb();
+ WRITE_ONCE(y, 1);
+
+Then any given CPU will see the write to "x" as having happened before
+the write to "y". However, you are usually better off using a release
+store, as described in the "Release Operations" section below.
+
+Note that smp_wmb() might fail to provide ordering for unmarked C-language
+stores because profile-driven optimization could determine that the
+value being overwritten is almost always equal to the new value. Such a
+compiler might then reasonably decide to transform "x = 1" and "y = 1"
+as follows:
+
+ if (x != 1)
+ x = 1;
+ smp_wmb(); // BUG: does not order the reads!!!
+ if (y != 1)
+ y = 1;
+
+Therefore, if you need to use smp_wmb() with unmarked C-language writes,
+you will need to make sure that none of the compilers used to build
+the Linux kernel carry out this sort of transformation, both now and in
+the future.
+
+
+Read Memory Barrier
+-------------------
+
+The Linux kernel's read memory barrier is smp_rmb(). If a CPU executes
+the following code:
+
+ r0 = READ_ONCE(y);
+ smp_rmb();
+ r1 = READ_ONCE(x);
+
+Then any given CPU will see the read from "y" as having preceded the read from
+"x". However, you are usually better off using an acquire load, as described
+in the "Acquire Operations" section below.
+
+Compiler Barrier
+----------------
+
+The Linux kernel's compiler barrier is barrier(). This primitive
+prohibits compiler code-motion optimizations that might move memory
+references across the point in the code containing the barrier(), but
+does not constrain hardware memory ordering. For example, this can be
+used to prevent the compiler from moving code across an infinite loop:
+
+ WRITE_ONCE(x, 1);
+ while (dontstop)
+ barrier();
+ r1 = READ_ONCE(y);
+
+Without the barrier(), the compiler would be within its rights to move the
+WRITE_ONCE() to follow the loop. This code motion could be problematic
+in the case where an interrupt handler terminates the loop. Another way
+to handle this is to use READ_ONCE() for the load of "dontstop".
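+
+That alternative might look like this (a sketch only):
+
+	WRITE_ONCE(x, 1);
+	while (READ_ONCE(dontstop))
+		;	// The volatile load keeps the WRITE_ONCE() above the loop.
+	r1 = READ_ONCE(y);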
+
+Note that the barriers discussed previously use barrier() or its low-level
+equivalent in their implementations.
+
+
+Ordered Memory Accesses
+=======================
+
+The Linux kernel provides a wide variety of ordered memory accesses:
+
+a. Release operations.
+
+b. Acquire operations.
+
+c. RCU read-side ordering.
+
+d. Control dependencies.
+
+Each of the above categories has its own section below.
+
+
+Release Operations
+------------------
+
+Release operations include smp_store_release(), atomic_set_release(),
+rcu_assign_pointer(), and value-returning RMW operations whose names
+end in _release. These operations order their own store against all
+of the CPU's prior memory accesses. Release operations often provide
+improved readability and performance compared to explicit barriers.
+For example, use of smp_store_release() saves a line compared to the
+smp_wmb() example above:
+
+ WRITE_ONCE(x, 1);
+ smp_store_release(&y, 1);
+
+More important, smp_store_release() makes it easier to connect up the
+different pieces of the concurrent algorithm. The variable stored to
+by the smp_store_release(), in this case "y", will normally be used in
+an acquire operation in other parts of the concurrent algorithm.
+
+To see the performance advantages, suppose that the above example read
+from "x" instead of writing to it. Then an smp_wmb() could not guarantee
+ordering, and an smp_mb() would be needed instead:
+
+ r1 = READ_ONCE(x);
+ smp_mb();
+ WRITE_ONCE(y, 1);
+
+But smp_mb() often incurs much higher overhead than does
+smp_store_release(), which still provides the needed ordering of "x"
+against "y". On x86, the version using smp_store_release() might compile
+to a simple load instruction followed by a simple store instruction.
+In contrast, the smp_mb() compiles to an expensive instruction that
+provides the needed ordering.
+
+There is a wide variety of release operations:
+
+o Store operations, including not only the aforementioned
+ smp_store_release(), but also atomic_set_release(), and
+ atomic_long_set_release().
+
+o RCU's rcu_assign_pointer() operation. This is the same as
+ smp_store_release() except that: (1) It takes the pointer to
+ be assigned to instead of a pointer to that pointer, (2) It
+ is intended to be used in conjunction with rcu_dereference()
+ and similar rather than smp_load_acquire(), and (3) It checks
+ for an RCU-protected pointer in "sparse" runs.
+
+o Value-returning RMW operations whose names end in _release,
+ such as atomic_fetch_add_release() and cmpxchg_release().
+ Note that release ordering is guaranteed only against the
+ memory-store portion of the RMW operation, and not against the
+ memory-load portion. Note also that conditional operations such
+ as cmpxchg_release() are only guaranteed to provide ordering
+ when they succeed.
+
+As mentioned earlier, release operations are often paired with acquire
+operations, which are the subject of the next section.
+
+
+Acquire Operations
+------------------
+
+Acquire operations include smp_load_acquire(), atomic_read_acquire(),
+and value-returning RMW operations whose names end in _acquire. These
+operations order their own load against all of the CPU's subsequent
+memory accesses. Acquire operations often provide improved performance
+and readability compared to explicit barriers. For example, use of
+smp_load_acquire() saves a line compared to the smp_rmb() example above:
+
+ r0 = smp_load_acquire(&y);
+ r1 = READ_ONCE(x);
+
+As with smp_store_release(), this also makes it easier to connect
+the different pieces of the concurrent algorithm by looking for the
+smp_store_release() that stores to "y". In addition, smp_load_acquire()
+improves upon smp_rmb() by ordering against subsequent stores as well
+as against subsequent loads.
+
+There are a couple of categories of acquire operations:
+
+o Load operations, including not only the aforementioned
+ smp_load_acquire(), but also atomic_read_acquire(), and
+ atomic64_read_acquire().
+
+o Value-returning RMW operations whose names end in _acquire,
+ such as atomic_xchg_acquire() and atomic_cmpxchg_acquire().
+ Note that acquire ordering is guaranteed only against the
+ memory-load portion of the RMW operation, and not against the
+ memory-store portion. Note also that conditional operations
+ such as atomic_cmpxchg_acquire() are only guaranteed to provide
+ ordering when they succeed.
+
+Symmetry being what it is, acquire operations are often paired with the
+release operations covered earlier. For example, consider the following
+example, where task0() and task1() execute concurrently:
+
+ void task0(void)
+ {
+ WRITE_ONCE(x, 1);
+ smp_store_release(&y, 1);
+ }
+
+ void task1(void)
+ {
+ r0 = smp_load_acquire(&y);
+ r1 = READ_ONCE(x);
+ }
+
+If "x" and "y" are both initially zero, then either r0's final value
+will be zero or r1's final value will be one, thus providing the required
+ordering.
+
+
+RCU Read-Side Ordering
+----------------------
+
+This category includes read-side markers such as rcu_read_lock()
+and rcu_read_unlock() as well as pointer-traversal primitives such as
+rcu_dereference() and srcu_dereference().
+
+Compared to locking primitives and RMW atomic operations, markers
+for RCU read-side critical sections incur very low overhead because
+they interact only with the corresponding grace-period primitives.
+For example, the rcu_read_lock() and rcu_read_unlock() markers interact
+with synchronize_rcu(), synchronize_rcu_expedited(), and call_rcu().
+The way this works is that if a given call to synchronize_rcu() cannot
+prove that it started before a given call to rcu_read_lock(), then
+that synchronize_rcu() must block until the matching rcu_read_unlock()
+is reached. For more information, please see the synchronize_rcu()
+docbook header comment and the material in Documentation/RCU.
+
+RCU's pointer-traversal primitives, including rcu_dereference() and
+srcu_dereference(), order their load (which must be a pointer) against any
+of the CPU's subsequent memory accesses whose address has been calculated
+from the value loaded. There is said to be an *address dependency*
+from the value returned by the rcu_dereference() or srcu_dereference()
+to that subsequent memory access.
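+
+For example, in the following sketch (assuming a global RCU-protected
+pointer "gp" and a structure with an integer field "a"), the load from
+p->a is ordered after the rcu_dereference() by the address dependency:
+
+	rcu_read_lock();
+	p = rcu_dereference(gp);
+	if (p)
+		r1 = READ_ONCE(p->a);	// Address depends on the loaded pointer.
+	rcu_read_unlock();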
+
+A call to rcu_dereference() for a given RCU-protected pointer is
+usually paired with a call to rcu_assign_pointer() for that
+same pointer in much the same way that a call to smp_load_acquire() is
+paired with a call to smp_store_release(). Calls to rcu_dereference()
+and rcu_assign_pointer() are often buried in other APIs, for example,
+the RCU list API members defined in include/linux/rculist.h. For more
+information, please see the docbook headers in that file, the most
+recent LWN article on the RCU API (https://lwn.net/Articles/777036/),
+and of course the material in Documentation/RCU.
+
+If the pointer value is manipulated between the rcu_dereference()
+that returned it and a later dereference of that pointer, please read
+Documentation/RCU/rcu_dereference.rst. It can also be quite helpful to
+review uses in the Linux kernel.
+
+
+Control Dependencies
+--------------------
+
+A control dependency extends from a marked load (READ_ONCE() or stronger)
+through an "if" condition to a marked store (WRITE_ONCE() or stronger)
+that is executed only by one of the legs of that "if" statement.
+Control dependencies are so named because they are mediated by
+control-flow instructions such as comparisons and conditional branches.
+
+In short, you can use a control dependency to enforce ordering between
+a READ_ONCE() and a WRITE_ONCE() when there is an "if" condition
+between them. The canonical example is as follows:
+
+ q = READ_ONCE(a);
+ if (q)
+ WRITE_ONCE(b, 1);
+
+In this case, all CPUs would see the read from "a" as happening before
+the write to "b".
+
+However, control dependencies are easily destroyed by compiler
+optimizations, so any use of control dependencies must take into account
+all of the compilers used to build the Linux kernel. Please see the
+"control-dependencies.txt" file for more information.
+
+
+Unordered Accesses
+==================
+
+Each of these two categories of unordered accesses has a section below:
+
+a. Unordered marked operations.
+
+b. Unmarked C-language accesses.
+
+
+Unordered Marked Operations
+---------------------------
+
+Unordered operations to different variables are just that, unordered.
+However, if a group of CPUs apply these operations to a single variable,
+all the CPUs will agree on the operation order. Of course, the ordering
+of unordered marked accesses can also be constrained using the mechanisms
+described earlier in this document.
+
+These operations come in three categories:
+
+o Marked writes, such as WRITE_ONCE() and atomic_set(). These
+	primitives require the compiler to emit the corresponding store
+ instructions in the expected execution order, thus suppressing
+ a number of destructive optimizations. However, they provide no
+ hardware ordering guarantees, and in fact many CPUs will happily
+ reorder marked writes with each other or with other unordered
+ operations, unless these operations are to the same variable.
+
+o Marked reads, such as READ_ONCE() and atomic_read(). These
+	primitives require the compiler to emit the corresponding load
+ instructions in the expected execution order, thus suppressing
+ a number of destructive optimizations. However, they provide no
+ hardware ordering guarantees, and in fact many CPUs will happily
+ reorder marked reads with each other or with other unordered
+ operations, unless these operations are to the same variable.
+
+o Unordered RMW atomic operations. These are non-value-returning
+ RMW atomic operations whose names do not end in _acquire or
+ _release, and also value-returning RMW operations whose names
+ end in _relaxed. Examples include atomic_add(), atomic_or(),
+ and atomic64_fetch_xor_relaxed(). These operations do carry
+ out the specified RMW operation atomically, for example, five
+ concurrent atomic_inc() operations applied to a given variable
+ will reliably increase the value of that variable by five.
+ However, many CPUs will happily reorder these operations with
+ each other or with other unordered operations.
+
+ This category of operations can be efficiently ordered using
+ smp_mb__before_atomic() and smp_mb__after_atomic(), as was
+ discussed in the "RMW Ordering Augmentation Barriers" section.
+
+In short, these operations can be freely reordered unless they are all
+operating on a single variable or unless they are constrained by one of
+the operations called out earlier in this document.
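+
+For example, in the following sketch, other CPUs might observe these two
+stores in either order unless additional ordering is supplied:
+
+	WRITE_ONCE(x, 1);	// Marked, but unordered...
+	WRITE_ONCE(y, 1);	// ...so other CPUs may see the store to "y" first.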
+
+
+Unmarked C-Language Accesses
+----------------------------
+
+Unmarked C-language accesses are normal variable accesses to normal
+variables, that is, to variables that are not "volatile" and are not
+C11 atomic variables. These operations provide no ordering guarantees,
+and further do not guarantee "atomic" access. For example, the compiler
+might (and sometimes does) split a plain C-language store into multiple
+smaller stores. A load from that same variable running on some other
+CPU while such a store is executing might see a value that is a mashup
+of the old value and the new value.
+
+Unmarked C-language accesses are unordered, and are also subject to
+any number of compiler optimizations, many of which can break your
+concurrent code.  It is possible to use unmarked C-language accesses for
+shared variables that are subject to concurrent access, but great care
+is required on an ongoing basis. The compiler-constraining barrier()
+primitive can be helpful, as can the various ordering primitives discussed
+in this document. It nevertheless bears repeating that use of unmarked
+C-language accesses requires careful attention to not just your code,
+but to all the compilers that might be used to build it. Such compilers
+might replace a series of loads with a single load, and might replace
+a series of stores with a single store. Some compilers will even split
+a single store into multiple smaller stores.
+
+But there are some ways of using unmarked C-language accesses for shared
+variables without such worries:
+
+o Guard all accesses to a given variable by a particular lock,
+ so that there are never concurrent conflicting accesses to
+ that variable. (There are "conflicting accesses" when
+ (1) at least one of the concurrent accesses to a variable is an
+	unmarked C-language access and (2) at least one of those accesses
+	is a write, whether marked or not.)  A minimal sketch of this
+	approach appears after this list.
+
+o As above, but using other synchronization primitives such
+ as reader-writer locks or sequence locks.
+
+o Use locking or other means to ensure that all concurrent accesses
+ to a given variable are reads.
+
+o Restrict use of a given variable to statistics or heuristics
+ where the occasional bogus value can be tolerated.
+
+o Declare the accessed variables as C11 atomics.
+ https://lwn.net/Articles/691128/
+
+o Declare the accessed variables as "volatile".
+
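+As a minimal sketch of the first option above (with hypothetical names),
+all accesses to "counter" are carried out while holding "counter_lock",
+so the plain C-language accesses never conflict:
+
+	spin_lock(&counter_lock);
+	counter++;		// Safe: no concurrent conflicting accesses.
+	spin_unlock(&counter_lock);
+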
+If you need to live more dangerously, please do take the time to
+understand the compilers. One place to start is these two LWN
+articles:
+
+Who's afraid of a big bad optimizing compiler?
+ https://lwn.net/Articles/793253
+Calibrating your fear of big bad optimizing compilers
+ https://lwn.net/Articles/799218
+
+Used properly, unmarked C-language accesses can reduce overhead on
+fastpaths. However, the price is great care and continual attention
+to your compiler as new versions come out and as new optimizations
+are enabled.
diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
index 7fe8d7aa3029..03f58b11c252 100644
--- a/tools/memory-model/Documentation/recipes.txt
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -1,7 +1,7 @@
This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances. Many of these recipes include example code from
-v4.13 of the Linux kernel.
+v5.7 of the Linux kernel.
The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
@@ -126,7 +126,7 @@ However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:
- /* See Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus. */
+ /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
void CPU0(void)
{
spin_lock(&mylock);
@@ -278,7 +278,7 @@ is present if the value loaded determines the address of a later access
first place (control dependency). Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
-In lib/prime_numbers.c, the expand_to_next_prime() function invokes
+In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference(). This combination mediates access to a bit vector
that is expanded as additional primes are needed.
diff --git a/tools/memory-model/Documentation/references.txt b/tools/memory-model/Documentation/references.txt
index b177f3e4a614..c5fdfd19df24 100644
--- a/tools/memory-model/Documentation/references.txt
+++ b/tools/memory-model/Documentation/references.txt
@@ -73,6 +73,18 @@ o Christopher Pulte, Shaked Flur, Will Deacon, Jon French,
Linux-kernel memory model
=========================
+o Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel
+ Lustig, Luc Maranget, Paul E. McKenney, Andrea Parri, Nicholas
+ Piggin, Alan Stern, Akira Yokosawa, and Peter Zijlstra.
+ 2019. "Calibrating your fear of big bad optimizing compilers"
+ Linux Weekly News. https://lwn.net/Articles/799218/
+
+o Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel
+ Lustig, Luc Maranget, Paul E. McKenney, Andrea Parri, Nicholas
+ Piggin, Alan Stern, Akira Yokosawa, and Peter Zijlstra.
+ 2019. "Who's afraid of a big bad optimizing compiler?"
+ Linux Weekly News. https://lwn.net/Articles/793253/
+
o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and
Alan Stern. 2018. "Frightening small children and disconcerting
grown-ups: Concurrency in the Linux kernel". In Proceedings of
@@ -88,6 +100,11 @@ o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and
Alan Stern. 2017. "A formal kernel memory-ordering model (part 2)"
Linux Weekly News. https://lwn.net/Articles/720550/
+o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and
+ Alan Stern. 2017-2019. "A Formal Model of Linux-Kernel Memory
+ Ordering" (backup material for the LWN articles)
+ https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/
+
Memory-model tooling
====================
@@ -103,12 +120,12 @@ o Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. "Herding
o Jade Alglave, Patrick Cousot, and Luc Maranget. 2016. "Syntax and
semantics of the weak consistency model specification language
- cat". CoRR abs/1608.07531 (2016). http://arxiv.org/abs/1608.07531
+ cat". CoRR abs/1608.07531 (2016). https://arxiv.org/abs/1608.07531
Memory-model comparisons
========================
o Paul E. McKenney, Ulrich Weigand, Andrea Parri, and Boqun
- Feng. 2016. "Linux-Kernel Memory Model". (6 June 2016).
- http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0124r2.html.
+ Feng. 2018. "Linux-Kernel Memory Model". (27 September 2018).
+ http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html.
diff --git a/tools/memory-model/Documentation/simple.txt b/tools/memory-model/Documentation/simple.txt
new file mode 100644
index 000000000000..4c789ec8334f
--- /dev/null
+++ b/tools/memory-model/Documentation/simple.txt
@@ -0,0 +1,270 @@
+This document provides options for those wishing to keep their
+memory-ordering lives simple, as is necessary for those whose domain
+is complex. After all, there are bugs other than memory-ordering bugs,
+and the time spent gaining memory-ordering knowledge is not available
+for gaining domain knowledge.  Furthermore, the Linux-kernel memory model
+(LKMM) is quite complex, with subtle differences in code often having
+dramatic effects on correctness.
+
+The options near the beginning of this list are quite simple. The idea
+is not that kernel hackers don't already know about them, but rather
+that they might need the occasional reminder.
+
+Please note that this is a generic guide, and that specific subsystems
+will often have special requirements or idioms. For example, developers
+of MMIO-based device drivers will often need to use mb(), rmb(), and
+wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
+to be more natural than smp_load_acquire() and smp_store_release().
+On the other hand, those coming in from other environments will likely
+be more familiar with these last two.
+
+
+Single-threaded code
+====================
+
+In single-threaded code, there is no reordering, at least assuming
+that your toolchain and hardware are working correctly. In addition,
+it is generally a mistake to assume your code will only run in a
+single-threaded context, as the kernel can enter the same code path on
+multiple CPUs at the same time.  One important exception is a function
+that makes
+no external data references.
+
+In the general case, you will need to take explicit steps to ensure that
+your code really is executed within a single thread that does not access
+shared variables. A simple way to achieve this is to define a global lock
+that you acquire at the beginning of your code and release at the end,
+taking care to ensure that all references to your code's shared data are
+also carried out under that same lock. Because only one thread can hold
+this lock at a given time, your code will be executed single-threaded.
+This approach is called "code locking".
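+
+A minimal sketch of code locking, using hypothetical names:
+
+	static DEFINE_SPINLOCK(my_code_lock);
+
+	void my_rarely_used_operation(void)
+	{
+		spin_lock(&my_code_lock);
+		/* All accesses to this code's shared data go here. */
+		spin_unlock(&my_code_lock);
+	}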
+
+Code locking can severely limit both performance and scalability, so it
+should be used with caution, and only on code paths that execute rarely.
+After all, a huge amount of effort was required to remove the Linux
+kernel's old "Big Kernel Lock", so let's please be very careful about
+adding new "little kernel locks".
+
+One of the advantages of locking is that, in happy contrast with the
+year 1981, almost all kernel developers are very familiar with locking.
+The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
+the formerly feared deadlock scenarios.
+
+Please use the standard locking primitives provided by the kernel rather
+than rolling your own. For one thing, the standard primitives interact
+properly with lockdep. For another thing, these primitives have been
+tuned to deal better with high contention. And for one final thing, it is
+surprisingly hard to correctly code production-quality lock acquisition
+and release functions. After all, even simple non-production-quality
+locking functions must carefully prevent both the CPU and the compiler
+from moving code in either direction across the locking function.
+
+Despite the scalability limitations of single-threaded code, RCU
+takes this approach for much of its grace-period processing and also
+for early-boot operation. The reason RCU is able to scale despite
+single-threaded grace-period processing is use of batching, where all
+updates that accumulated during one grace period are handled by the
+next one. In other words, slowing down grace-period processing makes
+it more efficient. Nor is RCU unique: Similar batching optimizations
+are used in many I/O operations.
+
+
+Packaged code
+=============
+
+Even if performance and scalability concerns prevent your code from
+being completely single-threaded, it is often possible to use library
+functions that handle the concurrency nearly or entirely on their own.
+This approach delegates any LKMM worries to the library maintainer.
+
+In the kernel, what is the "library"? Quite a bit. It includes the
+contents of the lib/ directory, much of the include/linux/ directory, and
+a number of other heavily used APIs.  Notable examples include
+the list macros (for example, include/linux/{,rcu}list.h), workqueues,
+smp_call_function(), and the various hash tables and search trees.
+
+
+Data locking
+============
+
+With code locking, we use single-threaded code execution to guarantee
+serialized access to the data that the code is accessing. However,
+we can also achieve this by instead associating the lock with specific
+instances of the data structures. This creates a "critical section"
+in the code execution that will execute as though it is single threaded.
+By placing all the accesses and modifications to a shared data structure
+inside a critical section, we ensure that the execution context that
+holds the lock has exclusive access to the shared data.
+
+The poster boy for this approach is the hash table, where placing a lock
+in each hash bucket allows operations on different buckets to proceed
+concurrently. This works because the buckets do not overlap with each
+other, so that an operation on one bucket does not interfere with any
+other bucket.
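+
+A sketch of per-bucket data locking, using hypothetical structure and
+function names:
+
+	struct my_bucket {
+		spinlock_t lock;
+		struct hlist_head head;
+	};
+
+	void my_bucket_add(struct my_bucket *b, struct hlist_node *node)
+	{
+		spin_lock(&b->lock);	/* Serializes only this one bucket. */
+		hlist_add_head(node, &b->head);
+		spin_unlock(&b->lock);
+	}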
+
+As the number of buckets increases, data locking scales naturally.
+In particular, if the amount of data increases with the number of CPUs,
+increasing the number of buckets as the number of CPUs increases results
+in a naturally scalable data structure.
+
+
+Per-CPU processing
+==================
+
+Partitioning processing and data over CPUs allows each CPU to take
+a single-threaded approach while providing excellent performance and
+scalability. Of course, there is no free lunch: The dark side of this
+excellence is substantially increased memory footprint.
+
+In addition, it is sometimes necessary to update some global
+view of this processing and data, in which case something like locking
+must be used to protect this global view. This is the approach taken
+by the percpu_counter infrastructure. In many cases, there are already
+generic/library variants of commonly used per-cpu constructs available.
+Please use them rather than rolling your own.
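+
+A minimal sketch of per-CPU processing, using hypothetical names:
+
+	static DEFINE_PER_CPU(unsigned long, my_event_count);
+
+	void my_count_event(void)
+	{
+		this_cpu_inc(my_event_count);	/* Touches only this CPU's counter. */
+	}
+
+A slowpath needing a global view could then sum the per-CPU counters
+while holding a lock, as described above.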
+
+RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
+data sets. For example, each CPU does private quiescent-state processing
+within its instance of the per-CPU rcu_data structure, and then uses data
+locking to report quiescent states up the grace-period combining tree.
+
+
+Packaged primitives: Sequence locking
+=====================================
+
+Lockless programming is considered by many to be more difficult than
+lock-based programming, but there are a few lockless design patterns that
+have been built out into an API. One of these APIs is sequence locking.
+Although this API can be used in extremely complex ways, there are simple
+and effective ways of using it that avoid the need to pay attention to
+memory ordering.
+
+The basic keep-things-simple rule for sequence locking is "do not write
+in read-side code". Yes, you can do writes from within sequence-locking
+readers, but it won't be so simple. For example, such writes will be
+lockless and should be idempotent.
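+
+A sketch of a write-free sequence-locking reader, using hypothetical
+names:
+
+	static DEFINE_SEQLOCK(my_seqlock);
+	static unsigned long my_a, my_b;	/* Protected by my_seqlock. */
+
+	void my_read_pair(unsigned long *pa, unsigned long *pb)
+	{
+		unsigned int seq;
+
+		do {
+			seq = read_seqbegin(&my_seqlock);
+			*pa = my_a;	/* Reads only: no writes in read-side code. */
+			*pb = my_b;
+		} while (read_seqretry(&my_seqlock, seq));
+	}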
+
+For more sophisticated use cases, LKMM can guide you, including use
+cases involving combining sequence locking with other synchronization
+primitives. (LKMM does not yet know about sequence locking, so it is
+currently necessary to open-code it in your litmus tests.)
+
+Additional information may be found in include/linux/seqlock.h.
+
+Packaged primitives: RCU
+========================
+
+Another lockless design pattern that has been baked into an API
+is RCU. The Linux kernel makes sophisticated use of RCU, but the
+keep-things-simple rules for RCU are "do not write in read-side code"
+and "do not update anything that is visible to and accessed by readers",
+and "protect updates with locking".
+
+These rules are illustrated by the functions foo_update_a() and
+foo_get_a() shown in Documentation/RCU/whatisRCU.rst. Additional
+RCU usage patterns may be found in Documentation/RCU and in the
+source code.
+
+
+Packaged primitives: Atomic operations
+======================================
+
+Back in the day, the Linux kernel had three types of atomic operations:
+
+1. Initialization and read-out, such as atomic_set() and atomic_read().
+
+2. Operations that did not return a value and provided no ordering,
+ such as atomic_inc() and atomic_dec().
+
+3. Operations that returned a value and provided full ordering, such as
+ atomic_add_return() and atomic_dec_and_test(). Note that some
+ value-returning operations provide full ordering only conditionally.
+ For example, cmpxchg() provides ordering only upon success.
+
+More recent kernels have operations that return a value but do not
+provide full ordering. These are flagged with either a _relaxed()
+suffix (providing no ordering), or an _acquire() or _release() suffix
+(providing limited ordering).
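+
+The following sketch (assuming an atomic_t named "ctr") illustrates these
+categories:
+
+	atomic_set(&ctr, 0);			/* Initialization, no ordering. */
+	atomic_inc(&ctr);			/* Void RMW, no ordering. */
+	r0 = atomic_inc_return(&ctr);		/* RMW, full ordering. */
+	r1 = atomic_inc_return_relaxed(&ctr);	/* RMW, no ordering. */
+	r2 = atomic_inc_return_acquire(&ctr);	/* RMW, acquire ordering. */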
+
+Additional information may be found in these files:
+
+Documentation/atomic_t.txt
+Documentation/atomic_bitops.txt
+Documentation/core-api/refcount-vs-atomic.rst
+
+Reading code using these primitives is often also quite helpful.
+
+
+Lockless, fully ordered
+=======================
+
+When using locking, there often comes a time when it is necessary
+to access some variable or another without holding the data lock
+that serializes access to that variable.
+
+If you want to keep things simple, use the initialization and read-out
+operations from the previous section only when there are no racing
+accesses. Otherwise, use only fully ordered operations when accessing
+or modifying the variable. This approach guarantees that code prior
+to a given access to that variable will be seen by all CPUs as having
+happened before any code following any later access to that same variable.
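+
+For example, in the following sketch (hypothetical names, with "state"
+initially zero), a reader seeing a nonzero value of "state" is also
+guaranteed to see the earlier store to "x":
+
+	/* Updater, not holding the data lock. */
+	WRITE_ONCE(x, 1);
+	r0 = atomic_inc_return(&state);		/* Fully ordered RMW. */
+
+	/* Lockless reader on some other CPU. */
+	r1 = atomic_add_return(0, &state);	/* Fully ordered read-out. */
+	r2 = READ_ONCE(x);			/* Nonzero r1 guarantees r2 == 1. */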
+
+Please note that per-CPU functions are not atomic operations and
+hence they do not provide any ordering guarantees at all.
+
+If the lockless accesses are frequently executed reads that are used
+only for heuristics, or if they are frequently executed writes that
+are used only for statistics, please see the next section.
+
+
+Lockless statistics and heuristics
+==================================
+
+Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
+WRITE_ONCE() can safely be used in some cases. These primitives provide
+no ordering, but they do prevent the compiler from carrying out a number
+of destructive optimizations (for which please see the next section).
+One example use for these primitives is statistics, such as per-CPU
+counters exemplified by the rt_cache_stat structure's routing-cache
+statistics counters. Another example use case is heuristics, such as
+the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
+controlling how often RCU scans for idle CPUs.
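+
+A sketch of such a counter (hypothetical names), where the occasional
+lost or stale count is acceptable:
+
+	/* Updater: unordered, and concurrent updates might be lost. */
+	WRITE_ONCE(my_stat_count, READ_ONCE(my_stat_count) + 1);
+
+	/* Reader: unordered, and the value might be slightly stale. */
+	r0 = READ_ONCE(my_stat_count);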
+
+But be careful. "Unordered" really does mean "unordered". It is all
+too easy to assume ordering, and this assumption must be avoided when
+using these primitives.
+
+
+Don't let the compiler trip you up
+==================================
+
+It can be quite tempting to use plain C-language accesses for lockless
+loads from and stores to shared variables. Although this is both
+possible and quite common in the Linux kernel, it does require a
+surprising amount of analysis, care, and knowledge about the compiler.
+Yes, some decades ago it was not unfair to consider a C compiler to be
+an assembler with added syntax and better portability, but the advent of
+sophisticated optimizing compilers means that those days are long gone.
+Today's optimizing compilers can profoundly rewrite your code during the
+translation process, and have long been ready, willing, and able to do so.
+
+Therefore, if you really need to use C-language assignments instead of
+READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
+understanding of both the C standard and your compiler. Here are some
+introductory references and some tooling to start you on this noble quest:
+
+Who's afraid of a big bad optimizing compiler?
+ https://lwn.net/Articles/793253/
+Calibrating your fear of big bad optimizing compilers
+ https://lwn.net/Articles/799218/
+Concurrency bugs should fear the big bad data-race detector (part 1)
+ https://lwn.net/Articles/816850/
+Concurrency bugs should fear the big bad data-race detector (part 2)
+ https://lwn.net/Articles/816854/
+
+
+More complex use cases
+======================
+
+If the alternatives above do not do what you need, please look at the
+recipes-pairs.txt file to peel off the next layer of the memory-ordering
+onion.