aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/locking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/locking')
-rw-r--r--Documentation/locking/crossrelease.txt874
-rw-r--r--Documentation/locking/rt-mutex-design.txt432
-rw-r--r--Documentation/locking/rt-mutex.txt58
3 files changed, 1005 insertions, 359 deletions
diff --git a/Documentation/locking/crossrelease.txt b/Documentation/locking/crossrelease.txt
new file mode 100644
index 000000000000..bdf1423d5f99
--- /dev/null
+++ b/Documentation/locking/crossrelease.txt
@@ -0,0 +1,874 @@
+Crossrelease
+============
+
+Started by Byungchul Park <byungchul.park@lge.com>
+
+Contents:
+
+ (*) Background
+
+ - What causes deadlock
+ - How lockdep works
+
+ (*) Limitation
+
+ - Limit lockdep
+ - Pros from the limitation
+ - Cons from the limitation
+ - Relax the limitation
+
+ (*) Crossrelease
+
+ - Introduce crossrelease
+ - Introduce commit
+
+ (*) Implementation
+
+ - Data structures
+ - How crossrelease works
+
+ (*) Optimizations
+
+ - Avoid duplication
+ - Lockless for hot paths
+
+ (*) APPENDIX A: What lockdep does to work aggresively
+
+ (*) APPENDIX B: How to avoid adding false dependencies
+
+
+==========
+Background
+==========
+
+What causes deadlock
+--------------------
+
+A deadlock occurs when a context is waiting for an event to happen,
+which is impossible because another (or the) context who can trigger the
+event is also waiting for another (or the) event to happen, which is
+also impossible due to the same reason.
+
+For example:
+
+ A context going to trigger event C is waiting for event A to happen.
+ A context going to trigger event A is waiting for event B to happen.
+ A context going to trigger event B is waiting for event C to happen.
+
+A deadlock occurs when these three wait operations run at the same time,
+because event C cannot be triggered if event A does not happen, which in
+turn cannot be triggered if event B does not happen, which in turn
+cannot be triggered if event C does not happen. After all, no event can
+be triggered since any of them never meets its condition to wake up.
+
+A dependency might exist between two waiters and a deadlock might happen
+due to an incorrect releationship between dependencies. Thus, we must
+define what a dependency is first. A dependency exists between them if:
+
+ 1. There are two waiters waiting for each event at a given time.
+ 2. The only way to wake up each waiter is to trigger its event.
+ 3. Whether one can be woken up depends on whether the other can.
+
+Each wait in the example creates its dependency like:
+
+ Event C depends on event A.
+ Event A depends on event B.
+ Event B depends on event C.
+
+ NOTE: Precisely speaking, a dependency is one between whether a
+ waiter for an event can be woken up and whether another waiter for
+ another event can be woken up. However from now on, we will describe
+ a dependency as if it's one between an event and another event for
+ simplicity.
+
+And they form circular dependencies like:
+
+ -> C -> A -> B -
+ / \
+ \ /
+ ----------------
+
+ where 'A -> B' means that event A depends on event B.
+
+Such circular dependencies lead to a deadlock since no waiter can meet
+its condition to wake up as described.
+
+CONCLUSION
+
+Circular dependencies cause a deadlock.
+
+
+How lockdep works
+-----------------
+
+Lockdep tries to detect a deadlock by checking dependencies created by
+lock operations, acquire and release. Waiting for a lock corresponds to
+waiting for an event, and releasing a lock corresponds to triggering an
+event in the previous section.
+
+In short, lockdep does:
+
+ 1. Detect a new dependency.
+ 2. Add the dependency into a global graph.
+ 3. Check if that makes dependencies circular.
+ 4. Report a deadlock or its possibility if so.
+
+For example, consider a graph built by lockdep that looks like:
+
+ A -> B -
+ \
+ -> E
+ /
+ C -> D -
+
+ where A, B,..., E are different lock classes.
+
+Lockdep will add a dependency into the graph on detection of a new
+dependency. For example, it will add a dependency 'E -> C' when a new
+dependency between lock E and lock C is detected. Then the graph will be:
+
+ A -> B -
+ \
+ -> E -
+ / \
+ -> C -> D - \
+ / /
+ \ /
+ ------------------
+
+ where A, B,..., E are different lock classes.
+
+This graph contains a subgraph which demonstrates circular dependencies:
+
+ -> E -
+ / \
+ -> C -> D - \
+ / /
+ \ /
+ ------------------
+
+ where C, D and E are different lock classes.
+
+This is the condition under which a deadlock might occur. Lockdep
+reports it on detection after adding a new dependency. This is the way
+how lockdep works.
+
+CONCLUSION
+
+Lockdep detects a deadlock or its possibility by checking if circular
+dependencies were created after adding each new dependency.
+
+
+==========
+Limitation
+==========
+
+Limit lockdep
+-------------
+
+Limiting lockdep to work on only typical locks e.g. spin locks and
+mutexes, which are released within the acquire context, the
+implementation becomes simple but its capacity for detection becomes
+limited. Let's check pros and cons in next section.
+
+
+Pros from the limitation
+------------------------
+
+Given the limitation, when acquiring a lock, locks in a held_locks
+cannot be released if the context cannot acquire it so has to wait to
+acquire it, which means all waiters for the locks in the held_locks are
+stuck. It's an exact case to create dependencies between each lock in
+the held_locks and the lock to acquire.
+
+For example:
+
+ CONTEXT X
+ ---------
+ acquire A
+ acquire B /* Add a dependency 'A -> B' */
+ release B
+ release A
+
+ where A and B are different lock classes.
+
+When acquiring lock A, the held_locks of CONTEXT X is empty thus no
+dependency is added. But when acquiring lock B, lockdep detects and adds
+a new dependency 'A -> B' between lock A in the held_locks and lock B.
+They can be simply added whenever acquiring each lock.
+
+And data required by lockdep exists in a local structure, held_locks
+embedded in task_struct. Forcing to access the data within the context,
+lockdep can avoid racy problems without explicit locks while handling
+the local data.
+
+Lastly, lockdep only needs to keep locks currently being held, to build
+a dependency graph. However, relaxing the limitation, it needs to keep
+even locks already released, because a decision whether they created
+dependencies might be long-deferred.
+
+To sum up, we can expect several advantages from the limitation:
+
+ 1. Lockdep can easily identify a dependency when acquiring a lock.
+ 2. Races are avoidable while accessing local locks in a held_locks.
+ 3. Lockdep only needs to keep locks currently being held.
+
+CONCLUSION
+
+Given the limitation, the implementation becomes simple and efficient.
+
+
+Cons from the limitation
+------------------------
+
+Given the limitation, lockdep is applicable only to typical locks. For
+example, page locks for page access or completions for synchronization
+cannot work with lockdep.
+
+Can we detect deadlocks below, under the limitation?
+
+Example 1:
+
+ CONTEXT X CONTEXT Y CONTEXT Z
+ --------- --------- ----------
+ mutex_lock A
+ lock_page B
+ lock_page B
+ mutex_lock A /* DEADLOCK */
+ unlock_page B held by X
+ unlock_page B
+ mutex_unlock A
+ mutex_unlock A
+
+ where A and B are different lock classes.
+
+No, we cannot.
+
+Example 2:
+
+ CONTEXT X CONTEXT Y
+ --------- ---------
+ mutex_lock A
+ mutex_lock A
+ wait_for_complete B /* DEADLOCK */
+ complete B
+ mutex_unlock A
+ mutex_unlock A
+
+ where A is a lock class and B is a completion variable.
+
+No, we cannot.
+
+CONCLUSION
+
+Given the limitation, lockdep cannot detect a deadlock or its
+possibility caused by page locks or completions.
+
+
+Relax the limitation
+--------------------
+
+Under the limitation, things to create dependencies are limited to
+typical locks. However, synchronization primitives like page locks and
+completions, which are allowed to be released in any context, also
+create dependencies and can cause a deadlock. So lockdep should track
+these locks to do a better job. We have to relax the limitation for
+these locks to work with lockdep.
+
+Detecting dependencies is very important for lockdep to work because
+adding a dependency means adding an opportunity to check whether it
+causes a deadlock. The more lockdep adds dependencies, the more it
+thoroughly works. Thus Lockdep has to do its best to detect and add as
+many true dependencies into a graph as possible.
+
+For example, considering only typical locks, lockdep builds a graph like:
+
+ A -> B -
+ \
+ -> E
+ /
+ C -> D -
+
+ where A, B,..., E are different lock classes.
+
+On the other hand, under the relaxation, additional dependencies might
+be created and added. Assuming additional 'FX -> C' and 'E -> GX' are
+added thanks to the relaxation, the graph will be:
+
+ A -> B -
+ \
+ -> E -> GX
+ /
+ FX -> C -> D -
+
+ where A, B,..., E, FX and GX are different lock classes, and a suffix
+ 'X' is added on non-typical locks.
+
+The latter graph gives us more chances to check circular dependencies
+than the former. However, it might suffer performance degradation since
+relaxing the limitation, with which design and implementation of lockdep
+can be efficient, might introduce inefficiency inevitably. So lockdep
+should provide two options, strong detection and efficient detection.
+
+Choosing efficient detection:
+
+ Lockdep works with only locks restricted to be released within the
+ acquire context. However, lockdep works efficiently.
+
+Choosing strong detection:
+
+ Lockdep works with all synchronization primitives. However, lockdep
+ suffers performance degradation.
+
+CONCLUSION
+
+Relaxing the limitation, lockdep can add additional dependencies giving
+additional opportunities to check circular dependencies.
+
+
+============
+Crossrelease
+============
+
+Introduce crossrelease
+----------------------
+
+In order to allow lockdep to handle additional dependencies by what
+might be released in any context, namely 'crosslock', we have to be able
+to identify those created by crosslocks. The proposed 'crossrelease'
+feature provoides a way to do that.
+
+Crossrelease feature has to do:
+
+ 1. Identify dependencies created by crosslocks.
+ 2. Add the dependencies into a dependency graph.
+
+That's all. Once a meaningful dependency is added into graph, then
+lockdep would work with the graph as it did. The most important thing
+crossrelease feature has to do is to correctly identify and add true
+dependencies into the global graph.
+
+A dependency e.g. 'A -> B' can be identified only in the A's release
+context because a decision required to identify the dependency can be
+made only in the release context. That is to decide whether A can be
+released so that a waiter for A can be woken up. It cannot be made in
+other than the A's release context.
+
+It's no matter for typical locks because each acquire context is same as
+its release context, thus lockdep can decide whether a lock can be
+released in the acquire context. However for crosslocks, lockdep cannot
+make the decision in the acquire context but has to wait until the
+release context is identified.
+
+Therefore, deadlocks by crosslocks cannot be detected just when it
+happens, because those cannot be identified until the crosslocks are
+released. However, deadlock possibilities can be detected and it's very
+worth. See 'APPENDIX A' section to check why.
+
+CONCLUSION
+
+Using crossrelease feature, lockdep can work with what might be released
+in any context, namely crosslock.
+
+
+Introduce commit
+----------------
+
+Since crossrelease defers the work adding true dependencies of
+crosslocks until they are actually released, crossrelease has to queue
+all acquisitions which might create dependencies with the crosslocks.
+Then it identifies dependencies using the queued data in batches at a
+proper time. We call it 'commit'.
+
+There are four types of dependencies:
+
+1. TT type: 'typical lock A -> typical lock B'
+
+ Just when acquiring B, lockdep can see it's in the A's release
+ context. So the dependency between A and B can be identified
+ immediately. Commit is unnecessary.
+
+2. TC type: 'typical lock A -> crosslock BX'
+
+ Just when acquiring BX, lockdep can see it's in the A's release
+ context. So the dependency between A and BX can be identified
+ immediately. Commit is unnecessary, too.
+
+3. CT type: 'crosslock AX -> typical lock B'
+
+ When acquiring B, lockdep cannot identify the dependency because
+ there's no way to know if it's in the AX's release context. It has
+ to wait until the decision can be made. Commit is necessary.
+
+4. CC type: 'crosslock AX -> crosslock BX'
+
+ When acquiring BX, lockdep cannot identify the dependency because
+ there's no way to know if it's in the AX's release context. It has
+ to wait until the decision can be made. Commit is necessary.
+ But, handling CC type is not implemented yet. It's a future work.
+
+Lockdep can work without commit for typical locks, but commit step is
+necessary once crosslocks are involved. Introducing commit, lockdep
+performs three steps. What lockdep does in each step is:
+
+1. Acquisition: For typical locks, lockdep does what it originally did
+ and queues the lock so that CT type dependencies can be checked using
+ it at the commit step. For crosslocks, it saves data which will be
+ used at the commit step and increases a reference count for it.
+
+2. Commit: No action is reauired for typical locks. For crosslocks,
+ lockdep adds CT type dependencies using the data saved at the
+ acquisition step.
+
+3. Release: No changes are required for typical locks. When a crosslock
+ is released, it decreases a reference count for it.
+
+CONCLUSION
+
+Crossrelease introduces commit step to handle dependencies of crosslocks
+in batches at a proper time.
+
+
+==============
+Implementation
+==============
+
+Data structures
+---------------
+
+Crossrelease introduces two main data structures.
+
+1. hist_lock
+
+ This is an array embedded in task_struct, for keeping lock history so
+ that dependencies can be added using them at the commit step. Since
+ it's local data, it can be accessed locklessly in the owner context.
+ The array is filled at the acquisition step and consumed at the
+ commit step. And it's managed in circular manner.
+
+2. cross_lock
+
+ One per lockdep_map exists. This is for keeping data of crosslocks
+ and used at the commit step.
+
+
+How crossrelease works
+----------------------
+
+It's the key of how crossrelease works, to defer necessary works to an
+appropriate point in time and perform in at once at the commit step.
+Let's take a look with examples step by step, starting from how lockdep
+works without crossrelease for typical locks.
+
+ acquire A /* Push A onto held_locks */
+ acquire B /* Push B onto held_locks and add 'A -> B' */
+ acquire C /* Push C onto held_locks and add 'B -> C' */
+ release C /* Pop C from held_locks */
+ release B /* Pop B from held_locks */
+ release A /* Pop A from held_locks */
+
+ where A, B and C are different lock classes.
+
+ NOTE: This document assumes that readers already understand how
+ lockdep works without crossrelease thus omits details. But there's
+ one thing to note. Lockdep pretends to pop a lock from held_locks
+ when releasing it. But it's subtly different from the original pop
+ operation because lockdep allows other than the top to be poped.
+
+In this case, lockdep adds 'the top of held_locks -> the lock to acquire'
+dependency every time acquiring a lock.
+
+After adding 'A -> B', a dependency graph will be:
+
+ A -> B
+
+ where A and B are different lock classes.
+
+And after adding 'B -> C', the graph will be:
+
+ A -> B -> C
+
+ where A, B and C are different lock classes.
+
+Let's performs commit step even for typical locks to add dependencies.
+Of course, commit step is not necessary for them, however, it would work
+well because this is a more general way.
+
+ acquire A
+ /*
+ * Queue A into hist_locks
+ *
+ * In hist_locks: A
+ * In graph: Empty
+ */
+
+ acquire B
+ /*
+ * Queue B into hist_locks
+ *
+ * In hist_locks: A, B
+ * In graph: Empty
+ */
+
+ acquire C
+ /*
+ * Queue C into hist_locks
+ *
+ * In hist_locks: A, B, C
+ * In graph: Empty
+ */
+
+ commit C
+ /*
+ * Add 'C -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire C: Nothing
+ *
+ * In hist_locks: A, B, C
+ * In graph: Empty
+ */
+
+ release C
+
+ commit B
+ /*
+ * Add 'B -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire B: C
+ *
+ * In hist_locks: A, B, C
+ * In graph: 'B -> C'
+ */
+
+ release B
+
+ commit A
+ /*
+ * Add 'A -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire A: B, C
+ *
+ * In hist_locks: A, B, C
+ * In graph: 'B -> C', 'A -> B', 'A -> C'
+ */
+
+ release A
+
+ where A, B and C are different lock classes.
+
+In this case, dependencies are added at the commit step as described.
+
+After commits for A, B and C, the graph will be:
+
+ A -> B -> C
+
+ where A, B and C are different lock classes.
+
+ NOTE: A dependency 'A -> C' is optimized out.
+
+We can see the former graph built without commit step is same as the
+latter graph built using commit steps. Of course the former way leads to
+earlier finish for building the graph, which means we can detect a
+deadlock or its possibility sooner. So the former way would be prefered
+when possible. But we cannot avoid using the latter way for crosslocks.
+
+Let's look at how commit steps work for crosslocks. In this case, the
+commit step is performed only on crosslock AX as real. And it assumes
+that the AX release context is different from the AX acquire context.
+
+ BX RELEASE CONTEXT BX ACQUIRE CONTEXT
+ ------------------ ------------------
+ acquire A
+ /*
+ * Push A onto held_locks
+ * Queue A into hist_locks
+ *
+ * In held_locks: A
+ * In hist_locks: A
+ * In graph: Empty
+ */
+
+ acquire BX
+ /*
+ * Add 'the top of held_locks -> BX'
+ *
+ * In held_locks: A
+ * In hist_locks: A
+ * In graph: 'A -> BX'
+ */
+
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ It must be guaranteed that the following operations are seen after
+ acquiring BX globally. It can be done by things like barrier.
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ acquire C
+ /*
+ * Push C onto held_locks
+ * Queue C into hist_locks
+ *
+ * In held_locks: C
+ * In hist_locks: C
+ * In graph: 'A -> BX'
+ */
+
+ release C
+ /*
+ * Pop C from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: C
+ * In graph: 'A -> BX'
+ */
+ acquire D
+ /*
+ * Push D onto held_locks
+ * Queue D into hist_locks
+ * Add 'the top of held_locks -> D'
+ *
+ * In held_locks: A, D
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ acquire E
+ /*
+ * Push E onto held_locks
+ * Queue E into hist_locks
+ *
+ * In held_locks: E
+ * In hist_locks: C, E
+ * In graph: 'A -> BX', 'A -> D'
+ */
+
+ release E
+ /*
+ * Pop E from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ release D
+ /*
+ * Pop D from held_locks
+ *
+ * In held_locks: A
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ commit BX
+ /*
+ * Add 'BX -> ?'
+ * What has been queued since acquire BX: C, E
+ *
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+
+ release BX
+ /*
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+ release A
+ /*
+ * Pop A from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+
+ where A, BX, C,..., E are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+Crossrelease considers all acquisitions after acqiuring BX are
+candidates which might create dependencies with BX. True dependencies
+will be determined when identifying the release context of BX. Meanwhile,
+all typical locks are queued so that they can be used at the commit step.
+And then two dependencies 'BX -> C' and 'BX -> E' are added at the
+commit step when identifying the release context.
+
+The final graph will be, with crossrelease:
+
+ -> C
+ /
+ -> BX -
+ / \
+ A - -> E
+ \
+ -> D
+
+ where A, BX, C,..., E are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+However, the final graph will be, without crossrelease:
+
+ A -> D
+
+ where A and D are different lock classes.
+
+The former graph has three more dependencies, 'A -> BX', 'BX -> C' and
+'BX -> E' giving additional opportunities to check if they cause
+deadlocks. This way lockdep can detect a deadlock or its possibility
+caused by crosslocks.
+
+CONCLUSION
+
+We checked how crossrelease works with several examples.
+
+
+=============
+Optimizations
+=============
+
+Avoid duplication
+-----------------
+
+Crossrelease feature uses a cache like what lockdep already uses for
+dependency chains, but this time it's for caching CT type dependencies.
+Once that dependency is cached, the same will never be added again.
+
+
+Lockless for hot paths
+----------------------
+
+To keep all locks for later use at the commit step, crossrelease adopts
+a local array embedded in task_struct, which makes access to the data
+lockless by forcing it to happen only within the owner context. It's
+like how lockdep handles held_locks. Lockless implmentation is important
+since typical locks are very frequently acquired and released.
+
+
+=================================================
+APPENDIX A: What lockdep does to work aggresively
+=================================================
+
+A deadlock actually occurs when all wait operations creating circular
+dependencies run at the same time. Even though they don't, a potential
+deadlock exists if the problematic dependencies exist. Thus it's
+meaningful to detect not only an actual deadlock but also its potential
+possibility. The latter is rather valuable. When a deadlock occurs
+actually, we can identify what happens in the system by some means or
+other even without lockdep. However, there's no way to detect possiblity
+without lockdep unless the whole code is parsed in head. It's terrible.
+Lockdep does the both, and crossrelease only focuses on the latter.
+
+Whether or not a deadlock actually occurs depends on several factors.
+For example, what order contexts are switched in is a factor. Assuming
+circular dependencies exist, a deadlock would occur when contexts are
+switched so that all wait operations creating the dependencies run
+simultaneously. Thus to detect a deadlock possibility even in the case
+that it has not occured yet, lockdep should consider all possible
+combinations of dependencies, trying to:
+
+1. Use a global dependency graph.
+
+ Lockdep combines all dependencies into one global graph and uses them,
+ regardless of which context generates them or what order contexts are
+ switched in. Aggregated dependencies are only considered so they are
+ prone to be circular if a problem exists.
+
+2. Check dependencies between classes instead of instances.
+
+ What actually causes a deadlock are instances of lock. However,
+ lockdep checks dependencies between classes instead of instances.
+ This way lockdep can detect a deadlock which has not happened but
+ might happen in future by others but the same class.
+
+3. Assume all acquisitions lead to waiting.
+
+ Although locks might be acquired without waiting which is essential
+ to create dependencies, lockdep assumes all acquisitions lead to
+ waiting since it might be true some time or another.
+
+CONCLUSION
+
+Lockdep detects not only an actual deadlock but also its possibility,
+and the latter is more valuable.
+
+
+==================================================
+APPENDIX B: How to avoid adding false dependencies
+==================================================
+
+Remind what a dependency is. A dependency exists if:
+
+ 1. There are two waiters waiting for each event at a given time.
+ 2. The only way to wake up each waiter is to trigger its event.
+ 3. Whether one can be woken up depends on whether the other can.
+
+For example:
+
+ acquire A
+ acquire B /* A dependency 'A -> B' exists */
+ release B
+ release A
+
+ where A and B are different lock classes.
+
+A depedency 'A -> B' exists since:
+
+ 1. A waiter for A and a waiter for B might exist when acquiring B.
+ 2. Only way to wake up each is to release what it waits for.
+ 3. Whether the waiter for A can be woken up depends on whether the
+ other can. IOW, TASK X cannot release A if it fails to acquire B.
+
+For another example:
+
+ TASK X TASK Y
+ ------ ------
+ acquire AX
+ acquire B /* A dependency 'AX -> B' exists */
+ release B
+ release AX held by Y
+
+ where AX and B are different lock classes, and a suffix 'X' is added
+ on crosslocks.
+
+Even in this case involving crosslocks, the same rule can be applied. A
+depedency 'AX -> B' exists since:
+
+ 1. A waiter for AX and a waiter for B might exist when acquiring B.
+ 2. Only way to wake up each is to release what it waits for.
+ 3. Whether the waiter for AX can be woken up depends on whether the
+ other can. IOW, TASK X cannot release AX if it fails to acquire B.
+
+Let's take a look at more complicated example:
+
+ TASK X TASK Y
+ ------ ------
+ acquire B
+ release B
+ fork Y
+ acquire AX
+ acquire C /* A dependency 'AX -> C' exists */
+ release C
+ release AX held by Y
+
+ where AX, B and C are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+Does a dependency 'AX -> B' exist? Nope.
+
+Two waiters are essential to create a dependency. However, waiters for
+AX and B to create 'AX -> B' cannot exist at the same time in this
+example. Thus the dependency 'AX -> B' cannot be created.
+
+It would be ideal if the full set of true ones can be considered. But
+we can ensure nothing but what actually happened. Relying on what
+actually happens at runtime, we can anyway add only true ones, though
+they might be a subset of true ones. It's similar to how lockdep works
+for typical locks. There might be more true dependencies than what
+lockdep has detected in runtime. Lockdep has no choice but to rely on
+what actually happens. Crossrelease also relies on it.
+
+CONCLUSION
+
+Relying on what actually happens, lockdep can avoid adding false
+dependencies.
diff --git a/Documentation/locking/rt-mutex-design.txt b/Documentation/locking/rt-mutex-design.txt
index 8666070d3189..6c6e8c2410de 100644
--- a/Documentation/locking/rt-mutex-design.txt
+++ b/Documentation/locking/rt-mutex-design.txt
@@ -97,9 +97,9 @@ waiter - A waiter is a struct that is stored on the stack of a blocked
a process being blocked on the mutex, it is fine to allocate
the waiter on the process's stack (local variable). This
structure holds a pointer to the task, as well as the mutex that
- the task is blocked on. It also has the plist node structures to
- place the task in the waiter_list of a mutex as well as the
- pi_list of a mutex owner task (described below).
+ the task is blocked on. It also has rbtree node structures to
+ place the task in the waiters rbtree of a mutex as well as the
+ pi_waiters rbtree of a mutex owner task (described below).
waiter is sometimes used in reference to the task that is waiting
on a mutex. This is the same as waiter->task.
@@ -179,53 +179,34 @@ again.
|
F->L5-+
+If process G has the highest priority in the chain, then all the tasks up
+the chain (A and B in this example), must have their priorities increased
+to that of G.
-Plist
------
-
-Before I go further and talk about how the PI chain is stored through lists
-on both mutexes and processes, I'll explain the plist. This is similar to
-the struct list_head functionality that is already in the kernel.
-The implementation of plist is out of scope for this document, but it is
-very important to understand what it does.
-
-There are a few differences between plist and list, the most important one
-being that plist is a priority sorted linked list. This means that the
-priorities of the plist are sorted, such that it takes O(1) to retrieve the
-highest priority item in the list. Obviously this is useful to store processes
-based on their priorities.
-
-Another difference, which is important for implementation, is that, unlike
-list, the head of the list is a different element than the nodes of a list.
-So the head of the list is declared as struct plist_head and nodes that will
-be added to the list are declared as struct plist_node.
-
-
-Mutex Waiter List
+Mutex Waiters Tree
-----------------
-Every mutex keeps track of all the waiters that are blocked on itself. The mutex
-has a plist to store these waiters by priority. This list is protected by
-a spin lock that is located in the struct of the mutex. This lock is called
-wait_lock. Since the modification of the waiter list is never done in
-interrupt context, the wait_lock can be taken without disabling interrupts.
+Every mutex keeps track of all the waiters that are blocked on itself. The
+mutex has a rbtree to store these waiters by priority. This tree is protected
+by a spin lock that is located in the struct of the mutex. This lock is called
+wait_lock.
-Task PI List
+Task PI Tree
------------
-To keep track of the PI chains, each process has its own PI list. This is
-a list of all top waiters of the mutexes that are owned by the process.
-Note that this list only holds the top waiters and not all waiters that are
+To keep track of the PI chains, each process has its own PI rbtree. This is
+a tree of all top waiters of the mutexes that are owned by the process.
+Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.
-The top of the task's PI list is always the highest priority task that
+The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
-at the top of this list.
+at the top of this tree.
-This list is stored in the task structure of a process as a plist called
-pi_list. This list is protected by a spin lock also in the task structure,
+This tree is stored in the task structure of a process as a rbtree called
+pi_waiters. It is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
@@ -312,15 +293,12 @@ Mutex owner and flags
The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
-have the task structure on at least a four byte alignment (and if this is
-not true, the rtmutex.c code will be broken!), this allows for the two
-least significant bits to be used as flags. This part is also described
-in Documentation/rt-mutex.txt, but will also be briefly described here.
-
-Bit 0 is used as the "Pending Owner" flag. This is described later.
-Bit 1 is used as the "Has Waiters" flags. This is also described later
- in more detail, but is set whenever there are waiters on a mutex.
+have the task structure on at least a two byte alignment (and if this is
+not true, the rtmutex.c code will be broken!), this allows for the least
+significant bit to be used as a flag. Bit 0 is used as the "Has Waiters"
+flag. It's set whenever there are waiters on a mutex.
+See Documentation/locking/rt-mutex.txt for further details.
cmpxchg Tricks
--------------
@@ -359,40 +337,31 @@ Priority adjustments
--------------------
The implementation of the PI code in rtmutex.c has several places that a
-process must adjust its priority. With the help of the pi_list of a
+process must adjust its priority. With the help of the pi_waiters of a
process this is rather easy to know what needs to be adjusted.
-The functions implementing the task adjustments are rt_mutex_adjust_prio,
-__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
-to already be taken), rt_mutex_getprio, and rt_mutex_setprio.
+The functions implementing the task adjustments are rt_mutex_adjust_prio
+and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.
-rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.
+rt_mutex_adjust_prio examines the priority of the task, and the highest
+priority process that is waiting any of mutexes owned by the task. Since
+the pi_waiters of a task holds an order by priority of all the top waiters
+of all the mutexes that the task owns, we simply need to compare the top
+pi waiter to its own normal/deadline priority and take the higher one.
+Then rt_mutex_setprio is called to adjust the priority of the task to the
+new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
+to implement the actual change in priority.
-rt_mutex_getprio returns the priority that the task should have. Either the
-task's own normal priority, or if a process of a higher priority is waiting on
-a mutex owned by the task, then that higher priority should be returned.
-Since the pi_list of a task holds an order by priority list of all the top
-waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
-to compare the top pi waiter to its own normal priority, and return the higher
-priority back.
+(Note: For the "prio" field in task_struct, the lower the number, the
+ higher the priority. A "prio" of 5 is of higher priority than a
+ "prio" of 10.)
-(Note: if looking at the code, you will notice that the lower number of
- prio is returned. This is because the prio field in the task structure
- is an inverse order of the actual priority. So a "prio" of 5 is
- of higher priority than a "prio" of 10.)
-
-__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
-result does not equal the task's current priority, then rt_mutex_setprio
-is called to adjust the priority of the task to the new priority.
-Note that rt_mutex_setprio is defined in kernel/sched/core.c to implement the
-actual change in priority.
-
-It is interesting to note that __rt_mutex_adjust_prio can either increase
+It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
-process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
+process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
-would decrease/unboost the priority of the task. That is because the pi_list
+would decrease/unboost the priority of the task. That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
@@ -412,9 +381,10 @@ priorities.
rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
-check for deadlocking, the mutex that the task owns, and a pointer to a waiter
+check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
-parameter may be NULL for deboosting).
+parameter may be NULL for deboosting), a pointer to the mutex on which the task
+is blocked, and a top_task as the top waiter of the mutex.
For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.
@@ -424,133 +394,14 @@ that the state of the owner and lock can change when entered into this function.
Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
-should be at, but the plist nodes of the task's waiter have not been updated
-with the new priorities, and that this task may not be in the proper locations
-in the pi_lists and wait_lists that the task is blocked on. This function
+should be at, but the rbtree nodes of the task's waiter have not been updated
+with the new priorities, and this task may not be in the proper locations
+in the pi_waiters and waiters trees that the task is blocked on. This function
solves all that.
-A loop is entered, where task is the owner to be checked for PI changes that
-was passed by parameter (for the first iteration). The pi_lock of this task is
-taken to prevent any more changes to the pi_list of the task. This also
-prevents new tasks from completing the blocking on a mutex that is owned by this
-task.
-
-If the task is not blocked on a mutex then the loop is exited. We are at
-the top of the PI chain.
-
-A check is now done to see if the original waiter (the process that is blocked
-on the current mutex) is the top pi waiter of the task. That is, is this
-waiter on the top of the task's pi_list. If it is not, it either means that
-there is another process higher in priority that is blocked on one of the
-mutexes that the task owns, or that the waiter has just woken up via a signal
-or timeout and has left the PI chain. In either case, the loop is exited, since
-we don't need to do any more changes to the priority of the current task, or any
-task that owns a mutex that this current task is waiting on. A priority chain
-walk is only needed when a new top pi waiter is made to a task.
-
-The next check sees if the task's waiter plist node has the priority equal to
-the priority the task is set at. If they are equal, then we are done with
-the loop. Remember that the function started with the priority of the
-task adjusted, but the plist nodes that hold the task in other processes
-pi_lists have not been adjusted.
-
-Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
-is taken. This is done by a spin_trylock, because the locking order of the
-pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
-lock, the pi_lock is released, and we restart the loop.
-
-Now that we have both the pi_lock of the task as well as the wait_lock of
-the mutex the task is blocked on, we update the task's waiter's plist node
-that is located on the mutex's wait_list.
-
-Now we release the pi_lock of the task.
-
-Next the owner of the mutex has its pi_lock taken, so we can update the
-task's entry in the owner's pi_list. If the task is the highest priority
-process on the mutex's wait_list, then we remove the previous top waiter
-from the owner's pi_list, and replace it with the task.
-
-Note: It is possible that the task was the current top waiter on the mutex,
- in which case the task is not yet on the pi_list of the waiter. This
- is OK, since plist_del does nothing if the plist node is not on any
- list.
-
-If the task was not the top waiter of the mutex, but it was before we
-did the priority updates, that means we are deboosting/lowering the
-task. In this case, the task is removed from the pi_list of the owner,
-and the new top waiter is added.
-
-Lastly, we unlock both the pi_lock of the task, as well as the mutex's
-wait_lock, and continue the loop again. On the next iteration of the
-loop, the previous owner of the mutex will be the task that will be
-processed.
-
-Note: One might think that the owner of this mutex might have changed
- since we just grab the mutex's wait_lock. And one could be right.
- The important thing to remember is that the owner could not have
- become the task that is being processed in the PI chain, since
- we have taken that task's pi_lock at the beginning of the loop.
- So as long as there is an owner of this mutex that is not the same
- process as the tasked being worked on, we are OK.
-
- Looking closely at the code, one might be confused. The check for the
- end of the PI chain is when the task isn't blocked on anything or the
- task's waiter structure "task" element is NULL. This check is
- protected only by the task's pi_lock. But the code to unlock the mutex
- sets the task's waiter structure "task" element to NULL with only
- the protection of the mutex's wait_lock, which was not taken yet.
- Isn't this a race condition if the task becomes the new owner?
-
- The answer is No! The trick is the spin_trylock of the mutex's
- wait_lock. If we fail that lock, we release the pi_lock of the
- task and continue the loop, doing the end of PI chain check again.
-
- In the code to release the lock, the wait_lock of the mutex is held
- the entire time, and it is not let go when we grab the pi_lock of the
- new owner of the mutex. So if the switch of a new owner were to happen
- after the check for end of the PI chain and the grabbing of the
- wait_lock, the unlocking code would spin on the new owner's pi_lock
- but never give up the wait_lock. So the PI chain loop is guaranteed to
- fail the spin_trylock on the wait_lock, release the pi_lock, and
- try again.
-
- If you don't quite understand the above, that's OK. You don't have to,
- unless you really want to make a proof out of it ;)
-
-
-Pending Owners and Lock stealing
---------------------------------
-
-One of the flags in the owner field of the mutex structure is "Pending Owner".
-What this means is that an owner was chosen by the process releasing the
-mutex, but that owner has yet to wake up and actually take the mutex.
-
-Why is this important? Why can't we just give the mutex to another process
-and be done with it?
-
-The PI code is to help with real-time processes, and to let the highest
-priority process run as long as possible with little latencies and delays.
-If a high priority process owns a mutex that a lower priority process is
-blocked on, when the mutex is released it would be given to the lower priority
-process. What if the higher priority process wants to take that mutex again.
-The high priority process would fail to take that mutex that it just gave up
-and it would need to boost the lower priority process to run with full
-latency of that critical section (since the low priority process just entered
-it).
-
-There's no reason a high priority process that gives up a mutex should be
-penalized if it tries to take that mutex again. If the new owner of the
-mutex has not woken up yet, there's no reason that the higher priority process
-could not take that mutex away.
-
-To solve this, we introduced Pending Ownership and Lock Stealing. When a
-new process is given a mutex that it was blocked on, it is only given
-pending ownership. This means that it's the new owner, unless a higher
-priority process comes in and tries to grab that mutex. If a higher priority
-process does come along and wants that mutex, we let the higher priority
-process "steal" the mutex from the pending owner (only if it is still pending)
-and continue with the mutex.
-
+The main operation of this function is summarized by Thomas Gleixner in
+rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
+details.
Taking of a mutex (The walk through)
------------------------------------
@@ -563,14 +414,14 @@ done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
-If there is contention on the lock, whether it is owned or pending owner
-we go about the slow path (rt_mutex_slowlock).
+If there is contention on the lock, we go about the slow path
+(rt_mutex_slowlock).
The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
-the task on the wait_list of the mutex, and if need be, the pi_list of
-the owner.
+the task on the waiters tree of the mutex, and if need be, the pi_waiters
+tree of the owner.
The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.
@@ -581,102 +432,45 @@ contention).
try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
-the "Has Waiters" flag of the mutex's owner field. Yes, this could really
-be false, because if the mutex has no owner, there are no waiters and
-the current task also won't have any waiters. But we don't have the lock
-yet, so we assume we are going to be a waiter. The reason for this is to
-play nice for those architectures that do have CMPXCHG. By setting this flag
-now, the owner of the mutex can't release the mutex without going into the
-slow unlock path, and it would then need to grab the wait_lock, which this
-code currently holds. So setting the "Has Waiters" flag forces the owner
-to synchronize with this code.
-
-Now that we know that we can't have any races with the owner releasing the
-mutex, we check to see if we can take the ownership. This is done if the
-mutex doesn't have a owner, or if we can steal the mutex from a pending
-owner. Let's look at the situations we have here.
-
- 1) Has owner that is pending
- ----------------------------
-
- The mutex has a owner, but it hasn't woken up and the mutex flag
- "Pending Owner" is set. The first check is to see if the owner isn't the
- current task. This is because this function is also used for the pending
- owner to grab the mutex. When a pending owner wakes up, it checks to see
- if it can take the mutex, and this is done if the owner is already set to
- itself. If so, we succeed and leave the function, clearing the "Pending
- Owner" bit.
-
- If the pending owner is not current, we check to see if the current priority is
- higher than the pending owner. If not, we fail the function and return.
-
- There's also something special about a pending owner. That is a pending owner
- is never blocked on a mutex. So there is no PI chain to worry about. It also
- means that if the mutex doesn't have any waiters, there's no accounting needed
- to update the pending owner's pi_list, since we only worry about processes
- blocked on the current mutex.
-
- If there are waiters on this mutex, and we just stole the ownership, we need
- to take the top waiter, remove it from the pi_list of the pending owner, and
- add it to the current pi_list. Note that at this moment, the pending owner
- is no longer on the list of waiters. This is fine, since the pending owner
- would add itself back when it realizes that it had the ownership stolen
- from itself. When the pending owner tries to grab the mutex, it will fail
- in try_to_take_rt_mutex if the owner field points to another process.
-
- 2) No owner
- -----------
-
- If there is no owner (or we successfully stole the lock), we set the owner
- of the mutex to current, and set the flag of "Has Waiters" if the current
- mutex actually has waiters, or we clear the flag if it doesn't. See, it was
- OK that we set that flag early, since now it is cleared.
-
- 3) Failed to grab ownership
- ---------------------------
-
- The most interesting case is when we fail to take ownership. This means that
- there exists an owner, or there's a pending owner with equal or higher
- priority than the current task.
-
-We'll continue on the failed case.
-
-If the mutex has a timeout, we set up a timer to go off to break us out
-of this mutex if we failed to get it after a specified amount of time.
-
-Now we enter a loop that will continue to try to take ownership of the mutex, or
-fail from a timeout or signal.
-
-Once again we try to take the mutex. This will usually fail the first time
-in the loop, since it had just failed to get the mutex. But the second time
-in the loop, this would likely succeed, since the task would likely be
-the pending owner.
-
-If the mutex is TASK_INTERRUPTIBLE a check for signals and timeout is done
-here.
-
-The waiter structure has a "task" field that points to the task that is blocked
-on the mutex. This field can be NULL the first time it goes through the loop
-or if the task is a pending owner and had its mutex stolen. If the "task"
-field is NULL then we need to set up the accounting for it.
+the "Has Waiters" flag of the mutex's owner field. By setting this flag
+now, the current owner of the mutex being contended for can't release the mutex
+without going into the slow unlock path, and it would then need to grab the
+wait_lock, which this code currently holds. So setting the "Has Waiters" flag
+forces the current owner to synchronize with this code.
+
+The lock is taken if the following are true:
+ 1) The lock has no owner
+ 2) The current task is the highest priority against all other
+ waiters of the lock
+
+If the task succeeds to acquire the lock, then the task is set as the
+owner of the lock, and if the lock still has waiters, the top_waiter
+(highest priority task waiting on the lock) is added to this task's
+pi_waiters tree.
+
+If the lock is not taken by try_to_take_rt_mutex(), then the
+task_blocks_on_rt_mutex() function is called. This will add the task to
+the lock's waiter tree and propagate the pi chain of the lock as well
+as the lock's owner's pi_waiters tree. This is described in the next
+section.
Task blocks on mutex
--------------------
The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
-to the mutex. The plist nodes are initialized to the processes current
-priority.
+to the mutex. The rbtree node of waiter are initialized to the processes
+current priority.
Since the wait_lock was taken at the entry of the slow lock, we can safely
-add the waiter to the wait_list. If the current process is the highest
-priority process currently waiting on this mutex, then we remove the
-previous top waiter process (if it exists) from the pi_list of the owner,
-and add the current process to that list. Since the pi_list of the owner
+add the waiter to the task waiter tree. If the current process is the
+highest priority process currently waiting on this mutex, then we remove the
+previous top waiter process (if it exists) from the pi_waiters of the owner,
+and add the current process to that tree. Since the pi_waiter of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.
-If the owner is also blocked on a lock, and had its pi_list changed
+If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
@@ -686,30 +480,23 @@ mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
Waking up in the loop
---------------------
-The schedule can then wake up for a few reasons.
- 1) we were given pending ownership of the mutex.
- 2) we received a signal and was TASK_INTERRUPTIBLE
- 3) we had a timeout and was TASK_INTERRUPTIBLE
+The task can then wake up for a couple of reasons:
+ 1) The previous lock owner released the lock, and the task now is top_waiter
+ 2) we received a signal or timeout
-In any of these cases, we continue the loop and once again try to grab the
-ownership of the mutex. If we succeed, we exit the loop, otherwise we continue
-and on signal and timeout, will exit the loop, or if we had the mutex stolen
-we just simply add ourselves back on the lists and go back to sleep.
+In both cases, the task will try again to acquire the lock. If it
+does, then it will take itself off the waiters tree and set itself back
+to the TASK_RUNNING state.
-Note: For various reasons, because of timeout and signals, the steal mutex
- algorithm needs to be careful. This is because the current process is
- still on the wait_list. And because of dynamic changing of priorities,
- especially on SCHED_OTHER tasks, the current process can be the
- highest priority task on the wait_list.
-
-Failed to get mutex on Timeout or Signal
-----------------------------------------
+In first case, if the lock was acquired by another task before this task
+could get the lock, then it will go back to sleep and wait to be woken again.
-If a timeout or signal occurred, the waiter's "task" field would not be
-NULL and the task needs to be taken off the wait_list of the mutex and perhaps
-pi_list of the owner. If this process was a high priority process, then
-the rt_mutex_adjust_prio_chain needs to be executed again on the owner,
-but this time it will be lowering the priorities.
+The second case is only applicable for tasks that are grabbing a mutex
+that can wake up before getting the lock, either due to a signal or
+a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to
+take the lock again, if it succeeds, then the task will return with the
+lock held, otherwise it will return with -EINTR if the task was woken
+by a signal, or -ETIMEDOUT if it timed out.
Unlocking the Mutex
@@ -739,25 +526,12 @@ owner still needs to make this check. If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.
-If there are waiters, then we need to wake one up and give that waiter
-pending ownership.
+If there are waiters, then we need to wake one up.
On the wake up code, the pi_lock of the current owner is taken. The top
-waiter of the lock is found and removed from the wait_list of the mutex
-as well as the pi_list of the current owner. The task field of the new
-pending owner's waiter structure is set to NULL, and the owner field of the
-mutex is set to the new owner with the "Pending Owner" bit set, as well
-as the "Has Waiters" bit if there still are other processes blocked on the
-mutex.
-
-The pi_lock of the previous owner is released, and the new pending owner's
-pi_lock is taken. Remember that this is the trick to prevent the race
-condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
-on the mutex.
-
-We now clear the "pi_blocked_on" field of the new pending owner, and if
-the mutex still has waiters pending, we add the new top waiter to the pi_list
-of the pending owner.
+waiter of the lock is found and removed from the waiters tree of the mutex
+as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
+marked to prevent lower priority tasks from stealing the lock.
Finally we unlock the pi_lock of the pending owner and wake it up.
@@ -772,10 +546,14 @@ Credits
-------
Author: Steven Rostedt <rostedt@goodmis.org>
+Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017
-Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap
+Original Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
+ Randy Dunlap
+Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
Updates
-------
This document was originally written for 2.6.17-rc3-mm1
+was updated on 4.12
diff --git a/Documentation/locking/rt-mutex.txt b/Documentation/locking/rt-mutex.txt
index 243393d882ee..35793e003041 100644
--- a/Documentation/locking/rt-mutex.txt
+++ b/Documentation/locking/rt-mutex.txt
@@ -28,14 +28,13 @@ magic bullet for poorly designed applications, but it allows
well-designed applications to use userspace locks in critical parts of
an high priority thread, without losing determinism.
-The enqueueing of the waiters into the rtmutex waiter list is done in
+The enqueueing of the waiters into the rtmutex waiter tree is done in
priority order. For same priorities FIFO order is chosen. For each
rtmutex, only the top priority waiter is enqueued into the owner's
-priority waiters list. This list too queues in priority order. Whenever
+priority waiters tree. This tree too queues in priority order. Whenever
the top priority waiter of a task changes (for example it timed out or
-got a signal), the priority of the owner task is readjusted. [The
-priority enqueueing is handled by "plists", see include/linux/plist.h
-for more details.]
+got a signal), the priority of the owner task is readjusted. The
+priority enqueueing is handled by "pi_waiters".
RT-mutexes are optimized for fastpath operations and have no internal
locking overhead when locking an uncontended mutex or unlocking a mutex
@@ -46,34 +45,29 @@ is used]
The state of the rt-mutex is tracked via the owner field of the rt-mutex
structure:
-rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1
-are used to keep track of the "owner is pending" and "rtmutex has
-waiters" state.
+lock->owner holds the task_struct pointer of the owner. Bit 0 is used to
+keep track of the "lock has waiters" state.
- owner bit1 bit0
- NULL 0 0 mutex is free (fast acquire possible)
- NULL 0 1 invalid state
- NULL 1 0 Transitional state*
- NULL 1 1 invalid state
- taskpointer 0 0 mutex is held (fast release possible)
- taskpointer 0 1 task is pending owner
- taskpointer 1 0 mutex is held and has waiters
- taskpointer 1 1 task is pending owner and mutex has waiters
+ owner bit0
+ NULL 0 lock is free (fast acquire possible)
+ NULL 1 lock is free and has waiters and the top waiter
+ is going to take the lock*
+ taskpointer 0 lock is held (fast release possible)
+ taskpointer 1 lock is held and has waiters**
-Pending-ownership handling is a performance optimization:
-pending-ownership is assigned to the first (highest priority) waiter of
-the mutex, when the mutex is released. The thread is woken up and once
-it starts executing it can acquire the mutex. Until the mutex is taken
-by it (bit 0 is cleared) a competing higher priority thread can "steal"
-the mutex which puts the woken up thread back on the waiters list.
+The fast atomic compare exchange based acquire and release is only
+possible when bit 0 of lock->owner is 0.
-The pending-ownership optimization is especially important for the
-uninterrupted workflow of high-prio tasks which repeatedly
-takes/releases locks that have lower-prio waiters. Without this
-optimization the higher-prio thread would ping-pong to the lower-prio
-task [because at unlock time we always assign a new owner].
+(*) It also can be a transitional state when grabbing the lock
+with ->wait_lock is held. To prevent any fast path cmpxchg to the lock,
+we need to set the bit0 before looking at the lock, and the owner may be
+NULL in this small time, hence this can be a transitional state.
-(*) The "mutex has waiters" bit gets set to take the lock. If the lock
-doesn't already have an owner, this bit is quickly cleared if there are
-no waiters. So this is a transitional state to synchronize with looking
-at the owner field of the mutex and the mutex owner releasing the lock.
+(**) There is a small time when bit 0 is set but there are no
+waiters. This can happen when grabbing the lock in the slow path.
+To prevent a cmpxchg of the owner releasing the lock, we need to
+set this bit before looking at the lock.
+
+BTW, there is still technically a "Pending Owner", it's just not called
+that anymore. The pending owner happens to be the top_waiter of a lock
+that has no owner and has been woken up to grab the lock.