<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-dev/arch/powerpc/kernel/watchdog.c, branch linus/master</title>
<subtitle>Linux kernel development work - see feature branches</subtitle>
<id>https://git.zx2c4.com/linux-dev/atom/arch/powerpc/kernel/watchdog.c?h=linus%2Fmaster</id>
<link rel='self' href='https://git.zx2c4.com/linux-dev/atom/arch/powerpc/kernel/watchdog.c?h=linus%2Fmaster'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/'/>
<updated>2022-05-05T12:12:44Z</updated>
<entry>
<title>powerpc: fix typos in comments</title>
<updated>2022-05-05T12:12:44Z</updated>
<author>
<name>Julia Lawall</name>
<email>Julia.Lawall@inria.fr</email>
</author>
<published>2022-04-30T18:56:54Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=1fd02f6605b855b4af2883f29a2abc88bdf17857'/>
<id>urn:sha1:1fd02f6605b855b4af2883f29a2abc88bdf17857</id>
<content type='text'>
Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall &lt;Julia.Lawall@inria.fr&gt;
Reviewed-by: Joel Stanley &lt;joel@jms.id.au&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr

</content>
</entry>
<entry>
<title>powerpc/watchdog: help remote CPUs to flush NMI printk output</title>
<updated>2021-11-29T12:08:43Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-19T11:31:46Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=e012c499985c608c936410d8bab29d9596d62859'/>
<id>urn:sha1:e012c499985c608c936410d8bab29d9596d62859</id>
<content type='text'>
The printk layer at the moment does not seem to have a good way to force
flush printk messages that are created in NMI context, except in the
panic path.

NMI-context printk messages normally get to the console with irq_work,
but that won't help if the CPU is stuck with irqs disabled, as can be
the case for hard lockup watchdog messages.

The watchdog currently flushes the printk buffers after detecting a
lockup on remote CPUs, but they may not have processed their NMI IPI
yet by that stage, or they may have self-detected a lockup in which
case they won't go via this NMI IPI path.

Improve the situation by having NMI-context mark a flag if it called
printk, and have watchdog timer interrupts check if that flag was set
and try to flush if it was. Latency is not a big problem because we
were already stuck for a while, just need to try to make sure the
messages eventually make it out.

Depends-on: 5d5e4522a7f4 ("printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211119113146.752759-6-npiggin@gmail.com
</content>
</entry>
<entry>
<title>powerpc/watchdog: Fix wd_smp_last_reset_tb reporting</title>
<updated>2021-11-25T10:48:39Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-25T10:33:46Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=3d030e301856da366380b3865fce6c03037b08a6'/>
<id>urn:sha1:3d030e301856da366380b3865fce6c03037b08a6</id>
<content type='text'>
wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of
marking CPUs stuck and removing them from the pending mask before it
begins any printing. This causes last reset times reported to be off.

Fix this by reading it into a local variable before it gets reset.

Fixes: 76521c4b0291 ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/watchdog: read TB close to where it is used</title>
<updated>2021-11-25T00:25:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:56Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=1f01bf90765fa5f88fbae452c131c1edf5cda7ba'/>
<id>urn:sha1:1f01bf90765fa5f88fbae452c131c1edf5cda7ba</id>
<content type='text'>
When taking watchdog actions, printing messages, comparing and
re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
and under wd_smp_lock or printing lock (if applicable).

This should keep timebase mostly monotonic with kernel log messages, and
could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
something a long way in the past, and causing other CPUs to appear to be
stuck.

These additional TB reads are all slowpath (lockup has been detected),
so performance does not matter.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi</title>
<updated>2021-11-25T00:25:33Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:55Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=76521c4b0291ad25723638ade5a0ff4d5f659771'/>
<id>urn:sha1:76521c4b0291ad25723638ade5a0ff4d5f659771</id>
<content type='text'>
There is a deadlock with the console_owner lock and the wd_smp_lock:

CPU x takes the console_owner lock
CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
CPU y detects deadlock, tries to print something and spins on console_owner
-&gt; deadlock

Change the watchdog locking scheme so wd_smp_lock protects the watchdog
internal data, but "reporting" (printing, issuing NMI IPIs, taking any
action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
a problem but can not take the reporting lock, it just returns because
something else is already reporting. It will try again at some point.

Typically hard lockup watchdog report usefulness is not impacted due to
failure to spewing a large enough amount of data in as short a time as
possible, but by messages getting garbled.

Laurent debugged this and found the deadlock, and this patch is based on
his general approach to avoid expensive operations while holding the lock.
With the addition of the reporting exclusion.

Signed-off-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
[np: rework to add reporting exclusion update changelog]
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/watchdog: tighten non-atomic read-modify-write access</title>
<updated>2021-11-25T00:25:33Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:54Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=858c93c31504ac1507084493d7eafbe7e2302dc2'/>
<id>urn:sha1:858c93c31504ac1507084493d7eafbe7e2302dc2</id>
<content type='text'>
Most updates to wd_smp_cpus_pending are under lock except the watchdog
interrupt bit clear.

This can race with non-atomic RMW updates to the mask under lock, which
can happen in two instances:

Firstly, if another CPU detects this one is stuck, removes it from the
mask, mask becomes empty and is re-filled with non-atomic stores. This
is okay because it would re-fill the mask with this CPU's bit clear
anyway (because this CPU is now stuck), so it doesn't matter that the
bit clear update got "lost". Add a comment for this.

Secondly, if another CPU detects a different CPU is stuck and removes it
from the pending mask with a non-atomic store to bytes which also
include the bit of this CPU. This case can result in the bit clear being
lost and the end result being the bit is set. This should be so rare it
hardly matters, but to make things simpler to reason about just avoid
the non-atomic access for that case.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/watchdog: Fix missed watchdog reset due to memory ordering race</title>
<updated>2021-11-25T00:25:33Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:53Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=5dad4ba68a2483fc80d70b9dc90bbe16e1f27263'/>
<id>urn:sha1:5dad4ba68a2483fc80d70b9dc90bbe16e1f27263</id>
<content type='text'>
It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.

Order the cpumask clear vs the subsequent test to close this race.

Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Debugged-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com

</content>
</entry>
<entry>
<title>Merge branch 'rework/printk_safe-removal' into for-linus</title>
<updated>2021-11-18T09:03:47Z</updated>
<author>
<name>Petr Mladek</name>
<email>pmladek@suse.com</email>
</author>
<published>2021-11-18T09:03:47Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=bf6d0d1e1ab38309ea2a234e2e4ba2a18d014af9'/>
<id>urn:sha1:bf6d0d1e1ab38309ea2a234e2e4ba2a18d014af9</id>
<content type='text'>
</content>
</entry>
<entry>
<title>printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces</title>
<updated>2021-11-10T15:12:00Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-07T04:51:16Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=5d5e4522a7f404d1a96fd6c703989d32a9c9568d'/>
<id>urn:sha1:5d5e4522a7f404d1a96fd6c703989d32a9c9568d</id>
<content type='text'>
printk from NMI context relies on irq work being raised on the local CPU
to print to console. This can be a problem if the NMI was raised by a
lockup detector to print lockup stack and regs, because the CPU may not
enable irqs (because it is locked up).

Introduce printk_trigger_flush() that can be called another CPU to try
to get those messages to the console, call that where printk_safe_flush
was previously called.

Fixes: 93d102f094be ("printk: remove safe buffers")
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Petr Mladek &lt;pmladek@suse.com&gt;
Reviewed-by: John Ogness &lt;john.ogness@linutronix.de&gt;
Signed-off-by: Petr Mladek &lt;pmladek@suse.com&gt;
Link: https://lore.kernel.org/r/20211107045116.1754411-1-npiggin@gmail.com
</content>
</entry>
<entry>
<title>Merge branch 'rework/printk_safe-removal' into for-linus</title>
<updated>2021-08-30T14:36:10Z</updated>
<author>
<name>Petr Mladek</name>
<email>pmladek@suse.com</email>
</author>
<published>2021-08-30T14:36:10Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=c985aafb60e972c0a6b8d0bd65e03af5890b748a'/>
<id>urn:sha1:c985aafb60e972c0a6b8d0bd65e03af5890b748a</id>
<content type='text'>
</content>
</entry>
</feed>
