x86/mce/therm_throt: Optimize notifications of thermal throttle - linux-dev - Linux kernel development work

diff options

author	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2019-11-11 13:43:12 -0800
committer	Borislav Petkov <bp@suse.de>	2019-11-12 15:56:04 +0100
commit	f6656208f04e5b3804054008eba4bf7170f4c841 (patch)
tree	a4cf34449bebe8b5680e89a6a22f007f6b9e75d7 /arch/mips/fw/arc/misc.c
parent	x86/mce: Add Xeon Icelake to list of CPUs that support PPIN (diff)
download	linux-dev-f6656208f04e5b3804054008eba4bf7170f4c841.tar.xz linux-dev-f6656208f04e5b3804054008eba4bf7170f4c841.zip

x86/mce/therm_throt: Optimize notifications of thermal throttle

Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range. Thus users may see a "critical" message about the "temperature above threshold" which is soon followed by "temperature/speed normal". These messages are rate-limited, but still may repeat every few minutes. This issue became worse starting with the Ivy Bridge generation of CPUs because they include a TCC activation offset in the MSR IA32_TEMPERATURE_TARGET. OEMs use this to provide alerts long before critical temperatures are reached. A test run on a laptop with Intel 8th Gen i5 core for two hours with a workload resulted in 20K+ thermal interrupts per CPU for core level and another 20K+ interrupts at package level. The kernel logs were full of throttling messages. The real value of these threshold interrupts, is to debug problems with the external cooling solutions and performance issues due to excessive throttling. So the solution here is the following: - In the current thermal_throttle folder, show: - the maximum time for one throttling event and, - the total amount of time the system was in throttling state. - Do not log short excursions. - Log only when, in spite of thermal throttling, the temperature is rising. On the high threshold interrupt trigger a delayed workqueue that monitors the threshold violation log bit (THERM_STATUS_PROCHOT_LOG). When the log bit is set, this workqueue callback calculates three point moving average and logs a warning message when the temperature trend is rising. When this log bit is clear and temperature is below threshold temperature, then the workqueue callback logs a "Normal" message. Once a high threshold event is logged, the logging is rate-limited. With this patch on the same test laptop, no warnings are printed in the logs as the max time the processor could bring the temperature under control is only 280 ms. This implementation is done with the inputs from Alan Cox and Tony Luck. [ bp: Touchups. ] Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: bberg@redhat.com Cc: ckellner@redhat.com Cc: hdegoede@redhat.com Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191111214312.81365-1-srinivas.pandruvada@linux.intel.com

Diffstat (limited to 'arch/mips/fw/arc/misc.c')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: