aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2016-03-16 14:10:53 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2016-03-16 14:10:53 -0700
commit277edbabf6fece057b14fb6db5e3a34e00f42f42 (patch)
treed33314ae118cf387fa697643d10f1549ba4d6bfe /kernel
parentMerge branch 'akpm' (patches from Andrew) (diff)
parentMerge branch 'pm-tools' (diff)
downloadlinux-dev-277edbabf6fece057b14fb6db5e3a34e00f42f42.tar.xz
linux-dev-277edbabf6fece057b14fb6db5e3a34e00f42f42.zip
Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki: "This time the majority of changes go into cpufreq and they are significant. First off, the way CPU frequency updates are triggered is different now. Instead of having to set up and manage a deferrable timer for each CPU in the system to evaluate and possibly change its frequency periodically, cpufreq governors set up callbacks to be invoked by the scheduler on a regular basis (basically on utilization updates). The "old" governors, "ondemand" and "conservative", still do all of their work in process context (although that is triggered by the scheduler now), but intel_pstate does it all in the callback invoked by the scheduler with no need for any additional asynchronous processing. Of course, this eliminates the overhead related to the management of all those timers, but also it allows the cpufreq governor code to be simplified quite a bit. On top of that, the common code and data structures used by the "ondemand" and "conservative" governors are cleaned up and made more straightforward and some long-standing and quite annoying problems are addressed. In particular, the handling of governor sysfs attributes is modified and the related locking becomes more fine grained which allows some concurrency problems to be avoided (particularly deadlocks with the core cpufreq code). In principle, the new mechanism for triggering frequency updates allows utilization information to be passed from the scheduler to cpufreq. Although the current code doesn't make use of it, in the works is a new cpufreq governor that will make decisions based on the scheduler's utilization data. That should allow the scheduler and cpufreq to work more closely together in the long run. In addition to the core and governor changes, cpufreq drivers are updated too. Fixes and optimizations go into intel_pstate, the cpufreq-dt driver is updated on top of some modification in the Operating Performance Points (OPP) framework and there are fixes and other updates in the powernv cpufreq driver. Apart from the cpufreq updates there is some new ACPICA material, including a fix for a problem introduced by previous ACPICA updates, and some less significant changes in the ACPI code, like CPPC code optimizations, ACPI processor driver cleanups and support for loading ACPI tables from initrd. Also updated are the generic power domains framework, the Intel RAPL power capping driver and the turbostat utility and we have a bunch of traditional assorted fixes and cleanups. Specifics: - Redesign of cpufreq governors and the intel_pstate driver to make them use callbacks invoked by the scheduler to trigger CPU frequency evaluation instead of using per-CPU deferrable timers for that purpose (Rafael Wysocki). - Reorganization and cleanup of cpufreq governor code to make it more straightforward and fix some concurrency problems in it (Rafael Wysocki, Viresh Kumar). - Cleanup and improvements of locking in the cpufreq core (Viresh Kumar). - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh Kumar, Eric Biggers). - intel_pstate driver updates including fixes, optimizations and a modification to make it enable enable hardware-coordinated P-state selection (HWP) by default if supported by the processor (Philippe Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe Franciosi). - Operating Performance Points (OPP) framework updates to improve its handling of voltage regulators and device clocks and updates of the cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter). - Updates of the powernv cpufreq driver to fix initialization and cleanup problems in it and correct its worker thread handling with respect to CPU offline, new powernv_throttle tracepoint (Shilpasri Bhat). - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki). - ACPICA updates including one fix for a regression introduced by previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box, Colin Ian King). - Support for installing ACPI tables from initrd (Lv Zheng). - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin Chaugule). - Support for _HID(ACPI0010) devices (ACPI processor containers) and ACPI processor driver cleanups (Sudeep Holla). - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory, Aleksey Makarov). - Modification of the ACPI PCI IRQ management code to make it treat 255 in the Interrupt Line register as "not connected" on x86 (as per the specification) and avoid attempts to use that value as a valid interrupt vector (Chen Fan). - ACPI APEI fixes related to resource leaks (Josh Hunt). - Removal of modularity from a few ACPI drivers (BGRT, GHES, intel_pmic_crc) that cannot be built as modules in practice (Paul Gortmaker). - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS as a valid resource type (Harb Abdulhamid). - New device ID (future AMD I2C controller) in the ACPI driver for AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu). - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin). - cpuidle menu governor optimization to avoid a square root computation in it (Rasmus Villemoes). - Fix for potential use-after-free in the generic device properties framework (Heikki Krogerus). - Updates of the generic power domains (genpd) framework including support for multiple power states of a domain, fixes and debugfs output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart, Geert Uytterhoeven). - Intel RAPL power capping driver updates to reduce IPI overhead in it (Jacob Pan). - System suspend/hibernation code cleanups (Eric Biggers, Saurabh Sengar). - Year 2038 fix for the process freezer (Abhilash Jindal). - turbostat utility updates including new features (decoding of more registers and CPUID fields, sub-second intervals support, GFX MHz and RC6 printout, --out command line option), fixes (syscall jitter detection and workaround, reductioin of the number of syscalls made, fixes related to Xeon x200 processors, compiler warning fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)" * tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits) tools/power turbostat: bugfix: TDP MSRs print bits fixing tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump tools/power turbostat: call __cpuid() instead of __get_cpuid() tools/power turbostat: indicate SMX and SGX support tools/power turbostat: detect and work around syscall jitter tools/power turbostat: show GFX%rc6 tools/power turbostat: show GFXMHz tools/power turbostat: show IRQs per CPU tools/power turbostat: make fewer systems calls tools/power turbostat: fix compiler warnings tools/power turbostat: add --out option for saving output in a file tools/power turbostat: re-name "%Busy" field to "Busy%" tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding tools/power turbostat: Intel Xeon x200: fix erroneous bclk value tools/power turbostat: allow sub-sec intervals ACPI / APEI: ERST: Fixed leaked resources in erst_init ACPI / APEI: Fix leaked resources intel_pstate: Do not skip samples partially intel_pstate: Remove freq calculation from intel_pstate_calc_busy() intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance() ...
Diffstat (limited to 'kernel')
-rw-r--r--kernel/irq/manage.c9
-rw-r--r--kernel/power/process.c12
-rw-r--r--kernel/power/suspend.c6
-rw-r--r--kernel/sched/Makefile1
-rw-r--r--kernel/sched/cpufreq.c37
-rw-r--r--kernel/sched/deadline.c4
-rw-r--r--kernel/sched/fair.c26
-rw-r--r--kernel/sched/rt.c4
-rw-r--r--kernel/sched/sched.h48
-rw-r--r--kernel/trace/power-traces.c1
10 files changed, 136 insertions, 12 deletions
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3ddd2297ee95..64731e84c982 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1607,6 +1607,9 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
struct irq_desc *desc;
int retval;
+ if (irq == IRQ_NOTCONNECTED)
+ return -ENOTCONN;
+
/*
* Sanity-check: shared interrupts must pass in a real dev-ID,
* otherwise we'll have trouble later trying to figure out
@@ -1697,9 +1700,13 @@ EXPORT_SYMBOL(request_threaded_irq);
int request_any_context_irq(unsigned int irq, irq_handler_t handler,
unsigned long flags, const char *name, void *dev_id)
{
- struct irq_desc *desc = irq_to_desc(irq);
+ struct irq_desc *desc;
int ret;
+ if (irq == IRQ_NOTCONNECTED)
+ return -ENOTCONN;
+
+ desc = irq_to_desc(irq);
if (!desc)
return -EINVAL;
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 564f786df470..df058bed53ce 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -30,13 +30,12 @@ static int try_to_freeze_tasks(bool user_only)
unsigned long end_time;
unsigned int todo;
bool wq_busy = false;
- struct timeval start, end;
- u64 elapsed_msecs64;
+ ktime_t start, end, elapsed;
unsigned int elapsed_msecs;
bool wakeup = false;
int sleep_usecs = USEC_PER_MSEC;
- do_gettimeofday(&start);
+ start = ktime_get_boottime();
end_time = jiffies + msecs_to_jiffies(freeze_timeout_msecs);
@@ -78,10 +77,9 @@ static int try_to_freeze_tasks(bool user_only)
sleep_usecs *= 2;
}
- do_gettimeofday(&end);
- elapsed_msecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
- do_div(elapsed_msecs64, NSEC_PER_MSEC);
- elapsed_msecs = elapsed_msecs64;
+ end = ktime_get_boottime();
+ elapsed = ktime_sub(end, start);
+ elapsed_msecs = ktime_to_ms(elapsed);
if (todo) {
pr_cont("\n");
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index f9fe133c13e2..230a77225e2e 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -248,7 +248,7 @@ static int suspend_test(int level)
{
#ifdef CONFIG_PM_DEBUG
if (pm_test_level == level) {
- printk(KERN_INFO "suspend debug: Waiting for %d second(s).\n",
+ pr_info("suspend debug: Waiting for %d second(s).\n",
pm_test_delay);
mdelay(pm_test_delay * 1000);
return 1;
@@ -320,7 +320,7 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
error = dpm_suspend_late(PMSG_SUSPEND);
if (error) {
- printk(KERN_ERR "PM: late suspend of devices failed\n");
+ pr_err("PM: late suspend of devices failed\n");
goto Platform_finish;
}
error = platform_suspend_prepare_late(state);
@@ -329,7 +329,7 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
error = dpm_suspend_noirq(PMSG_SUSPEND);
if (error) {
- printk(KERN_ERR "PM: noirq suspend of devices failed\n");
+ pr_err("PM: noirq suspend of devices failed\n");
goto Platform_early_resume;
}
error = platform_suspend_prepare_noirq(state);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7d4cba227cbd..302d6ebd64f7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ) += cpufreq.o
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
new file mode 100644
index 000000000000..928c4ba32f68
--- /dev/null
+++ b/kernel/sched/cpufreq.c
@@ -0,0 +1,37 @@
+/*
+ * Scheduler code and data structures related to cpufreq.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "sched.h"
+
+DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU. That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util(). That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+ if (WARN_ON(data && !data->func))
+ return;
+
+ rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c7a036facbe1..affd97ec9f65 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -717,6 +717,10 @@ static void update_curr_dl(struct rq *rq)
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
+ /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ if (cpu_of(rq) == smp_processor_id())
+ cpufreq_trigger_update(rq_clock(rq));
+
/*
* Consumed budget is computed considering the time as
* observed by schedulable tasks (excluding time spent
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33130529e9b5..46d64e4ccfde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2856,7 +2856,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
u64 now = cfs_rq_clock_task(cfs_rq);
- int cpu = cpu_of(rq_of(cfs_rq));
+ struct rq *rq = rq_of(cfs_rq);
+ int cpu = cpu_of(rq);
/*
* Track task load average for carrying it to new CPU after migrated, and
@@ -2868,6 +2869,29 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
update_tg_load_avg(cfs_rq, 0);
+
+ if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+ unsigned long max = rq->cpu_capacity_orig;
+
+ /*
+ * There are a few boundary cases this might miss but it should
+ * get called often enough that that should (hopefully) not be
+ * a real problem -- added to that it only calls on the local
+ * CPU, so if we enqueue remotely we'll miss an update, but
+ * the next tick/schedule should update.
+ *
+ * It will not get called when we go idle, because the idle
+ * thread is a different class (!fair), nor will the utilization
+ * number include things like RT tasks.
+ *
+ * As is, the util number is not freq-invariant (we'd have to
+ * implement arch_scale_freq_capacity() for that).
+ *
+ * See cpu_util().
+ */
+ cpufreq_update_util(rq_clock(rq),
+ min(cfs_rq->avg.util_avg, max), max);
+ }
}
static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 562471329487..c41ea7ac1764 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -953,6 +953,10 @@ static void update_curr_rt(struct rq *rq)
if (curr->sched_class != &rt_sched_class)
return;
+ /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+ if (cpu_of(rq) == smp_processor_id())
+ cpufreq_trigger_update(rq_clock(rq));
+
delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2ff5a2bd6df..382848a24ed9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1793,3 +1793,51 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_64BIT */
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_CPU_FREQ
+DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
+ */
+static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+ struct update_util_data *data;
+
+ data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+ if (data)
+ data->func(data, time, util, max);
+}
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis to prevent it from
+ * being stuck in a completely inadequate performance level for too long.
+ * That is not guaranteed to happen if the updates are only triggered from CFS,
+ * though, because they may not be coming in if RT or deadline tasks are active
+ * all the time (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid. Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+static inline void cpufreq_trigger_update(u64 time)
+{
+ cpufreq_update_util(time, ULONG_MAX, 0);
+}
+#else
+static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+#endif /* CONFIG_CPU_FREQ */
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index eb4220a132ec..81b87451c0ea 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -15,4 +15,5 @@
EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
+EXPORT_TRACEPOINT_SYMBOL_GPL(powernv_throttle);