aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/cpuidle/cpuidle.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-09-28Merge back cpuidle material for 5.10.Rafael J. Wysocki1-10/+0
2020-09-23cpuidle: record state entry rejection statisticsLina Iyer1-0/+1
CPUs may fail to enter the chosen idle state if there was a pending interrupt, causing the cpuidle driver to return an error value. Record that and export it via sysfs along with the other idle state statistics. This could prove useful in understanding behavior of the governor and the system during usecases that involve multiple CPUs. Signed-off-by: Lina Iyer <ilina@codeaurora.org> [ rjw: Changelog and documentation edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-22cpuidle: Drop misleading comments about RCU usageUlf Hansson1-10/+0
The commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path"), moved the calls rcu_idle_enter|exit() into the cpuidle core. However, it forgot to remove a couple of comments in enter_s2idle_proper() about why RCU_NONIDLE earlier was needed. So, let's drop them as they have become a bit misleading. Fixes: 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-16cpuidle: Allow cpuidle drivers to take over RCU-idlePeter Zijlstra1-5/+10
Some drivers have to do significant work, some of which relies on RCU still being active. Instead of using RCU_NONIDLE in the drivers and flipping RCU back on, allow drivers to take over RCU-idle duty. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-08-26cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED genericPeter Zijlstra1-0/+4
This allows moving the leave_mm() call into generic code before rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
2020-08-26sched,idle,rcu: Push rcu_idle deeper into the idle pathPeter Zijlstra1-4/+8
Lots of things take locks, due to a wee bug, rcu_lockdep didn't notice that the locking tracepoints were using RCU. Push rcu_idle_{enter,exit}() as deep as possible into the idle paths, this also resolves a lot of _rcuidle()/RCU_NONIDLE() usage. Specifically, sched_clock_idle_wakeup_event() will use ktime which will use seqlocks which will tickle lockdep, and stop_critical_timings() uses lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.310943801@infradead.org
2020-08-26cpuidle: Fixup IRQ statePeter Zijlstra1-1/+2
Match the pattern elsewhere in this file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.251340558@infradead.org
2020-06-25cpuidle: Rearrange s2idle-specific idle state entry codeRafael J. Wysocki1-3/+3
Implement call_cpuidle_s2idle() in analogy with call_cpuidle() for the s2idle-specific idle state entry and invoke it from cpuidle_idle_call() to make the s2idle-specific idle entry code path look more similar to the "regular" idle entry one. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Chen Yu <yu.c.chen@intel.com>
2020-06-23PM: s2idle: Clear _TIF_POLLING_NRFLAG before suspend to idleChen Yu1-1/+2
Suspend to idle was found to not work on Goldmont CPU recently. The issue happens due to: 1. On Goldmont the CPU in idle can only be woken up via IPIs, not POLLING mode, due to commit 08e237fa56a1 ("x86/cpu: Add workaround for MONITOR instruction erratum on Goldmont based CPUs") 2. When the CPU is entering suspend to idle process, the _TIF_POLLING_NRFLAG remains on, because cpuidle_enter_s2idle() doesn't match call_cpuidle() exactly. 3. Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") makes use of _TIF_POLLING_NRFLAG to avoid sending IPIs to idle CPUs. 4. As a result, some IPIs related functions might not work well during suspend to idle on Goldmont. For example, one suspected victim: tick_unfreeze() -> timekeeping_resume() -> hrtimers_resume() -> clock_was_set() -> on_each_cpu() might wait forever, because the IPIs will not be sent to the CPUs which are sleeping with _TIF_POLLING_NRFLAG set, and Goldmont CPU could not be woken up by only setting _TIF_NEED_RESCHED on the monitor address. To avoid that, clear the _TIF_POLLING_NRFLAG flag before invoking enter_s2idle_proper() in cpuidle_enter_s2idle() in analogy with the call_cpuidle() code flow. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> [ rjw: Subject / changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-13PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY notifier chainRafael J. Wysocki1-39/+1
Notice that pm_qos_remove_notifier() is not used at all and the only caller of pm_qos_add_notifier() is the cpuidle core, which only needs the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus() upon changes of the PM_QOS_CPU_DMA_LATENCY target value. First, to ensure that wake_up_all_idle_cpus() will be called whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the pm_qos_add/update/remove_request() family of functions to check if the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed and call wake_up_all_idle_cpus() directly in that case. Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is not necessary any more. Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(), as they have no callers now, along with cpu_dma_lat_notifier which is only used by them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-01-23Merge branch 'intel_idle+acpi'Rafael J. Wysocki1-1/+5
Merge changes updating the ACPI processor driver in order to export acpi_processor_evaluate_cst() to the code outside of it and adding ACPI support to the intel_idle driver based on that. * intel_idle+acpi: Documentation: admin-guide: PM: Add intel_idle document intel_idle: Use ACPI _CST on server systems intel_idle: Add module parameter to prevent ACPI _CST from being used intel_idle: Allow ACPI _CST to be used for selected known processors cpuidle: Allow idle states to be disabled by default intel_idle: Use ACPI _CST for processor models without C-state tables intel_idle: Refactor intel_idle_cpuidle_driver_init() ACPI: processor: Export acpi_processor_evaluate_cst() ACPI: processor: Make ACPI_PROCESSOR_CSTATE depend on ACPI_PROCESSOR ACPI: processor: Clean up acpi_processor_evaluate_cst() ACPI: processor: Introduce acpi_processor_evaluate_cst() ACPI: processor: Export function to claim _CST control
2020-01-23cpuidle: fix cpuidle_find_deepest_state() kerneldoc warningsBenjamin Gaignard1-0/+3
Fix cpuidle_find_deepest_state() kernel documentation to avoid warnings when compiling with W=1. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-27cpuidle: Allow idle states to be disabled by defaultRafael J. Wysocki1-1/+5
In certain situations it may be useful to prevent some idle states from being used by default while allowing user space to enable them later on. For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to mark idle states that should be disabled by default, make the core set CPUIDLE_STATE_DISABLED_BY_USER for those states at the initialization time and add a new state attribute in sysfs, "default_status", to inform user space of the initial status of the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it, "enabled" otherwise). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-12cpuidle: Drop unnecessary type cast in cpuidle_poll_time()Rafael J. Wysocki1-1/+1
The data type of the target_residency_ns field in struct cpuidle_state is u64, so it does not need to be cast into u64. Get rid of the unnecessary type cast. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-09cpuidle: use first valid target residency as poll timeMarcelo Tosatti1-0/+1
Commit 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure") changed, by mistake, the target residency from the first available sleep state to the last available sleep state (which should be longer). This might cause excessive polling. Fixes: 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure") Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: 5.4+ <stable@vger.kernel.org> # 5.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-29cpuidle: Drop disabled field from struct cpuidle_stateRafael J. Wysocki1-1/+1
After recent cpuidle updates the "disabled" field in struct cpuidle_state is only used by two drivers (intel_idle and shmobile cpuidle) for marking unusable idle states, but that may as well be achieved with the help of a state flag, so define an "unusable" idle state flag, CPUIDLE_FLAG_UNUSABLE, make the drivers in question use it instead of the "disabled" field and make the core set CPUIDLE_STATE_DISABLED_BY_DRIVER for the idle states with that flag set. After the above changes, the "disabled" field in struct cpuidle_state is not used any more, so drop it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()Daniel Lezcano1-2/+3
Modify cpuidle_use_deepest_state() to take an additional exit latency limit argument to be passed to find_deepest_idle_state() and make cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for forced idle. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Rebase and rearrange code, subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20cpuidle: Allow idle injection to apply exit latency limitDaniel Lezcano1-6/+7
In some cases it may be useful to specify an exit latency limit for the idle state to be used during CPU idle time injection. Instead of duplicating the information in struct cpuidle_device or propagating the latency limit in the call stack, replace the use_deepest_state field with forced_latency_limit_ns to represent that limit, so that the deepest idle state with exit latency within that limit is forced (i.e. no governors) when it is set. A zero exit latency limit for forced idle means to use governors in the usual way (analogous to use_deepest_state equal to "false" before this change). Additionally, add play_idle_precise() taking two arguments, the duration of forced idle and the idle state exit latency limit, both in nanoseconds, and redefine play_idle() as a wrapper around that new function. This change is preparatory, no functional impact is expected. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-11cpuidle: Use nanoseconds as the unit of timeRafael J. Wysocki1-19/+17
Currently, the cpuidle subsystem uses microseconds as the unit of time which (among other things) causes the idle loop to incur some integer division overhead for no clear benefit. In order to allow cpuidle to measure time in nanoseconds, add two new fields, exit_latency_ns and target_residency_ns, to represent the exit latency and target residency of an idle state in nanoseconds, respectively, to struct cpuidle_state and initialize them with the help of the corresponding values in microseconds provided by drivers. Additionally, change cpuidle_governor_latency_req() to return the idle state exit latency constraint in nanoseconds. Also meeasure idle state residency (last_residency_ns in struct cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds and update the cpuidle core and governors accordingly. However, the menu governor still computes typical intervals in microseconds to avoid integer overflows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-06cpuidle: Consolidate disabled state checksRafael J. Wysocki1-11/+13
There are two reasons why CPU idle states may be disabled: either because the driver has disabled them or because they have been disabled by user space via sysfs. In the former case, the state's "disabled" flag is set once during the initialization of the driver and it is never cleared later (it is read-only effectively). In the latter case, the "disable" field of the given state's cpuidle_state_usage struct is set and it may be changed via sysfs. Thus checking whether or not an idle state has been disabled involves reading these two flags every time. In order to avoid the additional check of the state's "disabled" flag (which is effectively read-only anyway), use the value of it at the init time to set a (new) flag in the "disable" field of that state's cpuidle_state_usage structure and use the sysfs interface to manipulate another (new) flag in it. This way the state is disabled whenever the "disable" field of its cpuidle_state_usage structure is nonzero, whatever the reason, and it is the only place to look into to check whether or not the state has been disabled. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-07-30cpuidle: add poll_limit_ns to cpuidle_device structureMarcelo Tosatti1-0/+30
Add a poll_limit_ns variable to cpuidle_device structure. Calculate and configure it in the new cpuidle_poll_time function, in case its zero. Individual governors are allowed to override this value. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-04-10cpuidle: Export the next timer expiration for CPUsUlf Hansson1-2/+17
To be able to predict the sleep duration for a CPU entering idle, it is essential to know the expiration time of the next timer. Both the teo and the menu cpuidle governors already use this information for CPU idle state selection. Moving forward, a similar prediction needs to be made for a group of idle CPUs rather than for a single one and the following changes implement a new genpd governor for that purpose. In order to support that feature, add a new function called tick_nohz_get_next_hrtimer() that will return the next hrtimer expiration time of a given CPU to be invoked after deciding whether or not to stop the scheduler tick on that CPU. Make the cpuidle core call tick_nohz_get_next_hrtimer() right before invoking the ->enter() callback provided by the cpuidle driver for the given state and store its return value in the per-CPU struct cpuidle_device, so as to make it available to code outside of cpuidle. Note that at the point when cpuidle calls tick_nohz_get_next_hrtimer(), the governor's ->select() callback has already returned and indicated whether or not the tick should be stopped, so in fact the value returned by tick_nohz_get_next_hrtimer() always is the next hrtimer expiration time for the given CPU, possibly including the tick (if it hasn't been stopped). Co-developed-by: Lina Iyer <lina.iyer@linaro.org> Co-developed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-12-12cpuidle: Add 'above' and 'below' idle state metricsRafael J. Wysocki1-1/+30
Add two new metrics for CPU idle states, "above" and "below", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too short or too long for it (respectively). These metrics help to estimate the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-12-11cpuidle: Add cpuidle.governor= command line parameterRafael J. Wysocki1-0/+1
Add cpuidle.governor= command line parameter to allow the default cpuidle governor to be replaced. That is useful, for example, if someone running a tickful kernel wants to use the menu governor on it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-09-18cpuidle: enter_state: Don't needlessly calculate diff timeFieah Lim1-8/+8
Currently, ktime_us_delta() is invoked unconditionally to compute the idle residency of the CPU, but it only makes sense to do that if a valid idle state has been entered, so move the ktime_us_delta() invocation after the entered_state >= 0 check. While at it, merge two comment blocks in there into one and drop a space between type casting of diff. This patch has no functional changes. Signed-off-by: Fieah Lim <kw@fieahl.im> [ rjw: Changelog cleanup, comment format fix ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-06cpuidle: Return nohz hint from cpuidle_select()Rafael J. Wysocki1-2/+8
Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there. Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false" through it if: (1) the idle exit latency is constrained at 0, or (2) the selected state is a polling one, or (3) the expected idle period duration is within the tick period range. In addition to that, the correction factor computations in the menu governor need to take the possibility that the tick may not be stopped into account to avoid artificially small correction factor values. To that end, add a mechanism to record tick wakeups, as suggested by Peter Zijlstra, and use it to modify the menu_update() behavior when tick wakeup occurs. Namely, if the CPU is woken up by the tick and the return value of tick_nohz_get_sleep_length() is not within the tick boundary, the predicted idle duration is likely too short, so make menu_update() try to compensate for that by updating the governor statistics as though the CPU was idle for a long time. Since the value returned through the new argument pointer of cpuidle_select() is not used by its caller yet, this change by itself is not expected to alter the functionality of the code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-03-29PM: cpuidle/suspend: Add s2idle usage and time state attributesRafael J. Wysocki1-0/+9
Add a new attribute group called "s2idle" under the sysfs directory of each cpuidle state that supports the ->enter_s2idle callback and put two new attributes, "usage" and "time", into that group to represent the number of times the given state was requested for suspend-to-idle and the total time spent in suspend-to-idle after requesting that state, respectively. That will allow diagnostic information related to suspend-to-idle to be collected without enabling advanced debug features and analyzing dmesg output. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08cpuidle: Avoid assignment in if () argumentGaurav Jindal1-3/+5
Clean up cpuidle_enable_device() to avoid doing an assignment in an expression evaluated as an argument of if (), which also makes the code in question more readable. Signed-off-by: Gaurav Jindal <gauravjindal1104@gmail.com> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08cpuidle: Clean up cpuidle_enable_device() error handling a bitGaurav Jindal1-1/+4
Do not fetch per CPU drv if cpuidle_curr_governor is NULL to avoid useless per CPU processing. Signed-off-by: Gaurav Jindal <gauravjindal1104@gmail.com> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-09-28cpuidle: fix broadcast control when broadcast can not be enteredNicholas Piggin1-0/+1
When failing to enter broadcast timer mode for an idle state that requires it, a new state is selected that does not require broadcast, but the broadcast variable remains set. This causes tick_broadcast_exit to be called despite not having entered broadcast mode. This causes the WARN_ON_ONCE(!irqs_disabled()) to trigger in some cases. It does not appear to cause problems for code today, but seems to violate the interface so should be fixed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-11PM / s2idle: Rename ->enter_freeze to ->enter_s2idleRafael J. Wysocki1-9/+9
Rename the ->enter_freeze cpuidle driver callback to ->enter_s2idle to make it clear that it is used for entering suspend-to-idle and rename the related functions, variables and so on accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-05-15cpuidle: Fix idle time trackingPeter Zijlstra1-0/+1
Ville reported that on his Core2, which has TSC stop in idle, we would always report very short idle durations. He tracked this down to commit: e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") which replaces ktime_get() with local_clock(). Add a sched_clock_idle_wakeup_event() call, which will re-sync the clock with ktime_get_ns() when TSC is unstable and no-op otherwise. Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-01cpuidle: check dev before usage in cpuidle_use_deepest_state()Li, Fei1-1/+2
In case of there is no cpuidle devices registered, dev will be null, and panic will be triggered like below; In this patch, add checking of dev before usage, like that done in cpuidle_idle_call. Panic without fix: [ 184.961328] BUG: unable to handle kernel NULL pointer dereference at (null) [ 184.961328] IP: cpuidle_use_deepest_state+0x30/0x60 ... [ 184.961328] play_idle+0x8d/0x210 [ 184.961328] ? __schedule+0x359/0x8e0 [ 184.961328] ? _raw_spin_unlock_irqrestore+0x28/0x50 [ 184.961328] ? kthread_queue_delayed_work+0x41/0x80 [ 184.961328] clamp_idle_injection_func+0x64/0x1e0 Fixes: bb8313b603eb8 (cpuidle: Allow enforcing deepest idle state selection) Signed-off-by: Li, Fei <fei.li@intel.com> Tested-by: Shi, Feng <fengx.shi@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: 4.10+ <stable@vger.kernel.org> # 4.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>Ingo Molnar1-0/+1
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-06cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()Rafael J. Wysocki1-1/+7
Since cpuidle_use_deepest_state() is not static, add a proper kerneldoc comment to it to document its purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-29cpuidle: Allow enforcing deepest idle state selectionJacob Pan1-1/+12
When idle injection is used to cap power, we need to override the governor's choice of idle states. For this reason, make it possible the deepest idle state selection to be enforced by setting a flag on a given CPU to achieve the maximum potential power draw reduction. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-04cpuidle: Fix last_residency divisionShreyas B. Prabhu1-8/+4
Snooze is a poll idle state in powernv and pseries platforms. Snooze has a timeout so that if a CPU stays in snooze for more than target residency of the next available idle state, then it would exit thereby giving chance to the cpuidle governor to re-evaluate and promote the CPU to a deeper idle state. Therefore whenever snooze exits due to this timeout, its last_residency will be target_residency of the next deeper state. Commit e93e59ce5b85 "cpuidle: Replace ktime_get() with local_clock()" changed the math around last_residency calculation. Specifically, while converting last_residency value from nano- to microseconds, it carries out right shift by 10. Because of that, in snooze timeout exit scenarios last_residency calculated is roughly 2.3% less than target_residency of the next available state. This pattern is picked up by get_typical_interval() in the menu governor and therefore expected_interval in menu_select() is frequently less than the target_residency of any state other than snooze. Due to this we are entering snooze at a higher rate, thereby affecting the single thread performance. Fix this by using more precise division via ktime_us_delta(). Fixes: e93e59ce5b85 "cpuidle: Replace ktime_get() with local_clock()" Reported-by: Anton Blanchard <anton@samba.org> Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-05-18cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()Daniel Lezcano1-1/+1
Commit 0b89e9aa2856 (cpuidle: delay enabling interrupts until all coupled CPUs leave idle) rightfully fixed a regression by letting the coupled idle state framework to handle local interrupt enabling when the CPU is exiting an idle state. The current code checks if the idle state is coupled and, if so, it will let the coupled code to enable interrupts. This way, it can decrement the ready-count before handling the interrupt. This mechanism prevents the other CPUs from waiting for a CPU which is handling interrupts. But the check is done against the state index returned by the back end driver's ->enter functions which could be different from the initial index passed as parameter to the cpuidle_enter_state() function. entered_state = target_state->enter(dev, drv, index); [ ... ] if (!cpuidle_state_is_coupled(drv, entered_state)) local_irq_enable(); [ ... ] If the 'index' is referring to a coupled idle state but the 'entered_state' is *not* coupled, then the interrupts are enabled again. All CPUs blocked on the sync barrier may busy loop longer if the CPU has interrupts to handle before decrementing the ready-count. That's consuming more energy than saving. Fixes: 0b89e9aa2856 (cpuidle: delay enabling interrupts until all coupled CPUs leave idle) Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: 3.15+ <stable@vger.kernel.org> # 3.15+ [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-26cpuidle: Replace ktime_get() with local_clock()Daniel Lezcano1-4/+8
The ktime_get() can have a non negligeable overhead, use local_clock() instead. In order to test the difference between ktime_get() and local_clock(), a quick hack has been added to trigger, via debugfs, 10000 times a call to ktime_get() and local_clock() and measure the elapsed time. Then the average value, the min and max is computed for each call. From userspace, the test above was called 100 times every 2 seconds. So, ktime_get() and local_clock() have been called 1000000 times in total. The results are: ktime_get(): ============ * average: 101 ns (stddev: 27.4) * maximum: 38313 ns * minimum: 65 ns local_clock(): ============== * average: 60 ns (stddev: 9.8) * maximum: 13487 ns * minimum: 46 ns The local_clock() is faster and more stable. Even if it is a drop in the ocean, changing the ktime_get() by the local_clock() allows to save 80ns at idle time (entry + exit). And in some circumstances, especially when there are several CPUs racing for the clock access, we save tens of microseconds. The idle duration resulting from a diff is converted from nanosec to microsec. This could be done with integer division (div 1000) - which is an expensive operation or by 10 bits shifting (div 1024) - which is fast but unprecise. The following table gives some results at the limits. ------------------------------------------ | nsec | div(1000) | div(1024) | ------------------------------------------ | 1e3 | 1 usec | 976 nsec | ------------------------------------------ | 1e6 | 1000 usec | 976 usec | ------------------------------------------ | 1e9 | 1000000 usec | 976562 usec | ------------------------------------------ There is a linear deviation of 2.34%. This loss of precision is acceptable in the context of the resulting diff which is used for statistics. These ones are processed to guess estimate an approximation of the duration of the next idle period which ends up into an idle state selection. The selection criteria takes into account the next duration based on large intervals, represented by the idle state's target residency. The 2^10 division is enough because the approximation regarding the 1e3 division is lost in all the approximations done for the next idle duration computation. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-09cpuidle: Indicate when a device has been unregisteredDave Gerlach1-0/+2
Currently the 'registered' member of the cpuidle_device struct is set to 1 during cpuidle_register_device. In this same function there are checks to see if the device is already registered to prevent duplicate calls to register the device, but this value is never set to 0 even on unregister of the device. Because of this, any attempt to call cpuidle_register_device after a call to cpuidle_unregister_device will fail which shouldn't be the case. To prevent this, set registered to 0 when the device is unregistered. Fixes: c878a52d3c7c (cpuidle: Check if device is already registered) Signed-off-by: Dave Gerlach <d-gerlach@ti.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-22cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freezeSudeep Holla1-1/+1
Commit 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()" made find_deepest_state() return non-negative value and check all the states with index > 0. Also as a result, find_deepest_state() returns 0 even when enter_freeze callbacks are not implemented and enter_freeze_proper() is called which ends up crashing the kernel. This patch updates the check for index > 0 in cpuidle_enter_freeze and cpuidle_idle_call(when idle_should_freeze is true) to restore the suspend-to-idle functionality in absence of enter_freeze callback. Fixes: 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()" Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-19sched / idle: Drop default_idle_call() fallback from call_cpuidle()Rafael J. Wysocki1-3/+3
After commit 9c4b2867ed7c (cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0) it is clear that menu_select() cannot return negative values. Moreover, ladder_select_state() will never return a negative value too, so make find_deepest_state() return non-negative values too and drop the default_idle_call() fallback from call_cpuidle(). This eliminates one branch from the idle loop and makes the governors and find_deepest_state() handle the case when all states have been disabled from sysfs consistently. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sudeep Holla <sudeep.holla@arm.com>
2015-09-01Merge tag 'pm+acpi-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-2/+2
Pull power management and ACPI updates from Rafael Wysocki: "From the number of commits perspective, the biggest items are ACPICA and cpufreq changes with the latter taking the lead (over 50 commits). On the cpufreq front, there are many cleanups and minor fixes in the core and governors, driver updates etc. We also have a new cpufreq driver for Mediatek MT8173 chips. ACPICA mostly updates its debug infrastructure and adds a number of fixes and cleanups for a good measure. The Operating Performance Points (OPP) framework is updated with new DT bindings and support for them among other things. We have a few updates of the generic power domains framework and a reorganization of the ACPI device enumeration code and bus type operations. And a lot of fixes and cleanups all over. Included is one branch from the MFD tree as it contains some PM-related driver core and ACPI PM changes a few other commits are based on. Specifics: - ACPICA update to upstream revision 20150818 including method tracing extensions to allow more in-depth AML debugging in the kernel and a number of assorted fixes and cleanups (Bob Moore, Lv Zheng, Markus Elfring). - ACPI sysfs code updates and a documentation update related to AML method tracing (Lv Zheng). - ACPI EC driver fix related to serialized evaluations of _Qxx methods and ACPI tools updates allowing the EC userspace tool to be built from the kernel source (Lv Zheng). - ACPI processor driver updates preparing it for future introduction of CPPC support and ACPI PCC mailbox driver updates (Ashwin Chaugule). - ACPI interrupts enumeration fix for a regression related to the handling of IRQ attribute conflicts between MADT and the ACPI namespace (Jiang Liu). - Fixes related to ACPI device PM (Mika Westerberg, Srinidhi Kasagar). - ACPI device registration code reorganization to separate the sysfs-related code and bus type operations from the rest (Rafael J Wysocki). - Assorted cleanups in the ACPI core (Jarkko Nikula, Mathias Krause, Andy Shevchenko, Rafael J Wysocki, Nicolas Iooss). - ACPI cpufreq driver and ia64 cpufreq driver fixes and cleanups (Pan Xinhui, Rafael J Wysocki). - cpufreq core cleanups on top of the previous changes allowing it to preseve its sysfs directories over system suspend/resume (Viresh Kumar, Rafael J Wysocki, Sebastian Andrzej Siewior). - cpufreq fixes and cleanups related to governors (Viresh Kumar). - cpufreq updates (core and the cpufreq-dt driver) related to the turbo/boost mode support (Viresh Kumar, Bartlomiej Zolnierkiewicz). - New DT bindings for Operating Performance Points (OPP), support for them in the OPP framework and in the cpufreq-dt driver plus related OPP framework fixes and cleanups (Viresh Kumar). - cpufreq powernv driver updates (Shilpasri G Bhat). - New cpufreq driver for Mediatek MT8173 (Pi-Cheng Chen). - Assorted cpufreq driver (speedstep-lib, sfi, integrator) cleanups and fixes (Abhilash Jindal, Andrzej Hajda, Cristian Ardelean). - intel_pstate driver updates including Skylake-S support, support for enabling HW P-states per CPU and an additional vendor bypass list entry (Kristen Carlson Accardi, Chen Yu, Ethan Zhao). - cpuidle core fixes related to the handling of coupled idle states (Xunlei Pang). - intel_idle driver updates including Skylake Client support and support for freeze-mode-specific idle states (Len Brown). - Driver core updates related to power management (Andy Shevchenko, Rafael J Wysocki). - Generic power domains framework fixes and cleanups (Jon Hunter, Geert Uytterhoeven, Rajendra Nayak, Ulf Hansson). - Device PM QoS framework update to allow the latency tolerance setting to be exposed to user space via sysfs (Mika Westerberg). - devfreq support for PPMUv2 in Exynos5433 and a fix for an incorrect exynos-ppmu DT binding (Chanwoo Choi, Javier Martinez Canillas). - System sleep support updates (Alan Stern, Len Brown, SungEun Kim). - rockchip-io AVS support updates (Heiko Stuebner). - PM core clocks support fixup (Colin Ian King). - Power capping RAPL driver update including support for Skylake H/S and Broadwell-H (Radivoje Jovanovic, Seiichi Ikarashi). - Generic device properties framework fixes related to the handling of static (driver-provided) property sets (Andy Shevchenko). - turbostat and cpupower updates (Len Brown, Shilpasri G Bhat, Shreyas B Prabhu)" * tag 'pm+acpi-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (180 commits) cpufreq: speedstep-lib: Use monotonic clock cpufreq: powernv: Increase the verbosity of OCC console messages cpufreq: sfi: use kmemdup rather than duplicating its implementation cpufreq: drop !cpufreq_driver check from cpufreq_parse_governor() cpufreq: rename cpufreq_real_policy as cpufreq_user_policy cpufreq: remove redundant 'policy' field from user_policy cpufreq: remove redundant 'governor' field from user_policy cpufreq: update user_policy.* on success cpufreq: use memcpy() to copy policy cpufreq: remove redundant CPUFREQ_INCOMPATIBLE notifier event cpufreq: mediatek: Add MT8173 cpufreq driver dt-bindings: mediatek: Add MT8173 CPU DVFS clock bindings PM / Domains: Fix typo in description of genpd_dev_pm_detach() PM / Domains: Remove unusable governor dummies PM / Domains: Make pm_genpd_init() available to modules PM / domains: Align column headers and data in pm_genpd_summary output powercap / RAPL: disable the 2nd power limit properly tools: cpupower: Fix error when running cpupower monitor PM / OPP: Drop unlikely before IS_ERR(_OR_NULL) PM / OPP: Fix static checker warning (broken 64bit big endian systems) ...
2015-08-31Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+4
Pull scheduler updates from Ingo Molnar: "The biggest change in this cycle is the rewrite of the main SMP load balancing metric: the CPU load/utilization. The main goal was to make the metric more precise and more representative - see the changelog of this commit for the gory details: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") It is done in a way that significantly reduces complexity of the code: 5 files changed, 249 insertions(+), 494 deletions(-) and the performance testing results are encouraging. Nevertheless we need to keep an eye on potential regressions, since this potentially affects every SMP workload in existence. This work comes from Yuyang Du. Other changes: - SCHED_DL updates. (Andrea Parri) - Simplify architecture callbacks by removing finish_arch_switch(). (Peter Zijlstra et al) - cputime accounting: guarantee stime + utime == rtime. (Peter Zijlstra) - optimize idle CPU wakeups some more - inspired by Facebook server loads. (Mike Galbraith) - stop_machine fixes and updates. (Oleg Nesterov) - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra) - sched/numa tweaks. (Srikar Dronamraju) - misc fixes and small cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched/deadline: Fix comment in enqueue_task_dl() sched/deadline: Fix comment in push_dl_tasks() sched: Change the sched_class::set_cpus_allowed() calling context sched: Make sched_class::set_cpus_allowed() unconditional sched: Fix a race between __kthread_bind() and sched_setaffinity() sched: Ensure a task has a non-normalized vruntime when returning back to CFS sched/numa: Fix NUMA_DIRECT topology identification tile: Reorganize _switch_to() sched, sparc32: Update scheduler comments in copy_thread() sched: Remove finish_arch_switch() sched, tile: Remove finish_arch_switch sched, sh: Fold finish_arch_switch() into switch_to() sched, score: Remove finish_arch_switch() sched, avr32: Remove finish_arch_switch() sched, MIPS: Get rid of finish_arch_switch() sched, arm: Remove finish_arch_switch() sched/fair: Clean up load average references sched/fair: Provide runnable_load_avg back to cfs_rq sched/fair: Remove task and group entity load when they are dead sched/fair: Init cfs_rq's sched_entity load average ...
2015-08-28cpuidle/coupled: Remove redundant 'dev' argument of cpuidle_state_is_coupled()Xunlei Pang1-2/+2
For cpuidle_state_is_coupled(), 'dev' is not used, so remove it. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-07-21sched/idle: Move latency tracing stop/start calls deeper inside the idle loopLucas Stach1-0/+4
Make sure to stop tracing only once we are past a point where all latency tracing events have been processed (irqs are not enabled again). This has the slight advantage of capturing more latency related events in the idle path, but most importantly it makes sure that latency tracing doesn't get re-enabled inadvertently when new events are coming in. This makes the irqsoff latency tracer useful again, as we stop capturing CPU sleep time as IRQ latency. Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel@pengutronix.de Cc: patchwork-lst@pengutronix.de Link: http://lkml.kernel.org/r/1437410090-3747-1-git-send-email-l.stach@pengutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-09suspend-to-idle: Prevent RCU from complaining about tick_freeze()Rafael J. Wysocki1-2/+7
Put tick_freeze() under RCU_NONIDLE() to prevent RCU from complaining about suspicious RCU usage in idle by trace_suspend_resume() called from there. While at it, fix a comment related to another usage of RCU_NONIDLE() in enter_freeze_proper(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-06-19Merge branches 'pm-sleep' and 'pm-runtime'Rafael J. Wysocki1-0/+2
* pm-sleep: PM / sleep: trace_device_pm_callback coverage in dpm_prepare/complete PM / wakeup: add a dummy wakeup_source to record statistics PM / sleep: Make suspend-to-idle-specific code depend on CONFIG_SUSPEND PM / sleep: Return -EBUSY from suspend_enter() on wakeup detection PM / tick: Add tracepoints for suspend-to-idle diagnostics PM / sleep: Fix symbol name in a comment in kernel/power/main.c leds / PM: fix hibernation on arm when gpio-led used with CPU led trigger ARM: omap-device: use SET_NOIRQ_SYSTEM_SLEEP_PM_OPS bus: omap_l3_noc: add missed callbacks for suspend-to-disk PM / sleep: Add macro to define common noirq system PM callbacks PM / sleep: Refine diagnostic messages in enter_state() PM / wakeup: validate wakeup source before activating it. * pm-runtime: PM / Runtime: Update last_busy in rpm_resume PM / runtime: add note about re-calling in during device probe()
2015-05-30cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.cRafael J. Wysocki1-3/+3
The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0. However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) entry in the cpuidle driver's table of states is overwritten with the default "poll" entry by the core. The "state" defined by the "poll" entry doesn't provide ->enter_dead and ->enter_freeze callbacks and its exit_latency is 0. For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state" will be skipped by the loop). It also is arguably unuseful to return states with exit_latency equal to 0 from find_deepest_state(), so the function can be modified to start the loop from index 0 and the "poll state" will be skipped by it as a result of the check against latency_req. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2015-05-19PM / sleep: Make suspend-to-idle-specific code depend on CONFIG_SUSPENDRafael J. Wysocki1-0/+2
Since idle_should_freeze() is defined to always return 'false' for CONFIG_SUSPEND unset, all of the code depending on it in cpuidle_idle_call() is not necessary in that case. Make that code depend on CONFIG_SUSPEND too to avoid building it when it is not going to be used. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de>