aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-02-21sched/core: Fix build paravirt build on arm and arm64Mark Brown1-0/+3
Commit 004172bdad64 ("sched/core: Remove unnecessary #include headers") removed the inclusion of asm/paravirt.h which is used to get declarations of paravirt_steal_rq_enabled and paravirt_steal_clock. It is implicitly included on x86 but not on arm and arm64 breaking the build if paravirtualization is used. Since things from that header are used directly fix the build by putting the direct inclusion back. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-20Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds17-2203/+2406
Pull scheduler updates from Ingo Molnar: "The main changes in this (fairly busy) cycle were: - There was a class of scheduler bugs related to forgetting to update the rq-clock timestamp which can cause weird and hard to debug problems, so there's a new debug facility for this: which uncovered a whole lot of bugs which convinced us that we want to keep the debug facility. (Peter Zijlstra, Matt Fleming) - Various cputime related updates: eliminate cputime and use u64 nanoseconds directly, simplify and improve the arch interfaces, implement delayed accounting more widely, etc. - (Frederic Weisbecker) - Move code around for better structure plus cleanups (Ingo Molnar) - Move IO schedule accounting deeper into the scheduler plus related changes to improve the situation (Tejun Heo) - ... plus a round of sched/rt and sched/deadline fixes, plus other fixes, updats and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits) sched/core: Remove unlikely() annotation from sched_move_task() sched/autogroup: Rename auto_group.[ch] to autogroup.[ch] sched/topology: Split out scheduler topology code from core.c into topology.c sched/core: Remove unnecessary #include headers sched/rq_clock: Consolidate the ordering of the rq_clock methods delayacct: Include <uapi/linux/taskstats.h> sched/core: Clean up comments sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds sched/clock: Add dummy clear_sched_clock_stable() stub function sched/cputime: Remove generic asm headers sched/cputime: Remove unused nsec_to_cputime() s390, sched/cputime: Remove unused cputime definitions powerpc, sched/cputime: Remove unused cputime definitions s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs ia64, sched/cputime: Remove unused cputime definitions ia64: Convert vtime to use nsec units directly ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it sched/cputime: Remove jiffies based cputime sched/cputime, vtime: Return nsecs instead of cputime_t to account sched/cputime: Complete nsec conversion of tick based accounting ...
2017-02-10sched/core: Remove unlikely() annotation from sched_move_task()Steven Rostedt (VMware)1-2/+2
The check for 'running' in sched_move_task() has an unlikely() around it. That is, it is unlikely that the task being moved is running. That use to be true. But with a couple of recent updates, it is now likely that the task will be running. The first change came from ea86cb4b7621 ("sched/cgroup: Fix cpu_cgroup_fork() handling") that moved around the use case of sched_move_task() in do_fork() where the call is now done after the task is woken (hence it is running). The second change came from 8e5bfa8c1f84 ("sched/autogroup: Do not use autogroup->tg in zombie threads") where sched_move_task() is called by the exit path, by the task that is exiting. Hence it too is running. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-08sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]Ingo Molnar4-2/+2
The names are all 'autogroup', not 'auto_group' - so rename the kernel/sched/auto_group.[ch] to match the existing nomenclature. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07sched/topology: Split out scheduler topology code from core.c into topology.cIngo Molnar4-1658/+1684
Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07sched/core: Remove unnecessary #include headersIngo Molnar1-49/+10
Over the years sched/core.c accumulated over 50 #include lines, 40 of which are superfluous. (!) Removing them decreases the preprocessed .c file (.i) size noticeably: triton:~/tip> wc -l kernel/sched/core.i Before: 76387 kernel/sched/core.i After: 75896 kernel/sched/core.i Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07sched/rq_clock: Consolidate the ordering of the rq_clock methodsIngo Molnar1-75/+78
update_rq_clock_task() and update_rq_clock() we unnecessarily spread across core.c, requiring an extra prototype line. Move them next to each other and in the proper order. Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07sched/core: Clean up commentsIngo Molnar1-176/+182
Refresh the comments in the core scheduler code: - Capitalize sentences consistently - Capitalize 'CPU' consistently - ... and other small details. Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in millisecondsShile Zhang2-2/+4
We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit: ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice") ... which name suggests to users that it's in milliseconds, while in reality it's being set in milliseconds but the result is shown in jiffies. This is obviously confusing when HZ is not 1000, it makes it appear like the value set failed, such as HZ=100: root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms root# cat /proc/sys/kernel/sched_rr_timeslice_ms 10 Fix this to be milliseconds all around. Signed-off-by: Shile Zhang <shile.zhang@nokia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime, vtime: Return nsecs instead of cputime_t to accountFrederic Weisbecker1-20/+11
Turn the full dynticks cputime clock source to return nsec while keeping its very internals jiffies based for performance reasons. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Complete nsec conversion of tick based accountingFrederic Weisbecker1-30/+22
This is the final step toward tick based cputime conversion. Now that the whole cputime accounting engine accounts in nsecs, we can convert the very source of the cputime to account in nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_system_time() in nsecsFrederic Weisbecker1-20/+19
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_idle_time() in nsecsFrederic Weisbecker1-9/+9
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_steal_time() in nsecsFrederic Weisbecker1-5/+6
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_user_time() in nsecsFrederic Weisbecker1-18/+24
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01timers/posix-timers: Convert internals to use nsecsFrederic Weisbecker2-5/+5
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Also convert posix-cpu-timers to use nsec based internal counters to simplify it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Increment kcpustat directly on irqtime accountFrederic Weisbecker2-36/+21
The irqtime is accounted is nsecs and stored in cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the accumulated amount reaches a new jiffy, this one gets accounted to the kcpustat. This was necessary when kcpustat was stored in cputime_t, which could at worst have jiffies granularity. But now kcpustat is stored in nsecs so this whole discretization game with temporary irqtime storage has become unnecessary. We can now directly account the irqtime to the kcpustat. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert task/group cputime to nsecsFrederic Weisbecker1-18/+17
Now that most cputime readers use the transition API which return the task cputime in old style cputime_t, we can safely store the cputime in nsecs. This will eventually make cputime statistics less opaque and more granular. Back and forth convertions between cputime_t and nsecs in order to deal with cputime_t random granularity won't be needed anymore. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert guest time accounting to nsecs (u64)Frederic Weisbecker1-4/+4
cputime_t is being obsolete and replaced by nsecs units in order to make internal timestamps less opaque and more granular. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert kcpustat to nsecsFrederic Weisbecker2-12/+12
Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation. Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/rt: Add a missing rescheduling pointSebastian Andrzej Siewior2-4/+2
Since the change in commit: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks") ... we don't reschedule a task under certain circumstances: Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on CPU0) and holds a PI lock. This task is removed from the CPU because it used up its time slice and another SCHED_OTHER task is running. Task-B on CPU1 runs at RT priority and asks for the lock owned by task-A. This results in a priority boost for task-A. Task-B goes to sleep until the lock has been made available. Task-A is already runnable (but not active), so it receives no wake up. The reality now is that task-A gets on the CPU once the scheduler decides to remove the current task despite the fact that a high priority task is enqueued and waiting. This may take a long time. The desired behaviour is that CPU0 immediately reschedules after the priority boost which made task-A the task with the lowest priority. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks") Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/core: Fix &rd->cpudl memory leakMathieu Poirier1-1/+3
While in the process of initialising a root domain, if function cpupri_init() fails the memory allocated in cpudl_init() is not reclaimed. Adding a new goto target to cleanup the previous initialistion of the root_domain's dl_bw structure reclaims said memory. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/core: Fix &rd->rto_mask memory leakMathieu Poirier1-1/+1
If function cpudl_init() fails the memory allocated for &rd->rto_mask needs to be freed, something this patch is addressing. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/fair: Restore previous rq_flags when migrating tasks in hotplugMatt Fleming1-1/+9
__migrate_task() can return with a different runqueue locked than the one we passed as an argument. So that we can repin the lock in migrate_tasks() (and keep the update_rq_clock() bit) we need to restore the old rq_flags before repinning. Note that it wouldn't be correct to change move_queued_task() to repin because of the change of runqueue and the fact that having an up-to-date clock on the initial rq doesn't mean the new rq has one too. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/core: Add missing update_rq_clock() call in sched_move_task()Peter Zijlstra1-0/+1
Bug was noticed via this warning: WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80 rq->clock_update_flags < RQCF_ACT_SKIP Modules linked in: CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1 Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014 Call Trace: dump_stack+0x4d/0x65 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 detach_task_cfs_rq+0x8e8/0xb80 ? allocate_cgrp_cset_links+0x59/0x80 task_change_group_fair+0x27/0x150 sched_change_group+0x48/0xf0 sched_move_task+0x53/0x150 cpu_cgroup_attach+0x36/0x70 cgroup_taskset_migrate+0x175/0x300 cgroup_migrate+0xab/0xd0 cgroup_attach_task+0xf0/0x190 __cgroup_procs_write+0x1ed/0x2f0 cgroup_procs_write+0x14/0x20 cgroup_file_write+0x3f/0x100 kernfs_fop_write+0x104/0x180 __vfs_write+0x37/0x140 vfs_write+0xb8/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x61/0x170 entry_SYSCALL64_slow_path+0x25/0x25 Reported-by: Ingo Molnar <mingo@kernel.org> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30sched/core: Optimize pick_next_task() for idle_sched_classPeter Zijlstra1-3/+2
Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to take the shortcut, even though all runnable tasks are of the fair class, because prev->sched_class != &fair_sched_class. Since I reworked the put_prev_task() stuff, we don't really care about prev->class here, so removing that condition will allow this case. This increases the likely case from 78% to 98% correct for Steve's workload. Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-27timers: Omit POSIX timer stuff from task_struct when disabledNicolas Pitre2-12/+24
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related structures from struct task_struct and struct signal_struct as they won't contain anything useful and shouldn't be relied upon by mistake. Code still referencing those structures is also disabled here. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-20sched/clock: Fix hotplug crashPeter Zijlstra1-21/+8
Mike reported that he could trigger the WARN_ON_ONCE() in set_sched_clock_stable() using hotplug. This exposed a fundamental problem with the interface, we should never mark the TSC stable if we ever find it to be unstable. Therefore set_sched_clock_stable() is a broken interface. The reason it existed is that not having it is a pain, it means all relevant architecture code needs to call clear_sched_clock_stable() where appropriate. Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64 and parisc are trivial in that they never called set_sched_clock_stable(), so add an unconditional call to clear_sched_clock_stable() to them. For x86 the story is a lot more involved, and what this patch tries to do is ensure we preserve the status quo. So even is Cyrix or Transmeta have usable TSC they never called set_sched_clock_stable() so they now get an explicit mark unstable. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 9881b024b7d7 ("sched/clock: Delay switching sched_clock to stable") Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Separate out io_schedule_prepare() and io_schedule_finish()Tejun Heo1-5/+28
Now that IO schedule accounting is done inside __schedule(), io_schedule() can be split into three steps - prep, schedule, and finish - where the schedule part doesn't need any special annotation. This allows marking a sleep as iowait by simply wrapping an existing blocking function with io_schedule_prepare() and io_schedule_finish(). Because task_struct->in_iowait is single bit, the caller of io_schedule_prepare() needs to record and the pass its state to io_schedule_finish() to be safe regarding nesting. While this isn't the prettiest, these functions are mostly gonna be used by core functions and we don't want to use more space for ->in_iowait. While at it, as it's simple to do now, reimplement io_schedule() without unnecessarily going through io_schedule_timeout(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adilger.kernel@dilger.ca Cc: jack@suse.com Cc: kernel-team@fb.com Cc: mingbo@fb.com Cc: tytso@mit.edu Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: move IO scheduling accounting from io_schedule_timeout() into schedulerTejun Heo1-7/+61
For an interface to support blocking for IOs, it must call io_schedule() instead of schedule(). This makes it tedious to add IO blocking to existing interfaces as the switching between schedule() and io_schedule() is often buried deep. As we already have a way to mark the task as IO scheduling, this can be made easier by separating out io_schedule() into multiple steps so that IO schedule preparation can be performed before invoking a blocking interface and the actual accounting happens inside the scheduler. io_schedule_timeout() does the following three things prior to calling schedule_timeout(). 1. Mark the task as scheduling for IO. 2. Flush out plugged IOs. 3. Account the IO scheduling. done close to the actual scheduling. This patch moves #3 into the scheduler so that later patches can separate out preparation and finish steps from io_schedule(). Patch-originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adilger.kernel@dilger.ca Cc: akpm@linux-foundation.org Cc: axboe@kernel.dk Cc: jack@suse.com Cc: kernel-team@fb.com Cc: mingbo@fb.com Cc: tytso@mit.edu Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares()Dietmar Eggemann1-0/+12
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Turner <pjt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Fix group_entity's share updateVincent Guittot1-13/+37
The update of the share of a cfs_rq is done when its load_avg is updated but before the group_entity's load_avg has been updated for the past time slot. This generates wrong load_avg accounting which can be significant when small tasks are involved in the scheduling. Let take the example of a task a that is dequeued of its task group A: root (cfs_rq) \ (se) A (cfs_rq) \ (se) a Task "a" was the only task in task group A which becomes idle when a is dequeued. We have the sequence: - dequeue_entity a->se - update_load_avg(a->se) - dequeue_entity_load_avg(A->cfs_rq, a->se) - update_cfs_shares(A->cfs_rq) A->cfs_rq->load.weight == 0 A->se->load.weight is updated with the new share (0 in this case) - dequeue_entity A->se - update_load_avg(A->se) but its weight is now null so the last time slot (up to a tick) will be accounted with a weight of 0 instead of its real weight during the time slot. The last time slot will be accounted as an idle one whereas it was a running one. If the running time of task a is short enough that no tick happens when it runs, all running time of group entity A->se will be accounted as idle time. Instead, we should update the share of a cfs_rq (in fact the weight of its group entity) only after having updated the load_avg of the group_entity. update_cfs_shares() now takes the sched_entity as a parameter instead of the cfs_rq, and the weight of the group_entity is updated only once its load_avg has been synced with current time. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pjt@google.com Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/completions: Fix complete_all() semanticsPeter Zijlstra1-4/+6
Documentation/scheduler/completion.txt says this about complete_all(): "calls complete_all() to signal all current and future waiters." Which doesn't strictly match the current semantics. Currently complete_all() is equivalent to UINT_MAX/2 complete() invocations, which is distinctly less than 'all current and future waiters' (enumerable vs innumerable), although it has worked in practice. However, Dmitry had a weird case where it might matter, so change completions to use saturation semantics for complete()/complete_all(). Once done hits UINT_MAX (and complete_all() sets it there) it will never again be decremented. Requested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: der.herr@hofr.at Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/deadline: Show leftover runtime and abs deadline in /proc/*/schedTommaso Cucinotta1-0/+4
This patch allows for reading the current (leftover) runtime and absolute deadline of a SCHED_DEADLINE task through /proc/*/sched (entries dl.runtime and dl.deadline), while debugging/testing. Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Luca Abeni <luca.abeni@unitn.it> Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/clock: Provide better clock continuityPeter Zijlstra1-34/+65
When switching between the unstable and stable variants it is currently possible that clock discontinuities occur. And while these will mostly be 'small', attempt to do better. As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the ktime_get_ns() based timeline at the point of switchover (sched_clock_init_late()) after SMP bringup. Equally, when the TSC is later found to be unstable -- typically because SMM tries to hide its SMI latencies by mucking with the TSC -- we want to avoid large jumps. Since the clocksource watchdog reports the issue after the fact we cannot exactly fix up time, but since SMI latencies are typically small (~10ns range), the discontinuity is mainly due to drift between sched_clock() and ktime_get_ns() (which on my desktop is ~79s over 24days). I dislike this patch because it adds overhead to the good case in favour of dealing with badness. But given the widespread failure of TSC stability this is worth it. Note that in case the TSC makes drastic jumps after SMP bringup we're still hosed. There's just not much we can do in that case without stupid overhead. If we were to somehow expose tsc_clocksource_reliable (which is hard because this code is also used on ia64 and parisc) we could avoid some of the newly introduced overhead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/clock: Delay switching sched_clock to stablePeter Zijlstra2-28/+26
Currently we switch to the stable sched_clock if we guess the TSC is usable, and then switch back to the unstable path if it turns out TSC isn't stable during SMP bringup after all. Delay switching to the stable path until after SMP bringup is complete. This way we'll avoid switching during the time we detect the worst of the TSC offences. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/clock: Update static_key usagePeter Zijlstra1-8/+4
sched_clock was still using the deprecated static_key interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add debugging code to catch missing update_rq_clock() callsMatt Fleming2-10/+75
There's no diagnostic checks for figuring out when we've accidentally missed update_rq_clock() calls. Let's add some by piggybacking on the rq_*pin_lock() wrappers. The idea behind the diagnostic checks is that upon pining rq lock the rq clock should be updated, via update_rq_clock(), before anybody reads the clock with rq_clock() or rq_clock_task(). The exception to this rule is when updates have explicitly been disabled with the rq_clock_skip_update() optimisation. There are some functions that only unpin the rq lock in order to grab some other lock and avoid deadlock. In that case we don't need to update the clock again and the previous diagnostic state can be carried over in rq_repin_lock() by saving the state in the rq_flags context. Since this patch adds a new clock update flag and some already exist in rq::clock_skip_update, that field has now been renamed. An attempt has been made to keep the flag manipulation code small and fast since it's used in the heart of the __schedule() fast path. For the !CONFIG_SCHED_DEBUG case the only object code change (other than addresses) is the following change to reset RQCF_ACT_SKIP inside of __schedule(), - c7 83 38 09 00 00 00 movl $0x0,0x938(%rbx) - 00 00 00 + 83 a3 38 09 00 00 fc andl $0xfffffffc,0x938(%rbx) Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add missing update_rq_clock() call in set_user_nice()Peter Zijlstra1-0/+2
Address this rq-clock update bug: WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() set_next_entity() ? _raw_spin_lock() set_curr_task_fair() set_user_nice.part.85() set_user_nice() create_worker() worker_thread() kthread() ret_from_fork() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add missing update_rq_clock() call for task_hot()Peter Zijlstra1-0/+2
Add the update_rq_clock() call at the top of the callstack instead of at the bottom where we find it missing, this to aid later effort to minimize the number of update_rq_lock() calls. WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() assert_clock_updated.isra.63.part.64() can_migrate_task() load_balance() pick_next_task_fair() __schedule() schedule() worker_thread() kthread() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()Peter Zijlstra1-0/+3
Instead of adding the update_rq_clock() all the way at the bottom of the callstack, add one at the top, this to aid later effort to minimize update_rq_lock() calls. WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() detach_task_cfs_rq() switched_from_fair() __sched_setscheduler() _sched_setscheduler() sched_set_stop_task() cpu_stop_create() __smpboot_create_thread.part.2() smpboot_register_percpu_thread_cpumask() cpu_stop_init() do_one_initcall() ? print_cpu_info() kernel_init_freeable() ? rest_init() kernel_init() ret_from_fork() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()Peter Zijlstra2-0/+2
Address this rq-clock update bug: WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: __warn() post_init_entity_util_avg() wake_up_new_task() _do_fork() kernel_thread() rest_init() start_kernel() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/fair: Push rq lock pin/unpin into idle_balance()Matt Fleming1-12/+15
Future patches will emit warnings if rq_clock() is called before update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair. Since there is only one caller of idle_balance() we can push the unpin/repin there. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lockMatt Fleming1-1/+4
rq_clock() is called from sched_info_{depart,arrive}() after resetting RQCF_ACT_SKIP but prior to a call to update_rq_clock(). In preparation for pending patches that check whether the rq clock has been updated inside of a pin context before rq_clock() is called, move the reset of rq->clock_skip_update immediately before unpinning the rq lock. This will avoid the new warnings which check if update_rq_clock() is being actively skipped. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/core: Add wrappers for lockdep_(un)pin_lock()Matt Fleming7-61/+76
In preparation for adding diagnostic checks to catch missing calls to update_rq_clock(), provide wrappers for (re)pinning and unpinning rq->lock. Because the pending diagnostic checks allow state to be maintained in rq_flags across pin contexts, swap the 'struct pin_cookie' arguments for 'struct rq_flags *'. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/cputime: Rename vtime_account_user() to vtime_flush()Frederic Weisbecker1-3/+1
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and account it on ticks and context switches only through the vtime_account_user() function. Now this model has been generalized on the 3 archs for all kind of cputime (system, irq, ...) and all the cputime flushing happens under vtime_account_user(). So let's rename this function to better reflect its new role. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/cputime: Export account_guest_time()Frederic Weisbecker1-1/+1
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay cputime accounting to the tick, let's allow archs to account cputime directly to gtime. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14sched/cputime: Allow accounting system time using cpustat indexFrederic Weisbecker1-5/+5
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay cputime accounting to the tick, let's provide APIs to account system time to precise contexts: hardirq, softirq, pure system, ... Inspired-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-25ktime: Cleanup ktime_set() usageThomas Gleixner1-1/+1
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-13Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds3-86/+209
Pull power management updates from Rafael Wysocki: "Again, cpufreq gets more changes than the other parts this time (one new driver, one old driver less, a bunch of enhancements of the existing code, new CPU IDs, fixes, cleanups) There also are some changes in cpuidle (idle injection rework, a couple of new CPU IDs, online/offline rework in intel_idle, fixes and cleanups), in the generic power domains framework (mostly related to supporting power domains containing CPUs), and in the Operating Performance Points (OPP) library (mostly related to supporting devices with multiple voltage regulators) In addition to that, the system sleep state selection interface is modified to make it easier for distributions with unchanged user space to support suspend-to-idle as the default system suspend method, some issues are fixed in the PM core, the latency tolerance PM QoS framework is improved a bit, the Intel RAPL power capping driver is cleaned up and there are some fixes and cleanups in the devfreq subsystem Specifics: - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding for it (Markus Mayer) - Support for ARM Integrator/AP and Integrator/CP in the generic DT cpufreq driver and elimination of the old Integrator cpufreq driver (Linus Walleij) - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier, and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie, Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik) - cpufreq core fix to eliminate races that may lead to using inactive policy objects and related cleanups (Rafael Wysocki) - cpufreq schedutil governor update to make it use SCHED_FIFO kernel threads (instead of regular workqueues) for doing delayed work (to reduce the response latency in some cases) and related cleanups (Viresh Kumar) - New cpufreq sysfs attribute for resetting statistics (Markus Mayer) - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis, Viresh Kumar) - Support for using generic cpufreq governors in the intel_pstate driver (Rafael Wysocki) - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy Performance Preference/Energy Performance Bias) knobs in the intel_pstate driver (Srinivas Pandruvada) - New CPU ID for Knights Mill in intel_pstate (Piotr Luc) - intel_pstate driver modification to use the P-state selection algorithm based on CPU load on platforms with the system profile in the ACPI tables set to "mobile" (Srinivas Pandruvada) - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki, Srinivas Pandruvada) - cpufreq powernv driver updates including fast switching support (for the schedutil governor), fixes and cleanus (Akshay Adiga, Andrew Donnellan, Denis Kirjanov) - acpi-cpufreq driver rework to switch it over to the new CPU offline/online state machine (Sebastian Andrzej Siewior) - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth Prakash) - Idle injection rework (to make it use the regular idle path instead of a home-grown custom one) and related powerclamp thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej Siewior) - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy Shevchenko, Piotr Luc) - intel_idle driver cleanups and switch over to using the new CPU offline/online state machine (Anna-Maria Gleixner, Sebastian Andrzej Siewior) - cpuidle DT driver update to support suspend-to-idle properly (Sudeep Holla) - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian, Rafael Wysocki) - Preliminary support for power domains including CPUs in the generic power domains (genpd) framework and related DT bindings (Lina Iyer) - Assorted fixes and cleanups in the generic power domains (genpd) framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven) - Preliminary support for devices with multiple voltage regulators and related fixes and cleanups in the Operating Performance Points (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd) - System sleep state selection interface rework to make it easier to support suspend-to-idle as the default system suspend method (Rafael Wysocki) - PM core fixes and cleanups, mostly related to the interactions between the system suspend and runtime PM frameworks (Ulf Hansson, Sahitya Tummala, Tony Lindgren) - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski) - New Knights Mill CPU ID for the Intel RAPL power capping driver (Piotr Luc) - Intel RAPL power capping driver fixes, cleanups and switch over to using the new CPU offline/online state machine (Jacob Pan, Thomas Gleixner, Sebastian Andrzej Siewior) - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc, rockchip-dfi devfreq drivers and the devfreq core (Axel Lin, Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar) - Fix for false-positive KASAN warnings during resume from ACPI S3 (suspend-to-RAM) on x86 (Josh Poimboeuf) - Memory map verification during resume from hibernation on x86 to ensure a consistent address space layout (Chen Yu) - Wakeup sources debugging enhancement (Xing Wei) - rockchip-io AVS driver cleanup (Shawn Lin)" * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits) devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks devfreq: rk3399_dmc: Remove dangling rcu_read_unlock() devfreq: exynos: Don't use OPP structures outside of RCU locks Documentation: intel_pstate: Document HWP energy/performance hints cpufreq: intel_pstate: Support for energy performance hints with HWP cpufreq: intel_pstate: Add locking around HWP requests PM / sleep: Print active wakeup sources when blocking on wakeup_count reads PM / core: Fix bug in the error handling of async suspend PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend PM / Domains: Fix compatible for domain idle state PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators() PM / OPP: Allow platform specific custom set_opp() callbacks PM / OPP: Separate out _generic_set_opp() PM / OPP: Add infrastructure to manage multiple regulators PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage() PM / OPP: Manage supply's voltage/current in a separate structure PM / OPP: Don't use OPP structure outside of rcu protected section PM / OPP: Reword binding supporting multiple regulators per device PM / OPP: Fix incorrect cpu-supply property in binding cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state() ..