path: root/kernel/sched_fair.c
Age | Commit message | Author | Files | Lines
2010-12-19 | sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight | Paul Turner | 1 | -1/+5
Mike Galbraith reported poor interactivity[*] when the new shares distribution code was combined with autogroups. The root cause turns out to be a mis-ordering of accounting accrued execution time and shares updates. Since update_curr() is issued hierarchically, updating the parent entity weights to reflect child enqueue/dequeue results in the parent's unaccounted execution time then being accrued (vs vruntime) at the new weight as opposed to the weight present at accumulation. While this doesn't have much effect on processes with timeslices that cross a tick, it is particularly problematic for an interactive process (e.g. Xorg) which incurs many (tiny) timeslices. In this scenario almost all updates are at dequeue which can result in significant fairness perturbation (especially if it is the only thread, resulting in potential {tg->shares, MIN_SHARES} transitions). Correct this by ensuring unaccounted time is accumulated prior to manipulating an entity's weight. [*] http://xkcd.com/619/ is perversely Nostradamian here. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20101216031038.159704378@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
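To see why the ordering matters, here is a minimal standalone model of the fix (the struct entity, charge_unaccounted() and reweight_entity() names are illustrative, not the kernel's): vruntime must absorb run-time at the weight the entity actually ran with, before the weight changes.

    /* Simplified model: vruntime grows roughly as delta_exec / weight, so
     * run-time accumulated at the old weight has to be charged before the
     * weight is changed, otherwise it is scaled by the wrong weight. */
    struct entity {
        unsigned long weight;               /* assumed non-zero */
        unsigned long long vruntime;
        unsigned long long unaccounted_ns;  /* run-time not yet charged */
    };

    static void charge_unaccounted(struct entity *se)
    {
        se->vruntime += se->unaccounted_ns / se->weight;
        se->unaccounted_ns = 0;
    }

    static void reweight_entity(struct entity *se, unsigned long new_weight)
    {
        charge_unaccounted(se);   /* accrue at the weight it ran with */
        se->weight = new_weight;  /* only then switch to the new weight */
    }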
2010-12-19 | sched: Move periodic share updates to entity_tick() | Paul Turner | 1 | -4/+17
Long running entities that do not block (dequeue) require periodic updates to maintain accurate share values. (Note: group entities with several threads are quite likely to be non-blocking in many circumstances). By virtue of being long-running however, we will see entity ticks (otherwise the required update occurs in dequeue/put and we are done). Thus we can move the detection (and associated work) for these updates into the periodic path. This restores the 'atomicity' of update_curr() with respect to accounting. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101216031038.067028969@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08 | Merge branch 'linus' into sched/core | Ingo Molnar | 1 | -5/+3
Merge reason: we want to queue up dependent cleanup Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26 | Merge commit 'v2.6.37-rc3' into sched/core | Ingo Molnar | 1 | -9/+31
Merge reason: Pick up latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23 | sched: Fix UP build breakage | Peter Zijlstra | 1 | -1/+1
The recent cgroup-scheduling rework caused a UP build problem. Cc: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Allow update_cfs_load() to update global load | Paul Turner | 1 | -15/+29
Refactor the global load updates from update_shares_cpu() so that update_cfs_load() can update global load when it is more than ~10% out of sync. The new global_load parameter allows us to force an update, regardless of the error factor so that we can synchronize w/ update_shares(). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.377473595@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Implement demand based update_cfs_load() | Paul Turner | 1 | -0/+12
When the system is busy, dilation of rq->next_balance makes lb->update_shares() insufficiently frequent for threads which don't sleep (no dequeue/enqueue updates). Adjust for this by making demand based updates based on the accumulation of execution time sufficient to wrap our averaging window. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.291159744@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
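The trigger can be pictured with a small sketch (the window constant, struct and function names here are illustrative placeholders): once the run-time accumulated since the last update wraps the averaging window, force a load/shares update from the accounting path.

    #define SHARES_WINDOW_NS (10ULL * 1000 * 1000)   /* ~10ms averaging window */

    struct cfs_rq_sample {
        unsigned long long exec_since_update;  /* ns run since last share update */
    };

    /* Called from the run-time accounting path; returns 1 when enough
     * execution has accumulated to warrant a demand-based load update. */
    static int load_update_due(struct cfs_rq_sample *cfs_rq,
                               unsigned long long delta_exec)
    {
        cfs_rq->exec_since_update += delta_exec;
        if (cfs_rq->exec_since_update < SHARES_WINDOW_NS)
            return 0;
        cfs_rq->exec_since_update = 0;
        return 1;
    }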
2010-11-18 | sched: Update shares on idle_balance | Paul Turner | 1 | -0/+1
Since shares updates are no longer expensive and effectively local, update them at idle_balance(). This allows us to more quickly redistribute shares to another cpu when our load becomes idle. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.204191702@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Add sysctl_sched_shares_window | Paul Turner | 1 | -1/+8
Introduce a new sysctl for the shares window and disambiguate it from sched_time_avg. A 10ms window appears to be a good compromise between accuracy and performance. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.112173964@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Introduce hierarchical order on shares update list | Paul Turner | 1 | -10/+16
Avoid duplicate shares update calls by ensuring children always appear before parents in rq->leaf_cfs_rq_list. This allows us to do a single in-order traversal for update_shares(). Since we always enqueue in bottom-up order this reduces to 2 cases: 1) Our parent is already in the list, e.g. root \ b /\ c d* (root->b->c already enqueued) Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b. 2) Our parent does not appear in the list (or we have no parent) In this case we enqueue to the tail of the list, if our parent is subsequently enqueued (bottom-up) it will appear to our right by the same rule. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.022488865@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
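Both cases reduce to a head-or-tail insertion on the per-rq list; a sketch using the kernel's list_add()/list_add_tail() (the struct and field names are simplified stand-ins for the real cfs_rq fields):

    struct leaf_cfs_rq {
        struct list_head leaf_list;   /* link into rq->leaf_cfs_rq_list */
        int on_list;
        struct leaf_cfs_rq *parent;   /* NULL at the root */
    };

    static void list_add_leaf(struct leaf_cfs_rq *cfs_rq, struct list_head *rq_list)
    {
        if (cfs_rq->on_list)
            return;

        if (cfs_rq->parent && cfs_rq->parent->on_list)
            /* case 1: parent already listed, push to the head so we precede it */
            list_add(&cfs_rq->leaf_list, rq_list);
        else
            /* case 2: no listed parent, append; a later (bottom-up) parent
             * enqueue lands after us by the same rule, keeping children
             * ahead of parents for the in-order update_shares() walk */
            list_add_tail(&cfs_rq->leaf_list, rq_list);

        cfs_rq->on_list = 1;
    }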
2010-11-18 | sched: Fix update_cfs_load() synchronization | Paul Turner | 1 | -12/+21
Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with the put path since nr_running accounting occurs at deactivation. It's also not safe to make the removal decision based on load_avg as this fails with both high periods and low shares. Resolve this by clipping history after 4 periods without activity. Note: the above will always occur from update_shares() since in the last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load is called. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.933428187@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Fix load corruption from update_cfs_shares() | Paul Turner | 1 | -8/+8
As part of enqueue_entity both a new entity weight and its contribution to the queuing cfs_rq / rq are updated. Since update_cfs_shares will only update the queueing weights when the entity is on_rq (which in this case it is not yet), there's a dependency loop here: update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*] Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as if we had accounted the enqueue already. This was also resulting in rq->load corruption previously. [*]: this dependency also exists when using the group cfs_rq w/ update_cfs_shares as the weight of the enqueued entity changes without the load being updated. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.844900206@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Make tg_shares_up() walk on-demand | Peter Zijlstra | 1 | -0/+58
Make tg_shares_up() use the active cgroup list; this means we cannot do a strict bottom-up walk of the hierarchy, but assuming it's a very wide tree with a small number of active groups it should be a win. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.754159484@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Implement on-demand (active) cfs_rq list | Peter Zijlstra | 1 | -6/+40
Make certain load-balance actions scale per number of active cgroups instead of the number of existing cgroups. This makes wakeup/sleep paths more expensive, but is a win for systems where the vast majority of existing cgroups are idle. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.666535048@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Rewrite tg_shares_up() | Peter Zijlstra | 1 | -57/+107
By tracking a per-cpu load-avg for each cfs_rq and folding it into a global task_group load on each tick we can rework tg_shares_up to be strictly per-cpu. This should improve cpu-cgroup performance for smp systems significantly. [ Paul: changed to use queueing cfs_rq + bug fixes ] Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.580480400@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Fix idle balancing | Nikhil Rao | 1 | -1/+3
An earlier commit reverts idle balancing throttling reset to fix a 30% regression in volanomark throughput. We still need to reset idle_stamp when we pull a task in newidle balance. Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290022924-3548-1-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18 | sched: Fix volanomark performance regression | Alex Shi | 1 | -4/+0
Commit fab4762 triggers excessive idle balancing, causing a ~30% loss in volanomark throughput. Remove idle balancing throttle reset. Originally-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1289928732.5169.211.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-11 | sched: Fix cross-sched-class wakeup preemption | Peter Zijlstra | 1 | -6/+0
Instead of dealing with sched classes inside each check_preempt_curr() implementation, pull out this logic into the generic wakeup preemption path. This fixes a hang in KVM (and others) where we are waiting for the stop machine thread to run ... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1288891946.2039.31.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-10 | sched: Use group weight, idle cpu metrics to fix imbalances during idle | Suresh Siddha | 1 | -3/+31
Currently we consider a sched domain to be well balanced when the imbalance is less than the domain's imbalance_pct. As the number of cores and threads increases, current values of imbalance_pct (for example 25% for a NUMA domain) are not enough to detect imbalances like: a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another socket, leading to an idle HT cpu. b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and 16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one socket and 7 on another socket, leaving one core in a socket idle whereas in another socket we have a core having both its HT siblings busy. While this issue can be fixed by decreasing the domain's imbalance_pct (by making it a function of the number of logical cpus in the domain), it can potentially cause more task migrations across sched groups in an overloaded case. Fix this by using imbalance_pct only during newly_idle and busy load balancing. And during idle load balancing, check if there is an imbalance in the number of idle cpus across the busiest and this sched_group, or if the busiest group has more tasks than its weight that the idle cpu in this_group can pull. Reported-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
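A condensed model of the idle-balancing criterion described above (the helper and field names are hypothetical, and the real find_busiest_group() logic has more cases):

    struct group_idle_stats {
        unsigned int idle_cpus;
        unsigned int nr_running;
        unsigned int group_weight;   /* number of CPUs in the group */
    };

    /* During CPU_IDLE balancing, ignore imbalance_pct and instead look at
     * idle-CPU counts: pull when the busiest group has fewer idle CPUs than
     * this group, or has more runnable tasks than CPUs. */
    static int idle_balance_wanted(const struct group_idle_stats *busiest,
                                   const struct group_idle_stats *local)
    {
        if (busiest->idle_cpus < local->idle_cpus)
            return 1;
        return busiest->nr_running > busiest->group_weight;
    }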
2010-10-29 | Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip | Linus Torvalds | 1 | -6/+19
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_stat: Update sched_info_queue/dequeue() code comments sched, cgroup: Fixup broken cgroup movement
2010-10-22 | sched, cgroup: Fixup broken cgroup movement | Peter Zijlstra | 1 | -6/+19
Dima noticed that we fail to correct the ->vruntime of sleeping tasks when we move them between cgroups. Reported-by: Dima Zavin <dima@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Mike Galbraith <efault@gmx.de> LKML-Reference: <1287150604.29097.1513.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-21 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip | Linus Torvalds | 1 | -17/+59
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits) sched: Export account_system_vtime() sched: Call tick_check_idle before __irq_enter sched: Remove irq time from available CPU power sched: Do not account irq time to current task x86: Add IRQ_TIME_ACCOUNTING sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time sched: Add a PF flag for ksoftirqd identification sched: Consolidate account_system_vtime extern declaration sched: Fix softirq time accounting sched: Drop group_capacity to 1 only if local group has extra capacity sched: Force balancing on newidle balance if local group has capacity sched: Set group_imb only a task can be pulled from the busiest cpu sched: Do not consider SCHED_IDLE tasks to be cache hot sched: Drop all load weight manipulation for RT tasks sched: Create special class for stop/migrate work sched: Unindent labels sched: Comment updates: fix default latency and granularity numbers tracing/sched: Add sched_pi_setprio tracepoint sched: Give CPU bound RT tasks preference sched: Try not to migrate higher priority RT tasks ...
2010-10-18 | sched: Remove irq time from available CPU power | Venkatesh Pallipadi | 1 | -1/+7
The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power, piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% of a single CPU spent on irq processing (hard+soft) on a 4-way system, and starting seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on the irq-busy CPU X and the remaining 7 tasks are spread among the other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
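Conceptually the scaling looks like the sketch below (an approximation of scale_rt_power()-style logic with illustrative names, not the patch itself): the irq/rt average eats into the power the load balancer sees for that CPU.

    /* Available power is the fraction of the averaging period not consumed
     * by rt + irq time, applied to the nominal power; never report zero. */
    static unsigned long scale_available_power(unsigned long nominal_power,
                                               unsigned long long rt_irq_avg_ns,
                                               unsigned long long period_ns)
    {
        unsigned long long available;

        if (rt_irq_avg_ns >= period_ns)
            return 1;

        available = period_ns - rt_irq_avg_ns;
        return (unsigned long)(nominal_power * available / period_ns);
    }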
2010-10-18 | sched: Do not account irq time to current task | Venkatesh Pallipadi | 1 | -3/+3
The scheduler accounts both softirq and interrupt processing times to the currently running task. This means that if the interrupt processing was for some other task in the system, the current task ends up being penalized, as it gets a shorter runtime than otherwise. Change sched task accounting to account only actual task time for the currently running task: update_curr() now computes delta_exec from rq->clock_task. Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's for later. This change will impact scheduling behavior in interrupt-heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends 75%+ of its time in irq processing and CPU 3 spends around 35% of its time running the nc task. Now, if I run another CPU-intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task, as the remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
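The core of the change is which clock the delta is taken from; a simplified sketch (struct and function names are placeholders for the rq/sched_entity fields):

    struct rq_clocks {
        unsigned long long clock;        /* advances through irq processing too */
        unsigned long long clock_task;   /* advances only while a task is running */
    };

    struct task_acct {
        unsigned long long exec_start;        /* snapshot of clock_task */
        unsigned long long sum_exec_runtime;
    };

    /* Charge only genuine task time: irq time never shows up in clock_task,
     * so it is never accounted to (or charged against) the current task. */
    static void account_exec(struct rq_clocks *rq, struct task_acct *curr)
    {
        unsigned long long delta_exec = rq->clock_task - curr->exec_start;

        curr->sum_exec_runtime += delta_exec;
        curr->exec_start = rq->clock_task;
    }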
2010-10-18 | sched: Drop group_capacity to 1 only if local group has extra capacity | Nikhil Rao | 1 | -2/+7
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
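The added condition is small; roughly (a simplified rendering of the update_sd_lb_stats() check, with field names trimmed down):

    struct group_cap_stats {
        unsigned long group_capacity;
        unsigned int nr_running;
    };

    /* With SD_PREFER_SIBLING set on the child domain, only clamp a remote
     * group's capacity to one task when the local group still has spare
     * capacity (nr_running < group_capacity) to absorb what gets pulled. */
    static void apply_prefer_sibling(int prefer_sibling, int local_group,
                                     const struct group_cap_stats *local,
                                     struct group_cap_stats *sgs)
    {
        int local_has_capacity = local->nr_running < local->group_capacity;

        if (prefer_sibling && !local_group && local_has_capacity &&
            sgs->group_capacity > 1)
            sgs->group_capacity = 1;   /* i.e. min(group_capacity, 1) */
    }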
2010-10-18 | sched: Force balancing on newidle balance if local group has capacity | Nikhil Rao | 1 | -3/+25
This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 | sched: Set group_imb only if a task can be pulled from the busiest cpu | Nikhil Rao | 1 | -5/+7
When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, so the group should not be considered imbalanced. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com> [ small code readability edits ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
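A sketch of the intent (simplified; the threshold and names are illustrative, only the nr_running check is the point of the patch):

    /* Flag the group as imbalanced only when the load spread between its
     * CPUs is large AND the heaviest CPU has more than one runnable task,
     * i.e. there is actually something that could be pulled away. */
    static int group_imbalanced(unsigned long max_cpu_load,
                                unsigned long min_cpu_load,
                                unsigned int max_nr_running,
                                unsigned long avg_load_per_task)
    {
        if (max_nr_running <= 1)
            return 0;   /* nothing movable on the busiest cpu */
        return (max_cpu_load - min_cpu_load) > 2 * avg_load_per_task;
    }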
2010-10-14 | sched: Comment updates: fix default latency and granularity numbers | Takuya Yoshikawa | 1 | -2/+2
Targeted preemption latency and minimal preemption granularity for CPU-bound tasks have been changed. This patch updates the comments about these values. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20101014160913.eb24fef4.yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-14 | Merge branch 'linus' into sched/core | Ingo Molnar | 1 | -1/+1
Merge reason: update from -rc5 to -almost-final Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-07 | sched: suppress RCU lockdep splat in task_fork_fair | Paul E. McKenney | 1 | -1/+4
> =================================================== > [ INFO: suspicious rcu_dereference_check() usage. ] > --------------------------------------------------- > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection! > > other info that might help us debug this: > > rcu_scheduler_active = 1, debug_locks = 1 > 1 lock held by ifup/23517: > #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108 > > stack backtrace: > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5 > Call Trace: > [<c075e219>] ? printk+0xf/0x16 > [<c0455842>] lockdep_rcu_dereference+0x74/0x7d > [<c0426854>] task_group+0x6d/0x79 > [<c042686e>] set_task_rq+0xe/0x57 > [<c042f79e>] task_fork_fair+0x57/0x108 > [<c042e965>] sched_fork+0x82/0xf9 > [<c04334b3>] copy_process+0x569/0xe8e > [<c0433ef0>] do_fork+0x118/0x262 > [<c076302f>] ? do_page_fault+0x16a/0x2cf > [<c044b80c>] ? up_read+0x16/0x2a > [<c04085ae>] sys_clone+0x1b/0x20 > [<c04030a5>] ptregs_clone+0x15/0x30 > [<c0402f1c>] ? sysenter_do_call+0x12/0x38 Here a newly created task is having its runqueue assigned. The new task is not yet on the tasklist, so cannot go away. This is therefore a false positive, suppress with an RCU read-side critical section. Reported-by: Ben Greear <greearb@candelatech.com Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Ben Greear <greearb@candelatech.com
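The fix is essentially an RCU read-side critical section around the runqueue assignment; a sketch of its shape (surrounding fork-time placement elided, cpu id passed in for brevity):

    static void task_fork_fair_sketch(struct task_struct *p, int this_cpu)
    {
        /* The new task is not yet on the tasklist and cannot go away, so the
         * task_group() lookup reached via set_task_rq() is safe; the RCU
         * read-side section documents that and quiets the lockdep check. */
        rcu_read_lock();
        __set_task_cpu(p, this_cpu);
        rcu_read_unlock();

        /* ... vruntime placement continues as before ... */
    }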
2010-09-21 | sched: Increment cache_nice_tries only on periodic lb | Venkatesh Pallipadi | 1 | -1/+8
scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance, when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in. Making the cache_nice_tries ignore failed new_idle balance seems to make better sense. With that only the failed load balance in periodic load balance gets accounted and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks. schedstat diff (after-before) excerpt from a workload that has frequent and short wakeup-idle pattern (:2 in cpu col below refers to NEWIDLE idx) This snapshot was across ~400 seconds. Without this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 306487 219575 73167 110069413 44583 19070 1172 218403 1:2 292139 194853 81421 120893383 50745 21902 1259 193594 2:2 283166 174607 91359 129699642 54931 23688 1287 173320 3:2 273998 161788 93991 132757146 57122 24351 1366 160422 4:2 289851 215692 62190 83398383 36377 13680 851 214841 5:2 316312 222146 77605 117582154 49948 20281 988 221158 6:2 297172 195596 83623 122133390 52801 21301 929 194667 7:2 283391 178078 86378 126622761 55122 22239 928 177150 8:2 297655 210359 72995 110246694 45798 19777 1125 209234 9:2 297357 202011 79363 119753474 50953 22088 1089 200922 10:2 278797 178703 83180 122514385 52969 22726 1128 177575 11:2 272661 167669 86978 127342327 55857 24342 1195 166474 12:2 293039 204031 73211 110282059 47285 19651 948 203083 13:2 289502 196762 76803 114712942 49339 20547 1016 195746 14:2 264446 169609 78292 115715605 50459 21017 982 168627 15:2 260968 163660 80142 116811793 51483 21281 1064 162596 With this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 272347 187380 77455 105420270 24975 1 953 186427 1:2 267276 172360 86234 116242264 28087 6 1028 171332 2:2 259769 156777 93281 123243134 30555 1 1043 155734 3:2 250870 143129 97627 127370868 32026 6 1188 141941 4:2 248422 177116 64096 78261112 22202 2 757 176359 5:2 275595 180683 84950 116075022 29400 6 778 179905 6:2 262418 162609 88944 119256898 31056 4 817 161792 7:2 252204 147946 92646 122388300 32879 4 824 147122 8:2 262335 172239 81631 110477214 26599 4 864 171375 9:2 261563 164775 88016 117203621 28331 3 849 163926 10:2 243389 140949 93379 121353071 29585 2 909 140040 11:2 242795 134651 98310 124768957 30895 2 1016 133635 12:2 255234 166622 79843 104696912 26483 4 746 165876 13:2 244944 151595 83855 109808099 27787 3 801 150794 14:2 241301 140982 89935 116954383 30403 6 845 140137 15:2 232271 128564 92821 119185207 31207 4 1416 127148 Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
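The change itself boils down to one added condition in the failed-balance path; roughly (simplified from load_balance(), not the exact hunk):

    /* Nothing was pulled: only periodic (busy/idle) balancing bumps the
     * failure counter that is compared against sd->cache_nice_tries, so a
     * string of failed new-idle balances no longer escalates into
     * cache-hot / active migration. */
    if (!ld_moved) {
        if (idle != CPU_NEWLY_IDLE)
            sd->nr_balance_failed++;
        /* ... possible active-balance kick follows ... */
    }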
2010-09-21 | sched: Fix nohz balance kick | Suresh Siddha | 1 | -1/+1
There's a situation where the nohz balancer will try to wake itself: cpu-x is idle and is also the ilb_cpu; it gets a scheduler tick during idle, and nohz_kick_needed() in trigger_load_balance() checks rq_x->nr_running, which might not be zero (because of someone waking a task on this rq, etc.). This leads to cpu-x sending a kick to itself, which can cause a lockup. Avoid this by not marking ourselves eligible for kicking. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-13 | sched: Improve latencies under load by decreasing minimum scheduling granularity | Ingo Molnar | 1 | -3/+3
Mathieu reported bad latencies with make -j10 kind of kbuild workloads - which is mostly caused by us scheduling with a too coarse granularity. Reduce the minimum granularity some more, to make sure we can meet the latency target. I got the following results (make -j10 kbuild load, average of 3 runs): vanilla: maximum latency: 38278.9 µs average latency: 7730.1 µs patched: maximum latency: 22702.1 µs average latency: 6684.8 µs Mathieu also measured it: | | * wakeup-latency.c (SIGEV_THREAD) with make -j10 | | - Mainline 2.6.35.2 kernel | | maximum latency: 45762.1 µs | average latency: 7348.6 µs | | - With only Peter's smaller min_gran (shown below): | | maximum latency: 29100.6 µs | average latency: 6684.1 µs | Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <AANLkTi=8m4g01wZPacySoF7U0PevTNVgJoZZrHiUD-pN@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-11 | Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip | Linus Torvalds | 1 | -2/+0
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, tsc: Fix a preemption leak in restore_sched_clock_state() sched: Move sched_avg_update() to update_cpu_load()
2010-09-09 | sched: Move sched_avg_update() to update_cpu_load() | Suresh Siddha | 1 | -2/+0
Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting called from scale_rt_power() (in the load balance context) which doesn't take rq->lock. Fix it by moving the sched_avg_update() to more appropriate update_cpu_load() where the CFS load gets updated as well. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-05 | gcc-4.6: kernel/*: Fix unused but set warnings | Andi Kleen | 1 | -2/+1
No real bugs I believe, just some dead code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: andi@firstfloor.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-08-20 | sched: Fix rq->clock synchronization when migrating tasks | Peter Zijlstra | 1 | -0/+2
sched_fork() -- we do task placement in ->task_fork_fair() ensure we update_rq_clock() so we work with current time. We leave the vruntime in relative state, so the time delay until wake_up_new_task() doesn't matter. wake_up_new_task() -- Since task_fork_fair() left p->vruntime in relative state we can safely migrate, the activate_task() on the remote rq will call update_rq_clock() and causes the clock to be synced (enough). Tested-by: Jack Daniel <wanders.thirst@gmail.com> Tested-by: Philby John <pjohn@mvista.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1281002322.1923.1708.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-21 | Merge branch 'linus' into sched/core | Ingo Molnar | 1 | -0/+2
Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-17 | sched: Reduce update_group_power() calls | Peter Zijlstra | 1 | -6/+6
Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
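The resulting gate looks roughly like the fragment below (an approximation with simplified control flow; the real function also handles the balance_cpu handover):

    /* Only the local group refreshes cpu_power, and not on the frequent
     * CPU_NEWLY_IDLE invocations where a slightly stale value is fine. */
    if (local_group && idle != CPU_NEWLY_IDLE && balance_cpu == this_cpu)
        update_group_power(sd, this_cpu);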
2010-07-17 | sched: Update rq->clock for nohz balanced cpus | Suresh Siddha | 1 | -0/+1
Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-29 | sched: Fix spelling of sibling | Michael Neuling | 1 | -1/+1
No logic changes, only spelling. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org Cc: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-23 | rcu: apply RCU protection to wake_affine() | Daniel J Blueman | 1 | -0/+2
The task_group() function returns a pointer that must be protected by either RCU, the ->alloc_lock, or the cgroup lock (see the rcu_dereference_check() in task_subsys_state(), which is invoked by task_group()). The wake_affine() function currently does none of these, which means that a concurrent update would be within its rights to free the structure returned by task_group(). Because wake_affine() uses this structure only to compute load-balancing heuristics, there is no reason to acquire either of the two locks. Therefore, this commit introduces an RCU read-side critical section that starts before the first call to task_group() and ends after the last use of the "tg" pointer returned from task_group(). Thanks to Li Zefan for pointing out the need to extend the RCU read-side critical section from that proposed by the original patch. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-06-18 | sched: Fix comments to make them DocBook happy | Michael Neuling | 1 | -4/+5
Docbook fails in sched_fair.c due to comments added in the asymmetric packing patch series. This fixes these errors. No code changes. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <24737.1276135581@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-18 | sched: Fix fix_small_capacity | Michael Neuling | 1 | -1/+1
The CPU power test is the wrong way around in fix_small_capacity. This was due to a small change made between the patch posted on lkml and what was taken upstream. This patch fixes asymmetric packing for POWER7. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <12629.1276124617@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-09 | sched: Add asymmetric group packing option for sibling domain | Michael Neuling | 1 | -17/+122
Check to see if the group is packed in a sched domain. This is primarily intended to be used at the sibling level. Some cores like POWER7 prefer to use lower numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when higher threads are idle. When in lower SMT modes, the threads will perform better since they share fewer core resources. Hence when we have idle threads, we want them to be the higher ones. This adds a hook into f_b_g() called check_asym_packing() to check the packing. This packing function is run on idle threads. It checks to see if the busiest CPU in this domain (core in the P7 case) has a higher CPU number than the CPU the packing function is being run on. If it is, calculate the imbalance and return the higher busier thread as the busiest group to f_b_g(). Here we are assuming a lower CPU number will be equivalent to a lower SMT thread number. It also creates a new SD_ASYM_PACKING flag to enable this feature at any scheduler domain level. It also creates an arch hook to enable this feature at the sibling level. The default function doesn't enable this feature. Based heavily on a patch from Peter Zijlstra. Fixes from Srivatsa Vaddagiri. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-09 | sched: Fix capacity calculations for SMT4 | Srivatsa Vaddagiri | 1 | -10/+43
Handle cpu capacity being reported as 0 on cores with a larger number of hardware threads. For example, on a Power7 core with 4 hardware threads, core power is 1177 and thus the power of each hardware thread is 1177/4 = 294. This low power can lead to the capacity of each hardware thread being calculated as 0, which leads to tasks bouncing within the core madly! Fix this by reporting capacity for hardware threads as 1, provided their power is not scaled down significantly because of frequency scaling or real-time tasks' usage of the cpu. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100608045702.21D03CC895@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
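A worked version of the arithmetic plus the clamp (illustrative helper, not the exact fix_small_capacity() code): with POWER7's numbers, 1177 / 4 = 294 per thread, and 294 / 1024 rounds to 0 capacity.

    #define SCHED_LOAD_SCALE 1024UL

    static unsigned long smt_thread_capacity(unsigned long cpu_power,
                                             int power_mostly_unscaled)
    {
        /* round-to-nearest capacity: 294/1024 -> 0 for a POWER7 thread */
        unsigned long capacity = (cpu_power + SCHED_LOAD_SCALE / 2) / SCHED_LOAD_SCALE;

        /* Report the thread as able to take one task, as long as its power
         * was not scaled down much by frequency scaling or rt usage. */
        if (!capacity && power_mostly_unscaled)
            capacity = 1;

        return capacity;
    }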
2010-06-09 | sched: Change nohz idle load balancing logic to push model | Venkatesh Pallipadi | 1 | -131/+198
In the new push model, all idle CPUs indeed go into nohz mode. There is still the concept of idle load balancer (performing the load balancing on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz balancer when any of the nohz CPUs need idle load balancing. The kickee CPU does the idle load balancing on behalf of all idle CPUs instead of the normal idle balance. This addresses the below two problems with the current nohz ilb logic: * the idle load balancer continued to have periodic ticks during idle and wokeup frequently, even though it did not have any rebalancing to do on behalf of any of the idle CPUs. * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic wakeup can result in a periodic additional interrupt on a CPU doing the timer broadcast. Also currently we are migrating the unpinned timers from an idle to the cpu doing idle load balancing (when all the cpus in the system are idle, there is no idle load balancing cpu and timers get added to the same idle cpu where the request was made. So the existing optimization works only on semi idle system). And In semi idle system, we no longer have periodic ticks on the idle load balancer CPU. Using that cpu will add more delays to the timers than intended (as that cpu's timer base may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns during boot etc. For now, in the semi idle case, use the nearest busy cpu for migrating timers from an idle cpu. This is good for power-savings anyway. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-09 | sched: Avoid side-effect of tickless idle on update_cpu_load | Venkatesh Pallipadi | 1 | -1/+4
tickless idle has a negative side effect on update_cpu_load(), which in turn can affect load balancing behavior. update_cpu_load() is supposed to be called every tick, to keep track of various load indicies. With tickless idle, there are no scheduler ticks called on the idle CPUs. Idle CPUs may still do load balancing (with idle_load_balance CPU) using the stale cpu_load. It will also cause problems when all CPUs go idle for a while and become active again. In this case loads would not degrade as expected. This is how rq->nr_load_updates change looks like under different conditions: <cpu_num> <nr_load_updates change> All CPUS idle for 10 seconds (HZ=1000) 0 1621 10 496 11 139 12 875 13 1672 14 12 15 21 1 1472 2 2426 3 1161 4 2108 5 1525 6 701 7 249 8 766 9 1967 One CPU busy rest idle for 10 seconds 0 10003 10 601 11 95 12 966 13 1597 14 114 15 98 1 3457 2 93 3 6679 4 1425 5 1479 6 595 7 193 8 633 9 1687 All CPUs busy for 10 seconds 0 10026 10 10026 11 10026 12 10026 13 10025 14 10025 15 10025 1 10026 2 10026 3 10026 4 10026 5 10026 6 10026 7 10026 8 10026 9 10026 That is update_cpu_load works properly only when all CPUs are busy. If all are idle, all the CPUs get way lower updates. And when few CPUs are busy and rest are idle, only busy and ilb CPU does proper updates and rest of the idle CPUs will do lower updates. The patch keeps track of when a last update was done and fixes up the load avg based on current time. On one of my test system SPECjbb with warehouse 1..numcpus, patch improves throughput numbers by ~1% (average of 6 runs). On another test system (with different domain hierarchy) there is no noticable change in perf. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-01 | sched: Fix wake_affine() vs RT tasks | Peter Zijlstra | 1 | -6/+16
Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT tasks), wake_affine() goes funny on RT tasks due to them still having a !0 weight and wake_affine() still subtracts that from the rq weight. Since nobody should be using se->weight for RT tasks, set the value to zero. Also, since we now use ->cpu_power to normalize rq weights to account for RT cpu usage, add that factor into the imbalance computation. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1275316109.27810.22969.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-06 | sched: replace migration_thread with cpu_stop | Tejun Heo | 1 | -14/+34
Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>