linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2008-01-22	sched: group scheduler, set uid share fix	Ingo Molnar	1	-0/+8
	setting cpu share to 1 causes hangs, as reported in: http://bugzilla.kernel.org/show_bug.cgi?id=9779 as the default share is 1024, the values of 0 and 1 can indeed cause problems. Limit it to 2 or higher values. These values can only be set by the root user - but still it makes sense to protect against nonsensical values. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-09	show_task: real_parent	Roland McGrath	1	-1/+1
	The show_task function invoked by sysrq-t et al displays the pid and parent's pid of each task. It seems more useful to show the actual process hierarchy here than who is using ptrace on each process. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-18	sched: touch softlockup watchdog after idling	Ingo Molnar	1	-0/+1
	touch softlockup watchdog after idling. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18	sched: fix crash on ia64, introduce task_current()	Dmitry Adamushko	1	-6/+11
	Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and sched_move_task()) must handle a given task differently in case it's the 'rq->curr' task on its run-queue. The task_running() interface is not suitable for determining such tasks for platforms with one of the following options: #define __ARCH_WANT_UNLOCKED_CTXSW #define __ARCH_WANT_INTERRUPTS_ON_CTXSW Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but such a task is not necessarily 'rq->curr'. The detailed explanation is available here: https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2007-12-07	sched: enable early use of sched_clock()	Ingo Molnar	1	-1/+6
	some platforms have sched_clock() implementations that cannot be called very early during wakeup. If it's called it might hang or crash in hard to debug ways. So only call update_rq_clock() [which calls sched_clock()] if sched_init() has already been called. (rq->idle is NULL before the scheduler is initialized.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-05	sched: style cleanups	Ingo Molnar	1	-64/+68
	style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 23680 2542 28 26250 668a sched.o.before 23680 2542 28 26250 668a sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-04	sched: fix crash in sys_sched_rr_get_interval()	Ingo Molnar	1	-5/+9
	Luiz Fernando N. Capitulino reported that sched_rr_get_interval() crashes for SCHED_OTHER tasks that are on an idle runqueue. The fix is to return a 0 timeslice for tasks that are on an idle runqueue. (and which are not running, obviously) this also shrinks the code a bit: text data bss dec hex filename 47903 3934 336 52173 cbcd sched.o.before 47885 3934 336 52155 cbbb sched.o.after Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-02	sched: cpu accounting controller (V2)	Srivatsa Vaddagiri	1	-26/+129
	Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for us, which provided a cpu accounting resource controller. This feature would be useful if someone wants to group tasks only for accounting purpose and doesnt really want to exercise any control over their cpu consumption. The patch below reintroduces the feature. It is based on Paul Menage's original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with these differences: - Removed load average information. I felt it needs more thought (esp to deal with SMP and virtualized platforms) and can be added for 2.6.25 after more discussions. - Convert group cpu usage to be nanosecond accurate (as rest of the cfs stats are) and invoke cpuacct_charge() from the respective scheduler classes - Make accounting scalable on SMP systems by splitting the usage counter to be per-cpu - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the code is not big enough to warrant a new file and also this rightly needs to live inside the scheduler. Also things like accessing rq->lock while reading cpu usage becomes easier if the code lived in kernel/sched.c) The patch also modifies the cpu controller not to provide the same accounting information. Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran some simple tests like cpuspin (spin on the cpu), ran several tasks in the same group and timed them. Compared their time stamps with cpuacct.usage. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28	sched: clean up, move __sched_text_start/end to sched.h	Ingo Molnar	1	-3/+0
	move __sched_text_start/end to sched.h. No code changed: text data bss dec hex filename 26582 2310 28 28920 70f8 sched.o.before 26582 2310 28 28920 70f8 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28	sched: clean up sd_alloc_ctl_cpu_table() definition	Ingo Molnar	1	-1/+1
	clean up sd_alloc_ctl_cpu_table() definition. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15	sched: reorder SCHED_FEAT_ bits	Ingo Molnar	1	-6/+6
	reorder SCHED_FEAT_ bits so that the used ones come first. Makes tuning instructions easier. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15	sched: remove activate_idle_task()	Dmitry Adamushko	1	-18/+4
	cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not at the beginning of the queue. So get rid of activate_idle_task() and make use of activate_task() instead. It is the same as activate_task(), except for the update_rq_clock(rq) call that is redundant. Code size goes down: text data bss dec hex filename 47853 3934 336 52123 cb9b sched.o.before 47828 3934 336 52098 cb82 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15	sched: fix __set_task_cpu() SMP race	Dmitry Adamushko	1	-7/+13
	Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core system, which crashes can only be explained via runqueue corruption. there is a narrow SMP race in __set_task_cpu(): after ->cpu is set up to a new value, task_rq_lock(p, ...) can be successfuly executed on another CPU. We must ensure that updates of per-task data have been completed by this moment. this bug has been hiding in the Linux scheduler for an eternity (we never had any explicit barrier for task->cpu in set_task_cpu() - so the bug was introduced in 2.5.1), but only became visible via set_task_cfs_rq() being accidentally put after the task->cpu update. It also probably needs a sufficiently out-of-order CPU to trigger. Reported-by: Grant Wilson <grant.wilson@zen.co.uk> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15	sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED	Oleg Nesterov	1	-1/+3
	Suppose that the SCHED_FIFO task does switch_uid(new_user); Now, p->se.cfs_rq and p->se.parent both point into the old user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq() for !fair_sched_class case. Suppose that old user_struct/task_group is freed/reused, and the task does sched_setscheduler(SCHED_NORMAL); __setscheduler() sets fair_sched_class, but doesn't update ->se.cfs_rq/parent which point to the freed memory. This means that check_preempt_wakeup() doing while (!is_same_group(se, pse)) { se = parent_entity(se); pse = parent_entity(pse); } may OOPS in a similar way if rq->curr or p did something like above. Perhaps we need something like the patch below, note that __setscheduler() can't do set_task_cfs_rq(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15	sched: fix accounting of interrupts during guest execution on s390	Christian Borntraeger	1	-4/+2
	Currently the scheduler checks for PF_VCPU to decide if this timeslice has to be accounted as guest time. On s390 host interrupts are not disabled during guest execution. This causes theses interrupts to be accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution is to check if an interrupt triggered account_system_time. As the tick is timer interrupt based, we have to subtract hardirq_offset. I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on x86_64. Seems to work. CC: Avi Kivity <avi@qumranet.com> CC: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-14	revert "Task Control Groups: example CPU accounting subsystem"	Andrew Morton	1	-11/+3
	Revert 62d0df64065e7c135d0002f069444fbdfc64768f. This was originally intended as a simple initial example of how to create a control groups subsystem; it wasn't intended for mainline, but I didn't make this clear enough to Andrew. The CFS cgroup subsystem now has better functionality for the per-cgroup usage accounting (based directly on CFS stats) than the "usage" status file in this patch, and the "load" status file is rather simplistic - although having a per-cgroup load average report would be a useful feature, I don't believe this patch actually provides it. If it gets into the final 2.6.24 we'd probably have to support this interface for ever. Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-09	sched: proper prototype for kernel/sched.c:migration_init()	Adrian Bunk	1	-3/+1
	This patch adds a proper prototype for migration_init() in include/linux/sched.h Since there's no point in always returning 0 to a caller that doesn't check the return value it also changes the function to return void. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09	sched: avoid large irq-latencies in smp-balancing	Peter Zijlstra	1	-5/+10
	SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs are large this can cause large irq-latencies. Limit the nr of iterations on each run. This fixes a scheduling latency regression reported by the -rt folks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09	sched: remove PREEMPT_RESTRICT	Ingo Molnar	1	-3/+1
	remove PREEMPT_RESTRICT. (this is a separate commit so that any regression related to the removal itself is bisectable) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09	sched: turn off PREEMPT_RESTRICT	Ingo Molnar	1	-1/+1
	PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup related preemption. It has a disadvantage though, it can prevent legitimate wakeups if a task is 'unlucky' to be hit too early by a tick that clears peer_preempt. Now that the wakeup preemption has been cleaned up we dont seem to have excessive preemptions anymore, so this feature can be turned off. (and removed in the next patch) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09	sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC	Eric Dumazet	1	-4/+4
	1) hardcoded 1000000000 value is used five times in places where NSEC_PER_SEC might be more readable. 2) A conversion from nsec to msec uses the hardcoded 1000000 value, which is a candidate for NSEC_PER_MSEC. no code changed: text data bss dec hex filename 44359 3326 36 47721 ba69 sched.o.before 44359 3326 36 47721 ba69 sched.o.after Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09	sched: reintroduce SMP tunings again	Ingo Molnar	1	-0/+28
	Yanmin Zhang reported an aim7 regression and bisected it down to: \| commit 38ad464d410dadceda1563f36bdb0be7fe4c8938 \| Author: Ingo Molnar <mingo@elte.hu> \| Date: Mon Oct 15 17:00:02 2007 +0200 \| \| sched: uniform tunings \| \| use the same defaults on both UP and SMP. fix this by reintroducing similar SMP tunings again. This resolves the regression. (also update the comments to match the ilog2(nr_cpus) tuning effect) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29	sched: fix style in kernel/sched.c	Ingo Molnar	1	-9/+9
	fallout of recent commits: small coding style fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29	sched: report CPU usage in CFS cgroup directories	Paul Menage	1	-5/+33
	Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks [ mingo@elte.hu: style cleanups. ] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29	sched: move rcu_head to task_group struct	Srivatsa Vaddagiri	1	-4/+4
	Peter Zijlstra noticed that the rcu_head object need not be present in every cfs_rq of a group. Move it to the task_group structure instead. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29	sched: fix incorrect assumption that cpu 0 exists	James Bottomley	1	-2/+2
	This patch: commit 9b5b77512dce239fa168183fa71896712232e95a Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Mon Oct 15 17:00:09 2007 +0200 sched: clean up code under CONFIG_FAIR_GROUP_SCHED Introduced an assumption of the existence of CPU0 via this line cfs_rq = tg->cfs_rq[0]; If you have no CPU0, that will be NULL. The fix seems to be just to take whatever cfs_rq queue comes out of the for_each_possible_cpu() loop, since they're all equally good for the destruction operation. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29	sched: make kernel/sched.c:account_guest_time() static	Adrian Bunk	1	-1/+1
	account_guest_time() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: isolate SMP balancing code a bit more	Peter Williams	1	-17/+0
	At the moment, a lot of load balancing code that is irrelevant to non SMP systems gets included during non SMP builds. This patch addresses this issue and reduces the binary size on non SMP systems: text data bss dec hex filename 10983 28 1192 12203 2fab sched.o.before 10739 28 1192 11959 2eb7 sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: reduce balance-tasks overhead	Peter Williams	1	-32/+67
	At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and makes the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: clean up some control group code	Paul Menage	1	-35/+18
	- replace "cont" with "cgrp" in a few places in the CFS cgroup code, - use write_uint rather than write for cpu.shares write function Signed-off-by: Paul Menage <menage@google.com> Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: use show_regs() to improve __schedule_bug() output	Satyam Sharma	1	-3/+11
	A full register dump along with stack backtrace would make the "scheduling while atomic" message more helpful. Use show_regs() instead of dump_stack() for this. We already know we're atomic in here (that is why this function was called) so show_regs()'s atomicity expectations are guaranteed. Also, modify the output of the "BUG: scheduling while atomic:" header a bit to keep task->comm and task->pid together and preempt_count() after them. Signed-off-by: Satyam Sharma <satyam@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: clean up sched_domain_debug()	Ingo Molnar	1	-73/+73
	clean up sched_domain_debug(). this also shrinks the code a bit: text data bss dec hex filename 50474 4306 480 55260 d7dc sched.o.before 50404 4306 480 55190 d796 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: fix fastcall mismatch in completion APIs	Ingo Molnar	1	-5/+5
	Jeff Dike noticed that wait_for_completion_interruptible()'s prototype had a mismatched fastcall. Fix this by removing the fastcall attributes from all the completion APIs. Found-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24	sched: fix sched_domain sysctl registration again	Milton Miller	1	-5/+20
	commit 029190c515f15f512ac85de8fc686d4dbd0ae731 (cpuset sched_load_balance flag) was not tested SCHED_DEBUG enabled as committed as it dereferences NULL when used and it reordered the sysctl registration to cause it to never show any domains or their tunables. Fixes: 1) restore arch_init_sched_domains ordering we can't walk the domains before we build them presently we register cpus with empty directories (no domain directories or files). 2) make unregister_sched_domain_sysctl do nothing when already unregistered detach_destroy_domains is now called one set of cpus at a time unregister_syctl dereferences NULL if called with a null. While the the function would always dereference null if called twice, in the previous code it was always called once and then was followed a register. So only the hidden bug of the sysctl_root_table not being allocated followed by an attempt to free it would have shown the error. 3) always call unregister and register in partition_sched_domains The code is "smart" about unregistering only needed domains. Since we aren't guaranteed any calls to unregister, always unregister. Without calling register on the way out we will not have a table or any sysctl tree. 4) warn if register is called without unregistering The previous table memory is lost, leaving pointers to the later freed memory in sysctl and leaking the memory of the tables. Before this patch on a 2-core 4-thread box compiled for SMT and NUMA, the domains appear empty (there are actually 3 levels per cpu). And as soon as two domains a null pointer is dereferenced (unreliable in this case is stack garbage): bu19a:~# ls -R /proc/sys/kernel/sched_domain/ /proc/sys/kernel/sched_domain/: cpu0 cpu1 cpu2 cpu3 /proc/sys/kernel/sched_domain/cpu0: /proc/sys/kernel/sched_domain/cpu1: /proc/sys/kernel/sched_domain/cpu2: /proc/sys/kernel/sched_domain/cpu3: bu19a:~# mkdir /dev/cpuset bu19a:~# mount -tcpuset cpuset /dev/cpuset/ bu19a:~# cd /dev/cpuset/ bu19a:/dev/cpuset# echo 0 > sched_load_balance bu19a:/dev/cpuset# mkdir one bu19a:/dev/cpuset# echo 1 > one/cpus bu19a:/dev/cpuset# echo 0 > one/sched_load_balance Unable to handle kernel paging request for data at address 0x00000018 Faulting instruction address: 0xc00000000006b608 NIP: c00000000006b608 LR: c00000000006b604 CTR: 0000000000000000 REGS: c000000018d973f0 TRAP: 0300 Not tainted (2.6.23-bml) MSR: 9000000000009032 <EE,ME,IR,DR> CR: 28242442 XER: 00000000 DAR: 0000000000000018, DSISR: 0000000040000000 TASK = c00000001912e340[1987] 'bash' THREAD: c000000018d94000 CPU: 2 .. NIP [c00000000006b608] .unregister_sysctl_table+0x38/0x110 LR [c00000000006b604] .unregister_sysctl_table+0x34/0x110 Call Trace: [c000000018d97670] [c000000007017270] 0xc000000007017270 (unreliable) [c000000018d97720] [c000000000058710] .detach_destroy_domains+0x30/0xb0 [c000000018d977b0] [c00000000005cf1c] .partition_sched_domains+0x1bc/0x230 [c000000018d97870] [c00000000009fdc4] .rebuild_sched_domains+0xb4/0x4c0 [c000000018d97970] [c0000000000a02e8] .update_flag+0x118/0x170 [c000000018d97a80] [c0000000000a1768] .cpuset_common_file_write+0x568/0x820 [c000000018d97c00] [c00000000009d95c] .cgroup_file_write+0x7c/0x180 [c000000018d97cf0] [c0000000000e76b8] .vfs_write+0xe8/0x1b0 [c000000018d97d90] [c0000000000e810c] .sys_write+0x4c/0x90 [c000000018d97e30] [c00000000000852c] syscall_exit+0x0/0x40 Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-22	sched: don't clear PF_VCPU in scheduler	Laurent Vivier	1	-1/+0
	KVM clears it by itself now, and for s390 this is plain wrong. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-20	kernel/sched.c: remove bogus comment from account_user_time	Michael Neuling	1	-1/+0
	hardirq_offset is no longer needed. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19	Fix misspellings of "system", "controller", "interrupt" and "necessary".	Robert P. J. Day	1	-2/+2
	Fix the various misspellings of "system", controller", "interrupt" and "[un]necessary". Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19	Hook up group scheduler with control groups	Srivatsa Vaddagiri	1	-0/+121
	Enable "cgroup" (formerly containers) based fair group scheduling. This will let administrator create arbitrary groups of tasks (using "cgroup" pseudo filesystem) and control their cpu bandwidth usage. [akpm@linux-foundation.org: fix cpp condition] Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	hotplug cpu: migrate a task within its cpuset	Cliff Wickman	1	-1/+11
	When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have been running on that cpu. Currently, such a task is migrated: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any cpu which is both online and among that task's cpus_allowed It is typical of a multithreaded application running on a large NUMA system to have its tasks confined to a cpuset so as to cluster them near the memory that they share. Furthermore, it is typical to explicitly place such a task on a specific cpu in that cpuset. And in that case the task's cpus_allowed includes only a single cpu. This patch would insert a preference to migrate such a task to some cpu within its cpuset (and set its cpus_allowed to its entire cpuset). With this patch, migrate the task to: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any online cpu within the task's cpuset 3) to any cpu which is both online and among that task's cpus_allowed In order to do this, move_task_off_dead_cpu() must make a call to cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will not block. (name change - per Oleg's suggestion) Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set the cpuset mutex during the whole migrate_live_tasks() and migrate_dead_tasks() procedure. [akpm@linux-foundation.org: build fix] [pj@sgi.com: Fix indentation and spacing] Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	Use helpers to obtain task pid in printks	Pavel Emelyanov	1	-3/+4
	The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	Fix tsk->exit_state usage	Eugene Teo	1	-1/+1
	tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test is the same as tsk->exit_state & (EXIT_ZOMBIE \| EXIT_DEAD), so just testing tsk->exit_state is sufficient. Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	Fix cpusets update_cpumask	Paul Menage	1	-0/+13
	Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks: - collect batches of tasks under tasklist_lock and then call set_cpus_allowed() on them outside the lock (since this can sleep). - add a simple generic priority heap type to allow efficient collection of batches of tasks to be processed without duplicating or missing any tasks in subsequent batches. - make "cpus" file update a no-op if the mask hasn't changed - fix race between update_cpumask() and sched_setaffinity() by making sched_setaffinity() post-check that it's not running on any cpus outside cpuset_cpus_allowed(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	cpuset sched_load_balance flag	Paul Jackson	1	-14/+81
	Add a new per-cpuset flag called 'sched_load_balance'. When enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write "0" to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset. Now even if this flag is disabled for some cpuset, the kernel may still have to load balance some or all the CPUs in that cpuset, if some overlapping cpuset has its sched_load_balance flag enabled. If there are some CPUs that are not in any cpuset whose sched_load_balance flag is enabled, the kernel scheduler will not load balance tasks to those CPUs. Moreover the kernel will partition the 'sched domains' (non-overlapping sets of CPUs over which load balancing is attempted) into the finest granularity partition that it can find, while still keeping any two CPUs that are in the same shed_load_balance enabled cpuset in the same element of the partition. This serves two purposes: 1) It provides a mechanism for real time isolation of some CPUs, and 2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs. This mechanism replaces the earlier overloading of the per-cpuset flag 'cpu_exclusive', which overloading was removed in an earlier patch: cpuset-remove-sched-domain-hooks-from-cpusets See further the Documentation and comments in the code itself. [akpm@linux-foundation.org: don't be weird] Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	Uninline find_task_by_xxx set of functions	Pavel Emelyanov	1	-2/+1
	The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	pid namespaces: changes to show virtual ids to user	Pavel Emelyanov	1	-2/+4
	This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19	Task Control Groups: example CPU accounting subsystem	Paul Menage	1	-3/+11
	This example demonstrates how to use the generic cgroup subsystem for a simple resource tracker that counts, for the processes in a cgroup, the total CPU time used and the %CPU used in the last complete 10 second interval. Portions contributed by Balbir Singh <balbir@in.ibm.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched	Linus Torvalds	1	-33/+40
	* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: reduce schedstat variable overhead a bit sched: add KERN_CONT annotation sched: cleanup, make struct rq comments more consistent sched: cleanup, fix spacing sched: fix return value of wait_for_completion_interruptible()
2007-10-18	Add scaled time to taskstats based process accounting	Michael Neuling	1	-0/+21
	This adds items to the taststats struct to account for user and system time based on scaling the CPU frequency and instruction issue rates. Adds account_(user\|system)_time_scaled callbacks which architectures can use to account for time using this mechanism. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18	sched: reduce schedstat variable overhead a bit	Ken Chen	1	-10/+10
	schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint. Most of the fields probably don't need to be full 64-bit on 64-bit arch. Rolling over 4 billion events will most like take a long time and user space tool can be made to accommodate that. I'm proposing kernel to cut back most of variable width on 64-bit system. (note, the following patch doesn't affect 32-bit system). Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18	sched: add KERN_CONT annotation	Ingo Molnar	1	-11/+11
	printk: add the KERN_CONT annotation (which is empty string but via which checkpatch.pl can notice that the lacking KERN_ level is fine). Signed-off-by: Ingo Molnar <mingo@elte.hu>