aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/posix-cpu-timers.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2008-09-23timers: fix itimer/many thread hang, v2Frank Mayhar1-89/+64
This is the second resubmission of the posix timer rework patch, posted a few days ago. This includes the changes from the previous resubmittion, which addressed Oleg Nesterov's comments, removing the RCU stuff from the patch and un-inlining the thread_group_cputime() function for SMP. In addition, per Ingo Molnar it simplifies the UP code, consolidating much of it with the SMP version and depending on lower-level SMP/UP handling to take care of the differences. It also cleans up some UP compile errors, moves the scheduler stats-related macros into kernel/sched_stats.h, cleans up a merge error in kernel/fork.c and has a few other minor fixes and cleanups as suggested by Oleg and Ingo. Thanks for the review, guys. Signed-off-by: Frank Mayhar <fmayhar@google.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-14timers: fix itimer/many thread hang, cleanupsIngo Molnar1-3/+3
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-09-14timers: fix itimer/many thread hangFrank Mayhar1-219/+252
Overview This patch reworks the handling of POSIX CPU timers, including the ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together with the help of Roland McGrath, the owner and original writer of this code. The problem we ran into, and the reason for this rework, has to do with using a profiling timer in a process with a large number of threads. It appears that the performance of the old implementation of run_posix_cpu_timers() was at least O(n*3) (where "n" is the number of threads in a process) or worse. Everything is fine with an increasing number of threads until the time taken for that routine to run becomes the same as or greater than the tick time, at which point things degrade rather quickly. This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF." Code Changes This rework corrects the implementation of run_posix_cpu_timers() to make it run in constant time for a particular machine. (Performance may vary between one machine and another depending upon whether the kernel is built as single- or multiprocessor and, in the latter case, depending upon the number of running processors.) To do this, at each tick we now update fields in signal_struct as well as task_struct. The run_posix_cpu_timers() function uses those fields to make its decisions. We define a new structure, "task_cputime," to contain user, system and scheduler times and use these in appropriate places: struct task_cputime { cputime_t utime; cputime_t stime; unsigned long long sum_exec_runtime; }; This is included in the structure "thread_group_cputime," which is a new substructure of signal_struct and which varies for uniprocessor versus multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as a simple substructure, while for multiprocessor kernels it is a pointer: struct thread_group_cputime { struct task_cputime totals; }; struct thread_group_cputime { struct task_cputime *totals; }; We also add a new task_cputime substructure directly to signal_struct, to cache the earliest expiration of process-wide timers, and task_cputime also replaces the it_*_expires fields of task_struct (used for earliest expiration of thread timers). The "thread_group_cputime" structure contains process-wide timers that are updated via account_user_time() and friends. In the non-SMP case the structure is a simple aggregator; unfortunately in the SMP case that simplicity was not achievable due to cache-line contention between CPUs (in one measured case performance was actually _worse_ on a 16-cpu system than the same test on a 4-cpu system, due to this contention). For SMP, the thread_group_cputime counters are maintained as a per-cpu structure allocated using alloc_percpu(). The timer functions update only the timer field in the structure corresponding to the running CPU, obtained using per_cpu_ptr(). We define a set of inline functions in sched.h that we use to maintain the thread_group_cputime structure and hide the differences between UP and SMP implementations from the rest of the kernel. The thread_group_cputime_init() function initializes the thread_group_cputime structure for the given task. The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the out-of-line function thread_group_cputime_alloc_smp() to allocate and fill in the per-cpu structures and fields. The thread_group_cputime_free() function, also a no-op for UP, in SMP frees the per-cpu structures. The thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls thread_group_cputime_alloc() if the per-cpu structures haven't yet been allocated. The thread_group_cputime() function fills the task_cputime structure it is passed with the contents of the thread_group_cputime fields; in UP it's that simple but in SMP it must also safely check that tsk->signal is non-NULL (if it is it just uses the appropriate fields of task_struct) and, if so, sums the per-cpu values for each online CPU. Finally, the three functions account_group_user_time(), account_group_system_time() and account_group_exec_runtime() are used by timer functions to update the respective fields of the thread_group_cputime structure. Non-SMP operation is trivial and will not be mentioned further. The per-cpu structure is always allocated when a task creates its first new thread, via a call to thread_group_cputime_clone_thread() from copy_signal(). It is freed at process exit via a call to thread_group_cputime_free() from cleanup_signal(). All functions that formerly summed utime/stime/sum_sched_runtime values from from all threads in the thread group now use thread_group_cputime() to snapshot the values in the thread_group_cputime structure or the values in the task structure itself if the per-cpu structure hasn't been allocated. Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit. The run_posix_cpu_timers() function has been split into a fast path and a slow path; the former safely checks whether there are any expired thread timers and, if not, just returns, while the slow path does the heavy lifting. With the dedicated thread group fields, timers are no longer "rebalanced" and the process_timer_rebalance() function and related code has gone away. All summing loops are gone and all code that used them now uses the thread_group_cputime() inline. When process-wide timers are set, the new task_cputime structure in signal_struct is used to cache the earliest expiration; this is checked in the fast path. Performance The fix appears not to add significant overhead to existing operations. It generally performs the same as the current code except in two cases, one in which it performs slightly worse (Case 5 below) and one in which it performs very significantly better (Case 2 below). Overall it's a wash except in those two cases. I've since done somewhat more involved testing on a dual-core Opteron system. Case 1: With no itimer running, for a test with 100,000 threads, the fixed kernel took 1428.5 seconds, 513 seconds more than the unfixed system, all of which was spent in the system. There were twice as many voluntary context switches with the fix as without it. Case 2: With an itimer running at .01 second ticks and 4000 threads (the most an unmodified kernel can handle), the fixed kernel ran the test in eight percent of the time (5.8 seconds as opposed to 70 seconds) and had better tick accuracy (.012 seconds per tick as opposed to .023 seconds per tick). Case 3: A 4000-thread test with an initial timer tick of .01 second and an interval of 10,000 seconds (i.e. a timer that ticks only once) had very nearly the same performance in both cases: 6.3 seconds elapsed for the fixed kernel versus 5.5 seconds for the unfixed kernel. With fewer threads (eight in these tests), the Case 1 test ran in essentially the same time on both the modified and unmodified kernels (5.2 seconds versus 5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds versus 5.4 seconds but again with much better tick accuracy, .013 seconds per tick versus .025 seconds per tick for the unmodified kernel. Since the fix affected the rlimit code, I also tested soft and hard CPU limits. Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer running), the modified kernel was very slightly favored in that while it killed the process in 19.997 seconds of CPU time (5.002 seconds of wall time), only .003 seconds of that was system time, the rest was user time. The unmodified kernel killed the process in 20.001 seconds of CPU (5.014 seconds of wall time) of which .016 seconds was system time. Really, though, the results were too close to call. The results were essentially the same with no itimer running. Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds (where the hard limit would never be reached) and an itimer running, the modified kernel exhibited worse tick accuracy than the unmodified kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise, performance was almost indistinguishable. With no itimer running this test exhibited virtually identical behavior and times in both cases. In times past I did some limited performance testing. those results are below. On a four-cpu Opteron system without this fix, a sixteen-thread test executed in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On the same system with the fix, user and elapsed time were about the same, but system time dropped to 0.007 seconds. Performance with eight, four and one thread were comparable. Interestingly, the timer ticks with the fix seemed more accurate: The sixteen-thread test with the fix received 149543 ticks for 0.024 seconds per tick, while the same test without the fix received 58720 for 0.061 seconds per tick. Both cases were configured for an interval of 0.01 seconds. Again, the other tests were comparable. Each thread in this test computed the primes up to 25,000,000. I also did a test with a large number of threads, 100,000 threads, which is impossible without the fix. In this case each thread computed the primes only up to 10,000 (to make the runtime manageable). System time dominated, at 1546.968 seconds out of a total 2176.906 seconds (giving a user time of 629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite accurate. There is obviously no comparable test without the fix. Signed-off-by: Frank Mayhar <fmayhar@google.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-24posix-timers: print RT watchdog messageHiroshi Shimamoto1-0/+3
It's useful to detect which process is killed by RT watchdog. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-01remove div_long_long_remRoman Zippel1-6/+5
x86 is the only arch right now, which provides an optimized for div_long_long_rem and it has the downside that one has to be very careful that the divide doesn't overflow. The API is a little akward, as the arguments for the unsigned divide are signed. The signed version also doesn't handle a negative divisor and produces worse code on 64bit archs. There is little incentive to keep this API alive, so this converts the few users to the new API. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-17posix-timers: fix shadowed variablesWANG Cong1-15/+15
Fix sparse warnings like this: kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one kernel/posix-cpu-timers.c:1058:21: originally declared here Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-08Use find_task_by_vpid in posix timersPavel Emelyanov1-4/+4
All the functions that need to lookup a task by pid in posix timers obtain this pid from a user space, and thus this value refers to a task in the same namespace, as the current task lives in. So the proper behavior is to call find_task_by_vpid() here. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-25sched: rt-watchdog: fix .rlim_max = RLIM_INFINITYPeter Zijlstra1-1/+2
Remove the curious logic to set it_sched_expires in the future. It useless because rt.timeout wouldn't be incremented anyway. Explicity check for RLIM_INFINITY as a test programm that had a 1s soft limit and a inf hard limit would SIGKILL at 1s. This is because RLIM_INFINITY+d-1 is d-2. Signed-off-by: Peter Zijlsta <a.p.zijlstra@chello.nl> CC: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25sched: SCHED_FIFO/SCHED_RR watchdog timerPeter Zijlstra1-0/+29
Introduce a new rlimit that allows the user to set a runtime timeout on real-time tasks their slice. Once this limit is exceeded the task will receive SIGXCPU. So it measures runtime since the last sleep. Input and ideas by Thomas Gleixner and Lennart Poettering. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Lennart Poettering <mzxreary@0pointer.de> CC: Michael Kerrisk <mtk.manpages@googlemail.com> CC: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-19Isolate some explicit usage of task->tgidPavel Emelyanov1-6/+6
With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-09sched: make posix-cpu-timers use CFS's accounting informationIngo Molnar1-17/+17
update the posix-cpu-timers code to use CFS's CPU accounting information. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-05-08Introduce a handy list_first_entry macroPavel Emelianov1-7/+7
There are many places in the kernel where the construction like foo = list_entry(head->next, struct foo_struct, list); are used. The code might look more descriptive and neat if using the macro list_first_entry(head, type, member) \ list_entry((head)->next, type, member) Here is the macro itself and the examples of its usage in the generic code. If it will turn out to be useful, I can prepare the set of patches to inject in into arch-specific code, drivers, networking, etc. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16[PATCH] posix timers: RCU optimization for clock_gettime()Paul E. McKenney1-5/+10
Use RCU to avoid the need to acquire tasklist_lock in the single-threaded case of clock_gettime(). It still acquires tasklist_lock when for a (potentially multithreaded) process. This change allows realtime applications to frequently monitor CPU consumption of individual tasks, as requested (and now deployed) by some off-list users. This has been in Ingo Molnar's -rt patchset since late 2005 with no problems reported, and tests successfully on 2.6.20-rc6, so I believe that it is long-since ready for mainline adoption. [paulmck@linux.vnet.ibm.com: fix exit()/posix_cpu_clock_get() race spotted by Oleg] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-10-17[PATCH] posix-cpu-timers: prevent signal delivery starvationThomas Gleixner1-6/+21
The integer divisions in the timer accounting code can round the result down to 0. Adding 0 is without effect and the signal delivery stops. Clamp the division result to minimum 1 to avoid this. Problem was reported by Seongbae Park <spark@google.com>, who provided also an inital patch. Roland sayeth: I have had some more time to think about the problem, and to reproduce it using Toyo's test case. For the record, if my understanding of the problem is correct, this happens only in one very particular case. First, the expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks) it's < nthreads so the division yields zero. Second, it only affects each thread that is so new that its CPU time accumulation is zero so now+0 is still zero and ->it_*_expires winds up staying zero. For the VIRT and PROF clocks when cputime_t is tick granularity (or the SCHED clock on configurations where sched_clock's value only advances on clock ticks), this is not hard to arrange with new threads starting up and blocking before they accumulate a whole tick of CPU time. That's what happens in Toyo's test case. Note that in general it is fine for that division to round down to zero, and set each thread's expiry time to its "now" time. The problem only arises with thread's whose "now" value is still zero, so that now+0 winds up 0 and is interpreted as "not set" instead of ">= now". So it would be a sufficient and more precise fix to just use max(ticks, 1) inside the loop when setting each it_*_expires value. But, it does no harm to round the division up to one and always advance every thread's expiry time. If the thread didn't already fire timers for the expiry time of "now", there is no expectation that it will do so before the next tick anyway. So I followed Thomas's patch in lifting the max out of the loops. This patch also covers the reload cases, which are harder to write a test for (and I didn't try). I've tested it with Toyo's case and it fixes that. [toyoa@mvista.com: fix: min_t -> max_t] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Daniel Walker <dwalker@mvista.com> Cc: Toyo Abe <toyoa@mvista.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Seongbae Park <spark@google.com> Cc: Peter Mattis <pmattis@google.com> Cc: Rohit Seth <rohitseth@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] posix-timers: Fix the flags handling in posix_cpu_nsleep()Toyo Abe1-26/+58
When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it incorrectly reports the sleep time remaining to the user. Because posix_cpu_nsleep() doesn't report back to the user when it's called from restart function due to the wrong flags handling. This patch, which applies after previous one, moves the nanosleep() function from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling appropriately. Signed-off-by: Toyo Abe <toyoa@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] posix-timers: Fix clock_nanosleep() doesn't return the remaining time in compatibility modeToyo Abe1-5/+12
The clock_nanosleep() function does not return the time remaining when the sleep is interrupted by a signal. This patch creates a new call out, compat_clock_nanosleep_restart(), which handles returning the remaining time after a sleep is interrupted. This patch revives clock_nanosleep_restart(). It is now accessed via the new call out. The compat_clock_nanosleep_restart() is used for compatibility access. Since this is implemented in compatibility mode the normal path is virtually unaffected - no real performance impact. Signed-off-by: Toyo Abe <toyoa@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] arm_timer: remove a racy and obsolete PF_EXITING checkOleg Nesterov1-3/+0
arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state) in run_posix_cpu_timers(). However, for some reason it does so only for CPUCLOCK_PERTHREAD case (which is imho wrong). Also, this check is not reliable, PF_EXITING could be set on another cpu without any locks/barriers just after the check, so it can't prevent from attaching the timer to the exiting task. The previous patch makes this check unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] run_posix_cpu_timers: remove a bogus BUG_ON()Oleg Nesterov1-18/+18
do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu to attach the timer to exiting process after that. arm_timer() tries to protect against this race, but the check is racy. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule() local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) interrupted task has ->signal == NULL. At this moment exiting task has no pending cpu timers, they were cleanuped in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from irq. John Stultz recently confirmed this bug, see http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] check_process_timers: fix possible lockupOleg Nesterov1-5/+4
If the local timer interrupt happens just after do_exit() sets PF_EXITING (and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call check_process_timers() with tasklist_lock + ->siglock held and check_process_timers: t = tsk; do { .... do { t = next_thread(t); } while (unlikely(t->flags & PF_EXITING)); } while (t != tsk); the outer loop will never stop. Actually, the window is bigger. Another process can attach the timer after ->it_xxx_expires was cleared (see the next commit) and the 'if (PF_EXITING)' check in arm_timer() is racy (see the one after that). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: switch clock_nanosleep to hrtimer nanosleep APIThomas Gleixner1-9/+14
Switch clock_nanosleep to use the new nanosleep functions in hrtimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: make clockid_t arguments constThomas Gleixner1-18/+22
add const arguments to the posix-timers.h API functions Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Fix posix-cpu-timers sched_time accumulationDavid S. Miller1-12/+1
I've spent the past 3 days digging into a glibc testsuite failure in current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2 timers fire too early in the second pass of this test. The second pass is noteworthy because it makes use of intervals, whereas the first pass does not. All throughout the posix-cpu-timers.c code, the calculation of the process sched_time sum is implemented roughly as: unsigned long long sum; sum = tsk->signal->sched_time; t = tsk; do { sum += t->sched_time; t = next_thread(t); } while (t != tsk); In fact this is the exact scheme used by check_process_timers(). In the case of check_process_timers(), current->sched_time has just been updated (via scheduler_tick(), which is invoked by update_process_times(), which subsequently invokes run_posix_cpu_timers()) So there is no special processing necessary wrt. that. In other contexts, we have to allot for the fact that tsk->sched_time might be a bit out of date if we are current. And the posix-cpu-timers.c code uses current_sched_time() to deal with that. Unfortunately it does so in an erroneous and inconsistent manner in one spot which is what results in the early timer firing. In cpu_clock_sample_group_locked(), it does this: cpu->sched = p->signal->sched_time; /* Add in each other live thread. */ while ((t = next_thread(t)) != p) { cpu->sched += t->sched_time; } if (p->tgid == current->tgid) { /* * We're sampling ourselves, so include the * cycles not yet banked. We still omit * other threads running on other CPUs, * so the total can always be behind as * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ). */ cpu->sched += current_sched_time(current); } else { cpu->sched += p->sched_time; } The problem is the "p->tgid == current->tgid" test. If "p" is not current, and the tgids are the same, we will add the process t->sched_time twice into cpu->sched and omit "p"'s sched_time which is very very very wrong. posix-cpu-timers.c has a helper function, sched_ns(p) which takes care of this, so my fix is to use that here instead of this special tgid test. The fact that current can be one of the sub-threads of "p" points out that we could make things a little bit more accurate, perhaps by using sched_ns() on every thread we process in these loops. It also points out that we don't use the most accurate value for threads in the group actively running other cpus (and this is mentioned in the comment). But that is a future enhancement, and this fix here definitely makes sense. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] fix 32bit overflow in timespec_to_sample()Oleg Nesterov1-1/+1
fix 32bit overflow in timespec_to_sample() Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] posix-timers `unlikely' rejigAndrew Morton1-3/+3
!unlikely(expr) hurts my brain. likely(!expr) is more straightforward. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] posix-cpu-timers: fix overrun reportingRoland McGrath1-6/+10
This change corrects an omission in posix_cpu_timer_schedule, so that it correctly propagates the overrun calculation to where it will get reported to the user. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27[PATCH] Yet more posix-cpu-timer fixesRoland McGrath1-4/+7
This just makes sure that a thread's expiry times can't get reset after it clears them in do_exit. This is what allowed us to re-introduce the stricter BUG_ON() check in a362f463a6d316d14daed0f817e151835ce97ff7. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27Revert "remove false BUG_ON() from run_posix_cpu_timers()"Linus Torvalds1-18/+18
This reverts commit 3de463c7d9d58f8cf3395268230cb20a4c15bffa. Roland has another patch that allows us to leave the BUG_ON() in place by just making sure that the condition it tests for really is always true. That goes in next. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26[PATCH] Fix cpu timers expiration timeOleg Nesterov1-3/+3
There's a silly off-by-one error in the code that updates the expiration of posix CPU timers, causing them to not be properly updated when they hit exactly on their expiration time (which should be the normal case). This causes them to then fire immediately again, and only _then_ get properly updated. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26posix cpu timers: fix timer orderingLinus Torvalds1-6/+4
Pointed out by Oleg Nesterov, who has been walking over the code forwards and backwards. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: fix posix_cpu_timer_set() vs run_posix_cpu_timers() raceOleg Nesterov1-4/+8
This might be harmless, but looks like a race from code inspection (I was unable to trigger it). I must admit, I don't understand why we can't return TIMER_RETRY after 'spin_unlock(&p->sighand->siglock)' without doing bump_cpu_timer(), but this is what original code does. posix_cpu_timer_set: read_lock(&tasklist_lock); spin_lock(&p->sighand->siglock); list_del_init(&timer->it.cpu.entry); spin_unlock(&p->sighand->siglock); We are probaly deleting the timer from run_posix_cpu_timers's 'firing' local list_head while run_posix_cpu_timers() does list_for_each_safe. Various bad things can happen, for example we can just delete this timer so that list_for_each() will not notice it and run_posix_cpu_timers() will not reset '->firing' flag. In that case, .... if (timer->it.cpu.firing) { read_unlock(&tasklist_lock); timer->it.cpu.firing = -1; return TIMER_RETRY; } sys_timer_settime() goes to 'retry:', calls posix_cpu_timer_set() again, it returns TIMER_RETRY ... Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: exit path cleanupOleg Nesterov1-0/+6
No need to rebalance when task exited Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: remove false BUG_ON() from run_posix_cpu_timers()Oleg Nesterov1-18/+18
do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu to attach the timer to exiting process after that. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule() local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) interrupted task has ->signal == NULL. At this moment exiting task has no pending cpu timers, they were cleaned up in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: fix cleanup_timers() and run_posix_cpu_timers() racesOleg Nesterov1-19/+10
1. cleanup_timers() sets timer->task = NULL under tasklist + ->sighand locks. That means that this code in posix_cpu_timer_del() and posix_cpu_timer_set() lock_timer(timer); if (timer->task == NULL) return; read_lock(tasklist); put_task_struct(timer->task) is racy. With this patch timer->task modified and accounted only under timer->it_lock. Sadly, this means that dead task_struct won't be freed until timer deleted or armed. 2. run_posix_cpu_timers() collects expired timers into local list under tasklist + ->sighand again. That means that posix_cpu_timer_del() should check timer->it.cpu.firing under these locks too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-23Posix timers: limit number of timers firing at onceLinus Torvalds1-6/+14
Bursty timers aren't good for anybody, very much including latency for other programs when we trigger lots of timers in interrupt context. So set a random limit, after which we'll handle the rest on the next timer tick. Noted by Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-21Revert "Fix cpu timers exit deadlock and races"Linus Torvalds1-11/+17
Revert commit e03d13e985d48ac4885382c9e3b1510c78bd047f, to be replaced by a much nicer fix from Roland.
2005-10-19[PATCH] Fix cpu timers exit deadlock and racesRoland McGrath1-17/+11
Oleg Nesterov reported an SMP deadlock. If there is a running timer tracking a different process's CPU time clock when the process owning the timer exits, we deadlock on tasklist_lock in posix_cpu_timer_del via exit_itimers. That code was using tasklist_lock to check for a race with __exit_signal being called on the timer-target task and clearing its ->signal. However, there is actually no such race. __exit_signal will have called posix_cpu_timers_exit and posix_cpu_timers_exit_group before it does that. Those will clear those k_itimer's association with the dying task, so posix_cpu_timer_del will return early and never reach the code in question. In addition, posix_cpu_timer_del called from exit_itimers during execve or directly from timer_delete in the process owning the timer can race with an exiting timer-target task to cause a double put on timer-target task struct. Make sure we always access cpu_timers lists with sighand lock held. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17[PATCH] posix-timers: fix task accountingOleg Nesterov1-0/+3
Make sure we release the task struct properly when releasing pending timers. release_task() does write_lock_irq(&tasklist_lock), so it can't race with run_posix_cpu_timers() on any cpu. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2Linus Torvalds1-0/+1559
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!