path: root/sys/kern/kern_timeout.c
Commit message (Author, Age, Files, Lines)
* Simplify sleep_setup API to two operations in preparation for splitting (mpi, 2021-02-08, 1 file, -2/+2)

  the SCHED_LOCK().  Putting a thread on a sleep queue is reduced to the
  following:

	sleep_setup();
	/* check condition or release lock */
	sleep_finish();

  Previous version ok cheloha@, jmatthew@, ok claudio@
* timeout(9): fix compilation under NKCOV (cheloha, 2020-10-20, 1 file, -2/+2)
* timeout(9): basic support for kclock timeouts (cheloha, 2020-10-15, 1 file, -60/+338)

  A kclock timeout is a timeout that expires at an absolute time on one
  of the kernel's clocks.  A timeout's absolute expiration time is kept
  in a new member of the timeout struct, to_abstime.  The timeout's
  kclock is set at initialization and is kept in another new member of
  the timeout struct, to_kclock.

  Kclock timeouts are desirable because they have nanosecond
  resolution, regardless of the value of hz(9).  The timecounter
  subsystem is also inherently NTP-sensitive, so timeouts scheduled
  against the subsystem are NTP-sensitive.  These two qualities
  guarantee that a kclock timeout will never expire early.

  Currently there is support for one kclock, KCLOCK_UPTIME (the uptime
  clock).  Support for KCLOCK_RUNTIME (the runtime clock) and
  KCLOCK_UTC (the UTC clock) is planned for the future.  Support for
  these additional kclocks will allow us to implement some of the POSIX
  interfaces OpenBSD is missing, e.g. clock_nanosleep() and
  timer_create().  We could also use it to provide proper absolute
  timeouts for e.g. pthread_mutex_timedlock(3).

  Kclock timeouts are initialized with timeout_set_kclock().  They can
  be scheduled with either timeout_in_nsec() (relative timeout) or
  timeout_at_ts() (absolute timeout).  They are incompatible with
  timeout_add(9), timeout_add_sec(9), timeout_add_msec(9),
  timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9).
  They can be cancelled with timeout_del(9) or timeout_del_barrier(9).

  Documentation for the new interfaces is a work in progress.

  For now, tick-based timeouts remain supported alongside kclock
  timeouts.  They will remain supported until we are certain we don't
  need them anymore.  It is possible we will never remove them.  I
  would rather not keep them around forever, but I cannot predict what
  difficulties we will encounter while converting tick-based timeouts
  to kclock timeouts.  There are a *lot* of timeouts in the kernel.

  Kclock timeouts are more costly than tick-based timeouts:

  - Calling timeout_in_nsec() incurs a call to nanouptime(9).  Reading
    the hardware timecounter is too expensive in some contexts, so care
    must be taken when converting existing timeouts.  We may add a flag
    in the future to cause timeout_in_nsec() to use getnanouptime(9)
    instead of nanouptime(9), which is much cheaper.  This may be
    appropriate for certain classes of timeouts.  tcp/ip session
    timeouts come to mind.

  - Kclock timeout expirations are kept in a timespec.  Timespec
    arithmetic has more overhead than 32-bit tick arithmetic, so
    processing kclock timeouts during softclock() is more expensive.
    On my machine the overhead for processing a tick-based timeout is
    ~125 cycles.  The overhead for a kclock timeout is ~500 cycles.
    The overhead difference on 32-bit platforms is unknown.  If it
    proves too large we may need to use a 64-bit value to store the
    expiration time.  More measurement is needed.

  Priority targets for conversion are setitimer(2), *sleep_nsec(9),
  and the kevent(2) EVFILT_TIMER timers.  Others will follow.

  With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@,
  deraadt@, probably many others.  Older version tested by visa@.
  Problems found in older version by bluhm@.  Current version tested
  by Yuichiro Naito.

  "wait until after unlock" deraadt@, ok kettenis@
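  A minimal sketch of how the interfaces named above fit together.  The
  argument order of timeout_set_kclock() is an assumption (the manpages
  were still a work in progress at this point), and mydev_* names are
  hypothetical:

	void	mydev_tick(void *);		/* hypothetical handler */

	struct mydev_softc {
		struct timeout	sc_tmo;
		/* ... */
	};

	void
	mydev_start(struct mydev_softc *sc)
	{
		/* Bind the timeout to the uptime clock at initialization. */
		timeout_set_kclock(&sc->sc_tmo, mydev_tick, sc, KCLOCK_UPTIME, 0);

		/* Relative scheduling with nanosecond resolution: ~150us out. */
		timeout_in_nsec(&sc->sc_tmo, 150000);

		/* timeout_add_*(9) must not be mixed in; timeout_del(9) still works. */
	}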
* timeout(9): timeout_run(): read to_process before leaving timeout_mutex (cheloha, 2020-09-22, 1 file, -2/+4)

  to_process is assigned during timeout_add(9) within timeout_mutex.
  In timeout_run() we need to read to_process before leaving
  timeout_mutex to ensure that the process pointer given to
  kcov_remote_enter(9) is the same as the one we set from
  timeout_add(9) when the candidate timeout was originally scheduled
  to run.
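  A sketch of the ordering being described, not the literal diff; the
  local names and the kcov subsystem constant are illustrative:

	struct process *kcov_process;

	mtx_enter(&timeout_mutex);
	/* ... dequeue the timeout and fetch its function/argument ... */
	kcov_process = to->to_process;	/* read while the mutex is still held */
	mtx_leave(&timeout_mutex);

	kcov_remote_enter(KCOV_REMOTE_COMMON, kcov_process);
	(*fn)(arg);
	kcov_remote_leave(KCOV_REMOTE_COMMON, kcov_process);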
* timeout(9): remove unused interfaces: timeout_add_ts(9), timeout_add_bt(9) (cheloha, 2020-08-07, 1 file, -30/+1)

  These two interfaces have been entirely unused since introduction.
  Remove them and thin the "timeout" namespace a bit.

  Discussed with mpi@ and ratchov@ almost a year ago, though I blocked
  the change at that time.  Also discussed with visa@.

  ok visa@, mpi@
* timeout(9): fix miscellaneous remote kcov(4) bugs (cheloha, 2020-08-06, 1 file, -3/+6)

  Commit v1.77 introduced remote kcov support for timeouts.  We need to
  tweak a few things to make our support more correct:

  - Set to_process for barrier timeouts to the calling thread's parent
    process.  Currently it is uninitialized, so during timeout_run() we
    are passing stack garbage to kcov_remote_enter(9).

  - Set to_process to NULL during timeout_set_flags(9).  If in the
    future we forget to properly initialize to_process before reaching
    timeout_run(), we'll pass NULL to kcov_remote_enter(9).  anton@
    says this is harmless.  I assume it is also preferable to passing
    stack garbage.

  - Save a copy of to_process on the stack in timeout_run() before
    calling to_func to ensure that we pass the same process pointer to
    kcov_remote_leave(9) upon return.  The timeout may be freely
    modified from to_func, so to_process may have changed when we
    return.

  Tested by anton@.

  ok anton@
* Add support for remote coverage to kcov.  Remote coverage is collected (anton, 2020-08-01, 1 file, -1/+15)

  from threads other than the one currently having kcov enabled.  A
  thread with kcov enabled occasionally delegates work to another
  thread, collecting coverage from such threads improves the ability of
  syzkaller to correlate side effects in the kernel caused by issuing a
  syscall.

  Remote coverage is divided into subsystems.  The only supported
  subsystem right now collects coverage from scheduled tasks and
  timeouts on behalf of a kcov enabled thread.  In order to make this
  work `struct task' and `struct timeout' must be extended with a new
  field keeping track of the process that scheduled the task/timeout.
  Both aforementioned structures have therefore increased with the size
  of a pointer on all architectures.

  The kernel API is documented in a new kcov_remote_register(9) manual.

  Remote coverage is also supported by kcov on NetBSD and Linux.

  ok mpi@
* timeout(9): remove TIMEOUT_SCHEDULED flag (cheloha, 2020-07-25, 1 file, -11/+16)

  The TIMEOUT_SCHEDULED flag was added a few months ago to
  differentiate between wheel timeouts and new timeouts during
  softclock().  The distinction is useful when incrementing the
  "rescheduled" stat and the "late" stat.

  Now that we have an intermediate queue for new timeouts, timeout_new,
  we don't need the flag.  The distinction between wheel timeouts and
  new timeouts can be made computationally.

  Suggested by procter@ several months ago.
* timeout(9): delay processing of timeouts added during softclock() (cheloha, 2020-07-24, 1 file, -6/+14)

  New timeouts are appended to the timeout_todo circq via
  timeout_add(9).  If this is done during softclock(), i.e. a timeout
  function calls timeout_add(9) to reschedule itself, the newly added
  timeout will be processed later during the same softclock().

  This works, but it is not optimal:

  1. If a timeout reschedules itself to run in zero ticks, i.e.

	timeout_add(..., 0);

     it will be run again during the current softclock().  This can
     cause an infinite loop, softlocking the primary CPU.

  2. Many timeouts are cancelled before they execute.  Processing a
     timeout during the current softclock() is "eager": if we waited,
     the timeout might be cancelled and we could spare ourselves the
     effort.  If the timeout is not cancelled before the next
     softclock() we can bucket it as we normally would with no change
     in behavior.

  3. Many timeouts are scheduled to run after 1 tick, i.e.

	timeout_add(..., 1);

     Processing these timeouts during the same softclock means
     bucketing them for no reason: they will be dumped into the
     timeout_todo queue during the next hardclock(9) anyway.
     Processing them is pointless.

  We can avoid these issues by using an intermediate queue,
  timeout_new.  New timeouts are put onto this queue during
  timeout_add(9).  The queue is concatenated to the end of the
  timeout_todo queue at the start of each softclock() and then
  softclock() proceeds.  This means the amount of work done during a
  given softclock() is finite and we avoid doing extra work with eager
  processing.

  Any timeouts that *depend* upon being rerun during the current
  softclock() will need to be updated, though I doubt any such
  timeouts exist.

  Discussed with visa@ last year.  No complaints after a month.
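  Roughly the softclock() shape this produces, in simplified sketch
  form; the helper names are assumptions, not the committed code:

	void
	softclock(void *arg)
	{
		struct timeout *to;

		mtx_enter(&timeout_mutex);
		/* Pick up everything added via timeout_add(9) since the last run... */
		CIRCQ_CONCAT(&timeout_todo, &timeout_new);
		/* ...and process only that finite amount of work. */
		while (!CIRCQ_EMPTY(&timeout_todo)) {
			to = timeout_from_circq(CIRCQ_FIRST(&timeout_todo));
			CIRCQ_REMOVE(&to->to_list);
			/* bucket the timeout or run it, as before */
		}
		mtx_leave(&timeout_mutex);
	}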
* Make timeout_add_sec(9) add a tick if given zero seconds (kn, 2020-07-24, 1 file, -1/+3)

  All other timeout_add_*() functions do so before calling
  timeout_add(9) as described in the manual, this one did not.

  OK cheloha
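  Approximately what the wrapper looks like with the fix; the overflow
  clamp mirrors the other timeout_add_*() wrappers and may not match
  the committed code line for line:

	int
	timeout_add_sec(struct timeout *to, int secs)
	{
		uint64_t to_ticks;

		to_ticks = (uint64_t)hz * secs;
		if (to_ticks > INT_MAX)
			to_ticks = INT_MAX;
		if (to_ticks == 0)	/* the fix: defer by at least one tick */
			to_ticks = 1;

		return timeout_add(to, (int)to_ticks);
	}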
* It's been agreed upon that global locks should be expressed using (anton, 2020-07-04, 1 file, -6/+6)

  capital letters in locking annotations.  Therefore harmonize the
  existing annotations.  Also, if multiple locks are required they
  should be delimited using commas.

  ok mpi@
* Cleanup <sys/kthread.h> and <sys/proc.h> includes. (mpi, 2020-02-18, 1 file, -1/+2)

  Do not include <sys/kthread.h> where it is not needed and stop
  including <sys/proc.h> in it.

  ok visa@, anton@
* Raise ipl of the softclock thread to IPL_SOFTCLOCK. (mpi, 2020-01-13, 1 file, -1/+4)

  This prevents the soft interrupt from running in between timeouts
  executed in a thread context.

  ok kettenis@, visa@
* timeout(9): delay thread wakeup(9) decision to end of softclock() loop (cheloha, 2020-01-03, 1 file, -4/+3)

  The process-context timeout(s) in question might be cancelled before
  we leave the loop, leading to a spurious wakeup(9).

  ok mpi@
* timeout(9): Add timeout_set_flags(9) and TIMEOUT_INITIALIZER_FLAGS(9) (cheloha, 2020-01-03, 1 file, -6/+11)

  These allow the caller to initialize timeouts with arbitrary flags.
  We only have one flag at the moment, TIMEOUT_PROC, but experimenting
  with other flags is easier if these interfaces are available in-tree.

  With input from bluhm@, guenther@, and visa@.

  "makes sense to me" bluhm@, ok visa@
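  A sketch of both initializers, assuming the argument order mirrors
  timeout_set(9); mydev_tick and sc are hypothetical:

	void	mydev_tick(void *);		/* hypothetical handler */

	/* Static initialization of a file-scope timeout: */
	struct timeout mydev_tmo =
	    TIMEOUT_INITIALIZER_FLAGS(mydev_tick, NULL, TIMEOUT_PROC);

	void
	mydev_init(struct mydev_softc *sc)
	{
		/* Dynamic initialization: timeout_set(9) plus explicit flags. */
		timeout_set_flags(&sc->sc_tmo, mydev_tick, sc, TIMEOUT_PROC);
	}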
* timeout(9): Rename the TIMEOUT_NEEDPROCCTX flag to TIMEOUT_PROC. (cheloha, 2020-01-03, 1 file, -6/+6)

  This makes it more likely to fit into 80 columns if used alongside
  the forthcoming timeout_set_flags() and TIMEOUT_INITIALIZER_FLAGS()
  interfaces.

  "makes sense to me" bluhm@, ok visa@
* timeout(9): new flag: TIMEOUT_SCHEDULED, new statistic: tos_scheduled (cheloha, 2019-12-25, 1 file, -6/+10)

  This flag is set whenever a timeout is put on the wheel and cleared
  upon (a) running, (b) deletion, and (c) readdition.  It serves two
  purposes:

  1. Facilitate distinguishing scheduled and rescheduled timeouts.
     When a timeout is put on the wheel it is "scheduled" for a later
     softclock().  If this happens two or more times it is also said
     to be "rescheduled".  The tos_rescheduled value thus indicates how
     many distant timeouts have been cascaded into a lower wheel level.

  2. Eliminate false late timeouts.  A timeout is not late if it is due
     before softclock() has had a chance to schedule it.  To track this
     we need additional state, hence a new flag.

  rprocter@ raises some interesting questions.  Some answers:

  - This interface is not stable and name changes are possible at a
    later date.

  - Although rescheduling timeouts is a side effect of the underlying
    implementation, I don't foresee us using anything but a timeout
    wheel in the future.  Other data structures are too slow in
    practice, so I doubt that the concept of a rescheduled timeout will
    be irrelevant any time soon.

  - I think the development utility of gathering these sorts of
    statistics is high.  Watching the distribution of timeouts under a
    given workflow is informative.

  ok visa@
* Recommit "timeout(9): make CIRCQ look more like other sys/queue.h data structures" (cheloha, 2019-12-12, 1 file, -29/+34)

  Backed out during revert of "timeout(9): switch to tickless backend".
  Original commit message:

  - CIRCQ_APPEND -> CIRCQ_CONCAT
  - Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
  - CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
  - Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
  - While here, use tabs for indentation like we do with other macros

  ok visa@ mpi@
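  An illustrative before/after of the renamed macros; the queue heads
  (dst, src, b) and the helper names are placeholders, not the
  committed code:

	struct circq *p;

	/* CIRCQ_APPEND(&dst, &src) is now spelled like TAILQ_CONCAT: */
	CIRCQ_CONCAT(&dst, &src);

	/* CIRCQ_INSERT(&elm, &head) becomes CIRCQ_INSERT_TAIL(&head, &elm): */
	CIRCQ_INSERT_TAIL(&timeout_todo, &to->to_list);

	/* New iteration macro, as used by ddb(4) to print a bucket: */
	CIRCQ_FOREACH(p, &timeout_wheel[b])
		db_printf("%p\n", timeout_from_circq(p));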
* Revert "timeout(9): switch to tickless backend" (cheloha, 2019-12-02, 1 file, -210/+135)

  It appears to have caused major performance regressions all over the
  network stack.

  Reported by bluhm@

  ok deraadt@
* timeout(9): make CIRCQ look more like other sys/queue.h data structures (cheloha, 2019-11-29, 1 file, -28/+33)

  - CIRCQ_APPEND -> CIRCQ_CONCAT
  - Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
  - CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
  - Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
  - While here, use tabs for indentation like we do with other macros

  ok visa@
* timeout(9): switch to tickless backend (cheloha, 2019-11-26, 1 file, -112/+182)

  Rebase the timeout wheel on the system uptime clock.  Timeouts are
  now set to run at or after an absolute time as returned by
  nanouptime(9).  Timeouts are thus "tickless": they expire at a real
  time on that clock instead of at a particular value of the global
  "ticks" variable.

  To facilitate this change the timeout struct's .to_time member
  becomes a timespec.  Hashing timeouts into a bucket on the wheel
  changes slightly: we build a 32-bit hash with 25 bits of seconds
  (.tv_sec) and 7 bits of subseconds (.tv_nsec).  7 bits of subseconds
  means the width of the lowest wheel level is now 2 seconds on all
  platforms and each bucket in that lowest level corresponds to 1/128
  seconds on the uptime clock.  These values were chosen to closely
  align with the current 100hz hardclock(9) typical on almost all of
  our platforms.  At 100hz a bucket is currently ~1/100 seconds wide on
  the lowest level and the lowest level itself is ~2.56 seconds wide.
  Not a huge change, but a change nonetheless.

  Because a bucket no longer corresponds to a single tick more than one
  bucket may be dumped during an average timeout_hardclock_update()
  call.  On 100hz platforms you now dump ~2 buckets.  On 64hz machines
  (sh) you dump ~4 buckets.  On 1024hz machines (alpha) you dump only 1
  bucket, but you are doing extra work in softclock() to reschedule
  timeouts that aren't due yet.

  To avoid changing current behavior all timeout_add*(9) interfaces
  convert their timeout interval into ticks, compute an equivalent
  timespec interval, and then add that interval to the timestamp of the
  most recent timeout_hardclock_update() call to determine an absolute
  deadline.  So all current timeouts still "use" ticks, but the ticks
  are faked in the timeout layer.

  A new interface, timeout_at_ts(9), is introduced here to bypass this
  backwardly compatible behavior.  It will be used in subsequent diffs
  to add absolute timeout support for userland and to clean up some of
  the messier parts of kernel timekeeping, especially at the syscall
  layer.

  Because timeouts are based against the uptime clock they are subject
  to NTP adjustment via adjtime(2) and adjfreq(2).  Unless you have a
  crazy adjfreq(2) adjustment set this will not change the expiration
  behavior of your timeouts.

  Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
  Additional amd64 testing from anton@ and visa@.  Octeon testing from
  visa@.  macppc testing from me.

  Positive feedback from deraadt@, ok visa@
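  A sketch of the 25+7-bit hash described above; the constant is
  10^9/128 nanoseconds per lowest-level bucket, and the helper name is
  illustrative rather than the committed one:

	uint32_t
	timeout_wheel_hash(const struct timespec *abstime)
	{
		uint32_t hi, lo;

		hi = abstime->tv_sec << 7;		/* 25 bits of seconds */
		lo = abstime->tv_nsec / 7812500;	/* 7 bits: 1/128 s per bucket */

		/* Masked and shifted per wheel level, as ticks were before. */
		return (hi | lo);
	}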
* db_addr_t -> vaddr_t, missed in previous. (mpi, 2019-11-07, 1 file, -2/+2)

  ok deraadt@
* kern_timeout.c: style(9), misc. cleanup (cheloha, 2019-11-03, 1 file, -49/+45)

  - Move mutex to top of file, annotate locking for module
  - Group module-local prototypes below globals but above function defs
  - __inline -> inline
  - No static without inline
  - Drop extra parentheses around return values

  Compiler input from visa@.

  ok visa@
* softclock: move softintr registration/scheduling into timeout module (cheloha, 2019-11-02, 1 file, -7/+22)

  softclock() is scheduled from hardclock(9) because long ago callouts
  were processed from hardclock(9) directly.  The introduction of
  timeout(9) circa 2000 moved all callout processing into a dedicated
  module, but the softclock scheduling stayed behind in hardclock(9).

  We can move all the softclock() "stuff" into the timeout module to
  make kern_clock.c a bit cleaner.  Neither initclocks() nor
  hardclock(9) need to "know" about softclock().  The initial
  softclock() softintr registration can be done from
  timeout_proc_init() and softclock() can be scheduled from
  timeout_hardclock_update().

  ok visa@
* timeout(9): process-context timeouts can be late (cheloha, 2019-09-20, 1 file, -11/+9)

  Move the check for lateness earlier in the softclock() loop so every
  timeout is checked before being run.

  While here, remove an unsafe DEBUG printf(9).  You can't safely
  printf(9) within a mutex, and the print itself isn't even
  particularly useful.

  ok bluhm@
* timeout(9): use CLR/ISSET/SET consistently (cheloha, 2019-09-20, 1 file, -18/+15)

  While here in timeout_add(9), use KASSERT for brevity.

  CLR/ISSET/SET bits ok krw@
* ddb(4): clean up callout command (cheloha, 2019-07-19, 1 file, -6/+19)

  - display timeouts in the thread work queue, if any
  - identify timeouts in the thread/softint work queue as such
  - if not in work queue, print <bucket>/<level>; easier to right-align
  - print arg pointer by hand to ensure consistent length for all
    pointers on both 32 and 64-bit platforms
  - generally make sure columns are correctly aligned and spaced

  ok mpi@ visa@
* sysctl(2): add KERN_TIMEOUT_STATS: timeout(9) status and statistics. (cheloha, 2019-07-12, 1 file, -2/+30)

  With these totals one can track the throughput of the timeout(9)
  layer from userspace.

  With input from mpi@.

  ok mpi@
* Remove file name and line number output from witness(4) (visa, 2019-04-23, 1 file, -5/+4)

  Reduce code clutter by removing the file name and line number output
  from witness(4).  Typically it is easy enough to locate offending
  locks using the stack traces that are shown in lock order conflict
  reports.  Tricky cases can be tracked using sysctl
  kern.witness.locktrace=1 .

  This patch additionally removes the witness(4) wrapper for mutexes.
  Now each mutex implementation has to invoke the WITNESS_*() macros in
  order to utilize the checker.

  Discussed with and OK dlg@, OK mpi@
* Add lock order checking for timeouts (visa, 2019-04-14, 1 file, -2/+69)

  The caller of timeout_barrier() must not hold locks that could
  prevent timeout handlers from making progress.  The system could
  deadlock otherwise.  This patch makes witness(4) able to detect
  barrier locking errors.  This is done by introducing a pseudo-lock
  that couples the lock chains of barrier callers to the lock chains of
  timeout handlers.

  In order to find these errors faster, this diff adds a synchronous
  version of cancelling timeouts, timeout_del_barrier(9).  As the
  synchronous intent is explicit, this interface can check lock order
  immediately instead of waiting for the potentially rare occurrence of
  timeout_barrier(9).

  OK dlg@ mpi@
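  The synchronous variant in sketch form; "sc" is a hypothetical driver
  softc, not code from this commit:

	/* After this returns the handler is neither pending nor running. */
	timeout_del_barrier(&sc->sc_tmo);
	free(sc, M_DEVBUF, sizeof(*sc));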
* i forgot to convert timeout_proc_barrier to cond_signal (dlg, 2017-12-14, 1 file, -5/+3)
* replace the bare sleep state handling in barriers with wait cond code (dlg, 2017-12-14, 1 file, -8/+4)
* add timeout_barrier, which is like intr_barrier and taskq_barrier. (dlg, 2017-11-24, 1 file, -1/+41)

  if you're trying to free something that a timeout is using, you have
  to wait for that timeout to finish running before doing the free.
  timeout_del can stop a timeout from running in the future, but it
  doesn't know if a timeout has finished being scheduled and is now
  running.

  previously you could know that timeouts are not running by simply
  masking softclock interrupts on the cpu running the kernel.  however,
  code is now running outside the kernel lock, and timeouts can run in
  a thread instead of softclock.

  timeout_barrier solves the first problem by taking the kernel lock
  and then masking softclock interrupts.  that is enough to ensure that
  any further timeout processing is waiting for those resources to run
  again.

  the second problem is solved by having timeout_barrier insert work
  into the thread.  when that work runs, that means all previous work
  running in that thread has completed.

  fixes and ok visa@, who thinks this will be useful for his work too.
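  The teardown pattern this enables, as a sketch ("sc" is a
  hypothetical softc):

	timeout_del(&sc->sc_tmo);	/* stop future executions */
	timeout_barrier(&sc->sc_tmo);	/* wait out an execution already in flight */
	free(sc, M_DEVBUF, sizeof(*sc));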
* avoid holding timeout_mutex while interacting with the scheduler. (dlg, 2016-10-03, 1 file, -9/+16)

  as noted by haesbaert, this is necessary to avoid deadlocks because
  the scheduler can call back into the timeout subsystem while it's
  holding its own locks.

  this happened in two places.

  firstly, in softclock() it would take timeout_mutex to find pending
  work.  if that pending work needs a process context, it would queue
  the work for the thread and call wakeup, which enters the scheduler
  locks.  if another cpu is trying to tsleep (or msleep) with a timeout
  specified, the sleep code would be holding the sched lock and call
  timeout_add, which takes timeout_mutex.  this is solved by deferring
  the wakeup to after timeout_mutex is left.  this also has the benefit
  of mitigating the number of wakeups done per softclock tick.

  secondly, the timeout worker thread takes timeout_mutex and calls
  msleep when there's no work to do (ie, the queue is empty).  msleep
  will take the sched locks.  again, if another cpu does a tsleep with
  a timeout, you get a deadlock.  to solve this im using sleep_setup
  and sleep_finish to sleep on an empty queue, which is safe to do
  outside the lock as it is comparisons of the queue head pointers, not
  derefs of the contents of the queue.  as long as the sleeps and
  wakeups are ordered correctly with the enqueue and dequeue operations
  under the mutex, this all works.  you can think of the queue as a
  single descriptor ring, and the wakeup as an interrupt.

  the second deadlock was identified by guenther@

  ok tedu@ mpi@
* Introduce a new 'softclock' thread that will be used to execute timeout (mpi, 2016-09-22, 1 file, -12/+81)

  callbacks needing a process context.

  The function timeout_set_proc(9) has to be used instead of
  timeout_set(9) when a timeout callback needs a process context.

  Note that if such a timeout is waiting (that is, sleeping) for a
  non-negligible amount of time it might delay other timeouts needing a
  process context.

  dlg@ agrees with this as a temporary solution.

  Manpage tweaks from jmc@

  ok kettenis@, bluhm@, mikeb@
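  Sketch of the intended use: a callback that may sleep is registered
  with timeout_set_proc(9) so it runs in the new thread rather than at
  soft-interrupt level (mydev_task and sc are hypothetical):

	void	mydev_task(void *);	/* may sleep, e.g. takes an rwlock */

	timeout_set_proc(&sc->sc_tmo, mydev_task, sc);
	timeout_add_sec(&sc->sc_tmo, 1);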
* fix several places where calculating ticks could overflow. (tedu, 2016-07-06, 1 file, -11/+11)

  it's not enough to assign to an unsigned type because if the
  arithmetic overflows the compiler may decide to do anything.  so
  change all the long long casts to uint64_t so that we start with the
  right type.

  reported by Tim Newsham of NCC.  ok deraadt
* Avoid multiple evaluation of macro arguments in softclock() (stefan, 2016-06-23, 1 file, -7/+9)

  ok mikeb@ tedu@
* Prevent a round to zero in the timeout_add_...() functions.  Getting (bluhm, 2016-06-14, 1 file, -1/+15)

  an immediate timeout if a positive value is specified is unexpected
  behavior.  Defer calling the handler for at least one tick.  Do not
  change that timeout_add(0) gives you an immediate timeout.

  OK millert@ uebayasi@ tedu@
* Update ticks in hardclock(). (uebayasi, 2016-03-20, 1 file, -3/+1)

  OK mikeb@
* KNF: Remove a blank line. (uebayasi, 2016-03-17, 1 file, -2/+1)
* Move `ticks' declaration to sys/kernel.h. (uebayasi, 2015-07-20, 1 file, -2/+1)
* Remove some includes include-what-you-use claims don't (jsg, 2015-03-14, 1 file, -2/+1)

  have any direct symbols used.  Tested for indirect use by compiling
  amd64/i386/sparc64 kernels.

  ok tedu@ deraadt@
* make timeout_add and its wrappers return whether the timeout was scheduled (dlg, 2013-11-27, 1 file, -16/+20)

  in this call by returning 1, or a previous call by returning 0.

  this makes it easy to refcount the stuff we're scheduling a timeout
  for, and brings the api in line with what task_add(9) provides.

  ok mpi@ matthew@ mikeb@ guenther@
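  What the return value makes easy, sketched with a hypothetical
  reference-counting helper: take a reference only when this call
  actually scheduled the timeout.

	if (timeout_add(&sc->sc_tmo, hz))
		refcount_take(sc);	/* hypothetical helper: scheduled by this call */
	/* A return of 0 means an earlier call already scheduled it
	 * (and presumably already took the reference). */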
* Replace some XXX casts with an inline function that explains what's going on (guenther, 2013-10-06, 1 file, -4/+15)

  ok deraadt@
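  Presumably a helper of this shape (the name and member are
  assumptions): recover the enclosing timeout from its embedded circq
  entry instead of open-coding the cast.

	static inline struct timeout *
	timeout_from_circq(struct circq *p)
	{
		return ((struct timeout *)((char *)p -
		    offsetof(struct timeout, to_list)));
	}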
* Format string fix: Use %td for pointer difference (sf, 2013-10-02, 1 file, -2/+2)
* format string fix (sf, 2013-10-01, 1 file, -2/+2)

  to_arg is void *
* Fix a misaligned backslash (guenther, 2013-09-17, 1 file, -2/+2)
* Delete variable left over from the diagnostic code removed by previous commit (guenther, 2013-08-03, 1 file, -4/+2)

  pointed out by Artturi Alm (artturi.alm (at) gmail.com)
* Delete diagnostic code that reports timeout adjustments on resume. (guenther, 2012-06-02, 1 file, -13/+1)

  It was useful for tracking down the last devices which weren't
  deleting their timeouts on suspend and recreating them on resume, but
  it's too verbose to keep around.

  noted by deraadt@
* On resume, run forward the monotonic and realtime clocks instead of jumping (guenther, 2012-05-24, 1 file, -1/+47)

  just the realtime clock, triggering and adjusting timeouts to reflect
  that.

  ok matthew@ deraadt@