The timecounter struct is large and I think it may change in the
future. Changing it later will be easier if we use C99-style
initialization for all timecounter structs. It also makes reading the
code a bit easier.
For reasons I cannot explain, switching to C99-style initialization
sometimes changes the hash of the resulting object file, even though
the resulting struct should be the same. So there is a binary change
here, but only sometimes. No behavior should change in either case.
I can't compile-test this everywhere but I have been staring at the
diff for days now and I'm relatively confident this will not break
compilation. Fingers crossed.
ok gnezdo@
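
A minimal sketch of the style in question, modeled on the dummy
counter (the exact member set of struct timecounter is an assumption
here):

	static struct timecounter dummy_timecounter = {
		.tc_get_timecount = dummy_get_timecount,
		.tc_counter_mask = ~0u,		/* usable counter bits */
		.tc_frequency = 1000000,	/* ticks per second */
		.tc_name = "dummy",
		.tc_quality = -1000000,		/* never auto-selected */
	};

Members omitted from a designated initializer are zeroed, so adding a
new struct member later no longer means touching every timecounter
definition in the tree.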
|
ok gkoehler@
|
vmstat(8) uses kvm_read(3) to extract the naptime from the kernel.
Problem is, I deleted `naptime' from the global namespace when I moved
it into the timehands. This patch restores it. It gets updated from
tc_windup(). Only userspace should use it, and only when the kernel
is dead.
We need to tweak a variable in tc_setclock() to avoid shadowing the
(once again) global naptime.
|
- Use real variable names like "utc" and "uptime" instead of non-names
like "bt" and "bt2"
- Move the TIMESPEC_TO_BINTIME(9) conversions out of the critical
section
- Sprinkle in a little whitespace
- Sort automatic variables according to style(9)
|
Using getmicrotime(9) or getnanotime(9) is perfectly appropriate in
certain contexts. The programmer needs to weigh the overhead savings
against the reduced accuracy and decide whether the low-res interfaces
are appropriate.
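
A sketch of the choice (surrounding context hypothetical; the comments
describe the documented behavior of the two interfaces):

	struct timespec now;

	getnanotime(&now);	/* cheap: copies the cached timehands value */
	nanotime(&now);		/* precise: reads the hardware timecounter */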
|
If th0.th_generation == th1.th_generation when we update the user
timekeep page, then tk_generation doesn't change, so libc may
calculate the wrong time. Now th0 and th1 share the sequence so
th0.th_generation != th1.th_generation.
ok kettenis@ cheloha@
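
For context, a sketch of the consumer-side read loop that relies on
this guarantee (member and barrier names are assumptions):

	unsigned int gen;

	do {
		gen = tk->tk_generation;
		membar_consumer();
		/* ... copy out scale, offset, counter timestamp ... */
		membar_consumer();
	} while (gen == 0 || gen != tk->tk_generation);

If th0 and th1 could publish the same generation number back-to-back,
this loop could read a mix of the two updates without noticing.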
|
The adjtime(2) adjustment is applied at up to 5000ppm/sec from
tc_windup(). At the start of each UTC second, ntp_update_second() is
called from tc_windup() and up to 5000ppm worth of skew is deducted
from the timehands' th_adjtimedelta member and moved to the
th_adjustment member. The resulting th_adjustment value is then mixed
into the th_scale member and thus the system UTC time is slowly nudged
in a particular direction.
This works pretty well.  The only issues have to do with the use of
the edge of the UTC second as the start of the ntp_update_second()
period:
1. If the UTC clock jumps forward we can get stuck in a loop calling
ntp_update_second() from tc_windup(). We work around this with
a magic number, LARGE_STEP. If the UTC clock jumps forward more
than LARGE_STEP seconds we truncate the number of iterations to 2.
Per the comment in tc_windup(), we do 2 iterations instead of 1
iteration to account for a leap second we may have missed. This is
an anachronism: the OpenBSD kernel does not handle leap seconds
anymore.
Such jumps happen during settimeofday(2), during boot when we jump
the clock from zero to the RTC time, and during resume when we jump
the clock to the RTC time (again). They are unavoidable.
2. Changes to adjtime(2) are applied asynchronously. For example, if
you try to cancel the ongoing adjustment...
	struct timeval zero = { 0, 0 };
	adjtime(&zero, NULL);
... it can take up to one second for the adjustment to be cancelled.
In the meantime, the skew continues. This delayed application is not
intuitive or documented.
3. Adjustment is deducted from th_adjtimedelta across suspends of fewer
than LARGE_STEP seconds, even though we do not skew the clock while
we are suspended. This is unintuitive, incorrect, and undocumented.
We can avoid all of these problems by applying the adjustment along
an arbitrary period on the runtime clock instead of the UTC clock.
1. The runtime clock doesn't jump arbitrary amounts, so we never get
stuck in a loop and we don't need a magic number to test for this
possibility. With the removal of the magic number LARGE_STEP we
can also remove the leap second handling from the tc_windup() code.
2. With a new timehands member, th_next_ntp_update, we can track when
the next ntp_update_second() call should happen on the runtime clock.
This value can be updated during the adjtime(2) system call, so
changes to the skew happen *immediately* instead of up to one second
after the adjtime(2) call.
3. The runtime clock does not jump across a suspend: no skew is
deducted from th_adjtimedelta for any time we are offline and
unable to adjust the clock.
otto@ says the use of the runtime clock should not be a problem for
ntpd(8) or the NTP algorithm in general.
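
A sketch of the windup logic this yields; the bintimecmp(9) usage and
the exact advance are assumptions:

	/* in tc_windup(), after th_offset has been updated */
	while (bintimecmp(&th->th_next_ntp_update, &th->th_offset, <=)) {
		ntp_update_second(th);
		th->th_next_ntp_update.sec++;
	}

Because the runtime clock advances monotonically and in small steps,
the loop terminates quickly and needs no LARGE_STEP escape hatch.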
|
This diff exposes parts of clock_gettime(2) and gettimeofday(2) to
userland via libc, liberating processes from the need for a context
switch every time they want to count the passage of time.
If a timecounter clock can be exposed to userland then it needs to set
its tc_user member to a non-zero value. Tested with one or multiple
counters per architecture.
The timing data is shared through a pointer found in the new ELF
auxiliary vector AUX_openbsd_timekeep containing timehands information
that is frequently updated by the kernel.
Timing differences between the last kernel update and the current time
are adjusted in userland by the tc_get_timecount() function inside the
MD usertc.c file.
This permits a much more responsive environment, quite visible in
browsers, office programs and gaming (apparently one is able to fly
in Minecraft now).
Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others!
OK from at least kettenis@, cheloha@, naddy@, sthen@
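
A sketch of that userland fixup (the tk_-prefixed member names are
assumptions; the carry mirrors the kernel's bintime addition and
assumes tc_windup() ran recently enough for the product to fit):

	u_int delta;
	struct bintime bt;

	delta = (tc_get_timecount(tk) - tk->tk_offset_count) &
	    tk->tk_counter_mask;
	bt = tk->tk_offset;
	bt.frac += tk->tk_scale * delta;	/* wraps mod 2^64 */
	if (bt.frac < tk->tk_offset.frac)
		bt.sec++;			/* carry into seconds */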
|
capital letters in locking annotations. Therefore harmonize the existing
annotations.
Also, if multiple locks are required they should be delimited using
commas.
ok mpi@
|
The dummy counter should be deterministic with respect to interrupts
and multiple threads of execution.
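
One way to get that determinism is an atomic increment, sketched here
with atomic_inc_int_nv(9) (the committed code may differ):

	static u_int
	dummy_get_timecount(struct timecounter *tc)
	{
		static u_int now;

		/* a unique, ordered tick for every caller, on any CPU,
		 * in or out of interrupt context */
		return atomic_inc_int_nv(&now);
	}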
|
time_second(9) has been replaced in the kernel by gettime(9).
time_uptime(9) has been replaced in the kernel by getuptime(9).
New code should use the replacement interfaces. They do not suffer
from the split-read problem inherent to the time_* variables on 32-bit
platforms.
The variables remain in sys/kern/kern_tc.c for use via kvm(3) when
examining kernel core dumps.
This commit completes the deprecation process:
- Remove the extern'd definitions for time_second and time_uptime
from sys/time.h.
- Replace manpage cross-references to time_second(9)/time_uptime(9)
with references to microtime(9) or a related interface.
- Move the time_second.9 manpage to the attic.
With input from dlg@, kettenis@, visa@, and tedu@.
ok kettenis@
|
time_second and time_uptime are used widely in the tree. This is a
problem on 32-bit platforms because time_t is 64-bit, so there is a
potential split-read whenever they are used at or below IPL_CLOCK.
Here are two replacement interfaces: gettime(9) and getuptime(9).
The "get" prefix signifies that they do not read the hardware
timecounter, i.e. they are fast and low-res. The lack of a unit
(e.g. micro, nano) signifies that they yield a plain time_t.
As an optimization on LP64 platforms we can just return time_second or
time_uptime, as a single read is atomic. On 32-bit platforms we need
to do the lockless read loop and get the values from the timecounter.
In a subsequent diff these will be substituted for time_second and
time_uptime almost everywhere in the kernel.
With input from visa@ and dlg@.
ok kettenis@
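
A sketch of getuptime(9) along those lines (the 32-bit path's member
names are assumptions):

	time_t
	getuptime(void)
	{
	#if defined(__LP64__)
		return time_uptime;	/* a single atomic read */
	#else
		struct timehands *th;
		time_t now;
		u_int gen;

		do {
			th = timehands;
			gen = th->th_generation;
			membar_consumer();
			now = th->th_offset.sec;
			membar_consumer();
		} while (gen == 0 || gen != th->th_generation);

		return now;
	#endif
	}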
|
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which is already
included by all enqueue_randomness() users.
|
When we resume from a suspend we use the time from the RTC to advance
the system offset. This changes the UTC time to match what the RTC has given
us while increasing the system uptime to account for the time we were
suspended.
Currently we decide whether to change to the RTC time in tc_setclock()
by comparing the new offset with the th_offset member. This is wrong.
th_offset is the *minimum* possible value for the offset, not the "real
offset". We need to perform the comparison within tc_windup() after
updating th_offset, otherwise we might rewind said offset.
Because we're now doing the comparison within tc_windup() we ought to
move naptime into the timehands. This means we now need a way to safely
read the naptime to compute the value of CLOCK_UPTIME for userspace.
Enter nanoruntime(9); it increases monotonically from boot but does not
jump forward after a resume like nanouptime(9).
|
Missing piece of tickless timeout revert.
|
Reverted with backout of tickless timeouts.
Original commit message:
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.
ok visa@
|
It appears to have caused major performance regressions all over the
network stack.
Reported by bluhm@
ok deraadt@
|
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.
ok visa@
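
A sketch of the scale recompute with the constant skew folded in every
time (modeled on the existing tc_windup() arithmetic; treat the details
as an approximation):

	scale = (u_int64_t)1 << 63;
	scale += (th->th_adjustment / 1024) * 2199;
	scale += th->th_counter->tc_freq_adj;	/* reread, never cached */
	scale /= th->th_counter->tc_frequency;
	th->th_scale = scale * 2;

Because tc_freq_adj no longer has to drain through ntp_update_second(),
a pathological adjfreq(2) value can no longer wedge the clock.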
|
Rebase the timeout wheel on the system uptime clock. Timeouts are now
set to run at or after an absolute time as returned by nanouptime(9).
Timeouts are thus "tickless": they expire at a real time on that clock
instead of at a particular value of the global "ticks" variable.
To facilitate this change the timeout struct's .to_time member becomes a
timespec. Hashing timeouts into a bucket on the wheel changes slightly:
we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of
subseconds (.tv_nsec). 7 bits of subseconds means the width of the
lowest wheel level is now 2 seconds on all platforms and each bucket in
that lowest level corresponds to 1/128 seconds on the uptime clock.
These values were chosen to closely align with the current 100hz
hardclock(9) typical on almost all of our platforms. At 100hz a bucket
is currently ~1/100 seconds wide on the lowest level and the lowest
level itself is ~2.56 seconds wide. Not a huge change, but a change
nonetheless.
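
A sketch of that hash (the function name is hypothetical):

	static uint32_t
	timeout_hash(const struct timespec *ts)
	{
		uint32_t sec = ts->tv_sec & ((1U << 25) - 1);
		uint32_t sub = ts->tv_nsec / (1000000000 / 128);

		return (sec << 7) | sub;	/* 25 bits sec, 7 subsec */
	}

The 7 subsecond bits give the 1/128-second buckets; together with the
lowest second bit they span the 2-second-wide bottom wheel level.
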
Because a bucket no longer corresponds to a single tick, more than one
bucket may be dumped during an average timeout_hardclock_update() call.
On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you
dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket,
but you are doing extra work in softclock() to reschedule timeouts
that aren't due yet.
To avoid changing current behavior all timeout_add*(9) interfaces
convert their timeout interval into ticks, compute an equivalent
timespec interval, and then add that interval to the timestamp of
the most recent timeout_hardclock_update() call to determine an
absolute deadline. So all current timeouts still "use" ticks,
but the ticks are faked in the timeout layer.
A new interface, timeout_at_ts(9), is introduced here to bypass this
backwardly compatible behavior. It will be used in subsequent diffs
to add absolute timeout support for userland and to clean up some of
the messier parts of kernel timekeeping, especially at the syscall
layer.
Because timeouts are based on the uptime clock, they are subject to
NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy
adjfreq(2) adjustment set this will not change the expiration behavior
of your timeouts.
Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
Additional amd64 testing from anton@ and visa@. Octeon testing from visa@.
macppc testing from me.
Positive feedback from deraadt@, ok visa@
|
Currently we return (1000000000 / hz) from clock_getres(2) as the
resolution for every clock. This is often untrue.
For CPUTIME clocks, if we have a separate statclock interrupt the
resolution is (1000000000 / stathz). Otherwise it is as we currently
claim: (1000000000 / hz).
For the REALTIME/MONOTONIC/UPTIME/BOOTTIME clocks the resolution is
that of the active timecounter. During tc_init() we can compute the
precision of a timecounter by examining its tc_counter_mask and store
it for lookup later in a new member, tc_precision. The resolution of
a clock backed by a timecounter "tc" is then
tc.tc_precision * (2^64 / tc.tc_frequency)
fractional seconds.
While here we can clean up sys_clock_getres() a bit.
Standards input from guenther@. Lots of input, feedback from
kettenis@.
ok kettenis@
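
A sketch of both steps (modeled on the description above; treat it as
an approximation of the committed code):

	u_int64_t tmp;
	struct bintime bt;

	/* tc_init(): precision is the lowest settable counter bit */
	for (tmp = 1; (tmp & tc->tc_counter_mask) == 0; tmp <<= 1)
		continue;
	tc->tc_precision = tmp;

	/* sys_clock_getres(): resolution as a bintime fraction,
	 * i.e. tc_precision * (2^64 / tc_frequency) */
	bt.sec = 0;
	bt.frac = tc->tc_precision *
	    (((u_int64_t)1 << 63) / tc->tc_frequency) * 2;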
|
Wanted for upcoming process accounting changes, maybe useful elsewhere.
ok bluhm@ millert@
|
Basically just make all the bintime routines look and behave more like
the timeradd(3) macros.
Switch to three-argument forms for structure math, introduce and use
bintimecmp(9), and rename the structure conversion routines to resemble
e.g. TIMEVAL_TO_TIMESPEC(3).
Document all of this in a new bintimeadd.9 page.
Code input from mpi@, manpage input from schwarze@.
code ok mpi@, docs ok schwarze@, docs probably still ok jmc@
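
The three-argument shape, sketched with bintimeadd(9); the overflow
test runs before the fraction is written, so the destination may alias
an operand:

	void
	bintimeadd(const struct bintime *a, const struct bintime *b,
	    struct bintime *c)
	{
		c->sec = a->sec + b->sec;
		if (a->frac > ~b->frac)	/* fraction sum wraps: carry */
			c->sec++;
		c->frac = a->frac + b->frac;
	}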
|
Call it "tc_list" instead of "timecounters", which is too similar to
the variable "timecounter" for my taste.
ok mpi@ visa@
|
The dummy counter is a stopgap during boot. It is not useful after a
real timecounter is attached and started and there is no reason to return
to using it.
So don't even offer it to the admin. This is easy: never add it to the
timecounter list. It will effectively cease to exist after the first real
timecounter is activated in tc_init().
In principle this means that we can have an empty timecounter list so we
need to check for that case in sysctl_tc_choice().
"I don't mind" mpi@, ok visa@
|
Reduces the worst-case error for time values retrieved via the
microtime(9) functions from 10 ticks to 2 ticks. Being interrupted
for over a tick is unlikely but possible.
While here use C99 initializers.
From FreeBSD r303383.
ok mpi@
|
We ought to conform to the windup_mtx protocol and call tc_windup() even
if we aren't changing the system uptime.
|
tc_lock allows adjfreq(2) and the kern.timecounter.hardware sysctl(2)
to read/write the active timecounter pointer and the .tc_adj_freq
member of the active timecounter safely. This eliminates any possibility
of a torn read/write for the .tc_adj_freq member when we drop the
KERNEL_LOCK from the timecounting layer. It also ensures the active
timecounter does not change in the midst of an adjfreq(2) call.
Because these are not high-traffic paths, we can get away with using
tc_lock in write-mode to ensure combination read/write adjtime(2) calls
are relatively atomic (a) to other writer adjtime(2) calls, and (b) to
settimeofday(2)/clock_settime(2) calls, which cancel ongoing adjtime(2)
adjustment.
When the KERNEL_LOCK is dropped, an unprivileged user will be able to
create some tc_lock contention via adjfreq(2); it is very unlikely to
ever be a problem. If it ever is actually a problem a lockless read
could be added to address it.
While here, reorganize sys_adjfreq()/sys_adjtime() to minimize code
under the lock. Also while here, make tc_adjfreq() void, as it cannot
fail under any circumstance. Also also while here, annotate various
globals/struct members with lock ordering details.
With lots of input from mpi@ and visa@.
ok visa@
|
adjtimedelta is 64-bit and thus can't be read/written atomically on all
architectures. Because it can be modified from tc_windup() and
ntp_update_second() we need a way to ensure safe reads/writes for
adjtime(2) callers. One solution is to move it into the timehands and
adopt the lockless read protocol we now use for the system boot time and
uptime.
So make new_adjtimedelta an argument to tc_windup() and add a lockless
read loop to tc_adjtime(). With adjtimedelta stored in the timehands
we can now simply pass a timehands pointer to ntp_update_second(). This
makes ntp_update_second() safer as we're using the timehands' timecounter
pointer instead of the mutable global timecounter pointer.
Lots of input from mpi@ and visa@.
ok visa@
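
A sketch of the lockless read added to tc_adjtime(), following the
same generation protocol as the uptime functions (member names
assumed):

	struct timehands *th;
	int64_t delta;
	u_int gen;

	do {
		th = timehands;
		gen = th->th_generation;
		membar_consumer();
		delta = th->th_adjtimedelta;
		membar_consumer();
	} while (gen == 0 || gen != th->th_generation);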
|
This will make upcoming MP-related diffs smaller and should make the code
in kern_tc.c easier to read in general. "windup_mtx" is also a better
mnemonic: always call tc_windup() before leaving windup_mtx.
|
We need to perform the actual modification of the boot offset and the
time-of-boot within the "safe zone" in tc_windup() where the timehands'
generation is zero to conform to the timehands lockless read protocol.
Based on FreeBSD r303387.
Discussed with mpi@ and visa@.
ok visa@
|
This will simplify upcoming MP-safety diffs for the timecounting layer.
adjtimedelta is now accessed nowhere outside of kern_tc.c, so we can
remove its extern declaration from kernel.h. Zeroing adjtimedelta
within timecounter_mtx before we jump the real-time clock is also a
bit safer than what we do now, as we are not racing a simultaneous
tc_windup() call from hardclock(), which itself can modify adjtimedelta
via ntp_update_second().
Discussed with visa@ and mpi@.
ok visa@
|
tc_windup() is not necessarily called with KERNEL_LOCK, so it is possible
for the timecounter pointer to change in the midst of the call via the
kern.timecounter.hardware sysctl(2). Reading it once and using that local
copy ensures we're referring to the same timecounter consistently.
Apparently the compiler can optimize this out... somehow... so there may
be room for improvement.
Idea from visa@. With input from visa@, mpi@, cjeker@, and guenther@.
ok visa@ mpi@
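
The fix itself is a one-line snapshot at the top of tc_windup(),
sketched here:

	struct timecounter *active_tc = timecounter;	/* read once */

	/* ... the rest of the call uses active_tc, never the global */

As the message notes, a sufficiently clever compiler may still turn the
local back into repeated loads of the global unless the read is made
volatile or otherwise ordered.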
|
When we come back from suspend/hibernate the BIOS/firmware/whatever can
hand us *any* TOD, so we need to check that the given TOD doesn't set our
boot offset backwards, breaking the monotonicity of e.g. CLOCK_MONOTONIC.
This is trivial to do from the BIOS on most PCs before unhibernating.
There might be other ways it can happen, accidentally or otherwise.
This is a bit messy but it can be made prettier later with a "bintimecmp"
macro or something like that.
Problem confirmed by jmatthew@.
"you are very likely right" deraadt@
|
If a user thread from e.g. clock_settime(2) is in the midst of changing
the boottime or calling tc_windup() when it is interrupted by hardclock(9),
the timehands could be left in a damaged state.
So protect tc_windup() calls with a mutex, timecounter_mtx. hardclock(9)
merely attempts to enter the mutex instead of spinning because it cannot
afford to wait around. In practice hardclock(9) will skip tc_windup() very
rarely, and when it does skip there aren't any negative effects because the
skip indicates that a user thread is already calling, or about to call,
tc_windup() anyway.
Based on FreeBSD r303387 and NetBSD sys/kern/kern_tc.c,v1.30
Discussed with mpi@ and visa@. Tons of nice technical detail about
lockless reads from visa@.
OK visa@
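
A sketch of the hardclock(9) side (mutex name from the message; the
tc_windup() signature is an assumption):

	if (mtx_enter_try(&timecounter_mtx)) {
		tc_windup();
		mtx_leave(&timecounter_mtx);
	}
	/* on failure: a settime/adjtime caller holds the mutex and is
	 * winding up the timehands itself */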
|
To protect the timehands we first need to protect the basis for all UTC
time in the kernel: the boottime.
Because the boottime can be changed at any time it needs to be versioned
along with the other members of the timehands to enable safe lockless reads
when using it for anything. So the global boottime timespec goes away and
the static boottimebin becomes a member of the timehands. Instead of reading
the global boottime you use one of two interfaces: binboottime(9) or
microboottime(9). nanoboottime(9) can trivially be added later, though there
are no consumers for it at the moment.
This introduces one small change in behavior. We used to advance the
reported boottime just before launching kernel threads from main().
This makes it look to userland like we "booted" moments before those
threads were launched. Because there is no longer a boottime global we
can no longer trivially do this from main(), so the boottime we report
to userspace via e.g. kern.boottime will now reflect whatever the time
was when we bootstrapped the timehands via inittodr(9). This is usually
no more than a minute before the kernel threads are launched from main().
The prior behavior can be restored by adding a new interface to the
timecounter layer in a future commit.
Based on FreeBSD r303387.
Discussed with mpi@ and visa@.
ok visa@
|
membar_producer() into tc_windup() and membar_consumer() into the
uptime functions. They order the visibility of the time and
generation number updates.
This is a combination of what NetBSD and FreeBSD do.
OK kettenis@
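
A sketch of the producer side in tc_windup() (generation handling
modeled on the usual lockless protocol):

	ogen = th->th_generation;
	th->th_generation = 0;		/* mark the hands inconsistent */
	membar_producer();

	/* ... update th_offset, th_scale, th_boottime ... */

	if (++ogen == 0)
		ogen = 1;		/* 0 is reserved for "in progress" */
	membar_producer();
	th->th_generation = ogen;

Readers do the mirror image: load the generation, membar_consumer(),
copy the fields, membar_consumer(), then retry if the generation was 0
or changed.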
|
ok tb@ kettenis@
|
this gets rid of the source annotation which doesn't really add
anything other than adding complexity. randomness is generally
good enough that the few extra bits that the source type would
add are not worth it.
ok mikeb@ deraadt@
|
ok jca@ deraadt@
|
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
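
a sketch with hypothetical values (the constant is 2^64/1000000, the
usual microseconds-to-bintime factor):

	/* before: ~1e6 * ~1.8e13 exceeds LLONG_MAX, so the signed
	 * multiply is undefined behavior */
	bt.frac = tv->tv_usec * (long long)18446744073709LL;

	/* after: unsigned from the start; it wraps mod 2^64, which is
	 * exactly what a bintime fraction wants */
	bt.frac = tv->tv_usec * (uint64_t)18446744073709ULL;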
|
ok mpi@ kspillner@
|
after discussions with beck deraadt kettenis.
|
this license change. We will remember that we all still like beer.
|
microsecond in a 64-bit integer. Fixes the issue where ntpd loses sync
because the struct timeval currently used to hold the adjustment is not
properly normalized after the changes guenther@ made.
ok guenther@, millert@