| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
We only see 8 characters of wmesg in e.g. top(1), so shorten the
string to fit.
Indirectly prompted by kettenis@.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To unlock getitimer(2) and setitimer(2) we need to protect the
per-process ITIMER_REAL state with something other than the kernel
lock. As the ITIMER_REAL timeout callback realitexpire() runs at
IPL_SOFTCLOCK the per-process mutex ps_mtx is appropriate.
In setitimer() we need to use ps_mtx instead of the global itimer_mtx
if the given timer is ITIMER_REAL. Easy.
The ITIMER_REAL timeout callback routine realitexpire() is trickier.
When we enter ps_mtx during the callback we need to check if the timer
was cancelled or rescheduled. A thread from the process can call
setitimer(2) at the exact moment the callback is about to run from
timeout_run() (see kern_timeout.c).
Update the locking annotation in sys/proc.h accordingly.
ok anton@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reimplement the ITIMER_REAL interval timer with a kclock timeout.
Couple things of note:
- We need to use the high-res nanouptime(9) call, not the low-res
getnanouptime(9).
- The code is simpler now that we aren't working with ticks.
Misc. thoughts:
- Still unsure if "kclock" is the right name for these things.
- MP-safely cancelling a periodic timeout is very difficult.
|
|
|
|
|
|
|
|
| |
If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".
Based on input from millert@ and kettenis@.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.
The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.
While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.
ok deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If itv.it_value is zero we cancel the timer. When we cancel the timer
we don't care about itv.it_interval because the timer is not running:
we don't use it, we don't look at it, etc.
To be on the paranoid side, I think we should zero itv.it_interval
when itv.it_value is zero. No need to write arbitrary values into the
process struct if we aren't required to. The standard is ambiguous
about what should happen in this case, i.e. the value of olditv after
the following code executes is unspecified:
struct itimerval newitv, olditv;
newitv.it_value.tv_sec = newitv.it_value.tv_usec = 0;
newitv.it_interval.tv_sec = newitv.it_interval.tv_usec = 1;
setitimer(ITIMER_REAL, &newitv, NULL);
getitimer(ITIMER_REAL, &olditv);
This change should not break any real code.
|
|
|
|
|
|
|
|
|
|
| |
timespecadd(3) is fast. There is no need to call getnanouptime(9)
repeatedly when searching for the next expiration point. Given that
it_interval is at least 1/hz, we expect to run through the loop maybe
hz times at most. Even at HZ=10000 that's pretty brief.
While we're here, pull *all* of the other logic out of the loop.
The only thing we need to do in the loop is timespecadd(3).
|
|
|
|
|
|
|
|
| |
- Consolidate variable declarations.
- Remove superfluous parentheses from return statements.
- Prefer sizeof(variable) to sizeof(type) for copyin(9)/copyout(9).
- Remove some intermediate pointers from sys_setitimer(). Using SCARG()
directly here makes it more obvious to the reader what you're copying.
|
|
|
|
|
|
|
| |
Now that the critical sections are merged we should call
getnanouptime(9) once. This makes an ITIMER_REAL timer swap atomic
with respect to the clock: the time remaining on the old timer is
computed with the same timestamp used to schedule the new timer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Merge the common code from sys_getitimer() and sys_setitimer() into a
new kernel subroutine, setitimer(). setitimer() performs all of the
error-free work for both system calls within a single critical
section. We need a single critical section to make the setitimer(2)
timer swap operation atomic relative to realitexpire() and hardclock(9).
The downside of the new atomicity is that the behavior of setitimer(2)
must change. With a single critical section we can no longer copyout(9)
the old timer before installing the new timer. So If SCARG(uap, oitv)
points to invalid memory, setitimer(2) now fail with EFAULT but the
new timer will be left running. You can see this in action with code
like the following:
struct itv, olditv;
itv.it_value.tv_sec = 1;
itv.it_value.tv_usec = 0;
itv.it_interval = itv.it_value;
/* This should EFAULT. 0x1 is probably an invalid address. */
if (setitimer(ITIMER_REAL, &itv, (void *)0x1) == -1)
warn("setitimer");
/* The timer will be running anyway. */
getitimer(ITIMER_REAL, &olditv);
printf("time left: %lld.%06ld\n",
olditv.it_value.tv_sec, olditv.it_value.tv_usec);
There is no easy way to work around this. Both FreeBSD's and Linux's
setitimer(2) implementations have a single critical section and they
too fail with EFAULT in this case and leave the new timer running.
I imagine their developers decided that fixing this error case was
a waste of effort. Without permitting copyout(9) from within a mutex
I'm not sure it is even possible to avoid it on OpenBSD without
sacrificing atomicity during a setitimer(2) timer swap.
Given the rarity of this error case I would rather have an atomic swap.
Behavior change discussed with deraadt@.
|
|
|
|
|
| |
if they are out of range, making it easier to isolate reason for EINVAL
ok cheloha
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
setitimer(2) works with timespecs in its critical section. It will be
easier to merge the two critical sections if getitimer(2) also works
with timespecs.
In particular, we currently read the uptime clock *twice* during a
setitimer(2) swap: we call getmicrouptime(9) in sys_getitimer() and
then call getnanouptime(9) in sys_setitimer(). This means that
swapping one timer in for another is not atomic with respect to the
uptime clock. It also means the two operations are working with
different time structures and resolutions, which is potentially
confusing.
If both critical sections work with timespecs we can combine the two
getnanouptime(9) calls into a single call at the start of the combined
critical section in a future patch, making the swap atomic with
respect to the clock.
So, in preparation, move the TIMESPEC_TO_TIMEVAL conversions in
getitimer(2) after the ITIMER_REAL conversion from absolute to
relative time, just before copyout(9). The ITIMER_REAL conversion
must then be done with timespec macros and getnanouptime(9), just like
in setitimer(2).
|
|
|
|
|
|
|
|
|
| |
If we're replacing the current ITIMER_REAL timer with a new one we
don't need to call timeout_del(9) before calling timeout_add(9).
timeout_add(9) does the work of timeout_del(9) implicitly if the
timeout in question is already pending.
This saves us an extra trip through the timeout_mutex.
|
|
|
|
|
|
|
|
|
|
|
| |
Rearrange the critical section in setitimer(2) to match that of
getitimer(2). This will make it easier to merge the two critical
sections in a subsequent diff.
In particular, we want to write the new timer value in *one* place in
the code, regardless of which timer we're setting.
ok millert@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For what are probably historical reasons, setitimer(2) does not
validate its input (itv) immediately after copyin(9). Instead, it
waits until after (possibly) performing a getitimer(2) to copy out the
state of the timer.
Consolidating copyin(9), input validation, and input conversion into a
single block before the getitimer(2) operation makes setitimer(2)
itself easier to read. It will also simplify merging the critical
sections of setitimer(2) and getitimer(2) in a subsequent patch.
This changes setitimer(2)'s behavior in the EINVAL case. Currently,
if your input (itv) is invalid, we return EINVAL *after* modifying the
output (olditv). With the patch we will now return EINVAL *before*
modifying the output. However, any code dependent upon this behavior
is broken: the contents of olditv are undefined in all setitimer(2)
error cases.
ok millert@
|
|
|
|
|
|
|
|
| |
The ITIMER_REAL per-process interval timer is protected by the kernel
lock. The ITIMER_REAL timeout (ps_realit_to), setitimer(2), and
getitimer(2) all run under the kernel lock. Entering itimer_mtx
during getitimer(2) when reading the ITIMER_REAL ps_timer state is
superfluous and misleading.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ITIMER_VIRTUAL and ITIMER_PROF per-process interval timers are
updated from hardclock(9). If a timer for the parent process is
enabled the hardclock(9) thread calls itimerdecr() to update and
reload it as needed.
However, in itimerdecr(), after entering itimer_mtx, the thread needs
to double-check that the timer in question is still enabled. While
the hardclock(9) thread is entering itimer_mtx a thread in
setitimer(2) can take the mutex and disable the timer.
If the timer is disabled, itimerdecr() should return 1 to indicate
that the timer has not expired and that no action needs to be taken.
ok kettenis@
|
|
|
|
|
|
|
| |
The current input validation for overflow is more complex than
it needs to be. We can flatten the conditional hierarchy into
a string of checks just one level deep. The result is easier
to read.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At securelevel 2 we prevent root from rewinding the kernel UTC clock.
The rationale given in the comment is that this prevents a compromised
root from setting arbitrary timestamps on files.
I can't really speak to the efficacy of this mitigation, or to the
efficacy of the securelevel concept in general, but the implementation
of this mitigation is wrong. We need to check:
timespeccmp(ts, &now, <=)
instead of
timespeccmp(ts, &now, <)
like we do now.
Time is a continuous value that is always advancing. We must prevent
root from setting the kernel UTC clock to its current value in
addition to prior values. Setting the UTC clock to its current value
amounts to rewinding it even if we cannot actually measure the
difference with a timespec.
With this change, at securelevel 2, root can no longer completely
freeze the UTC clock.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we recompute the scaling factor during tc_windup() there is an
opportunity for arithmetic overflow if the active timecounter's
adjfreq(2) adjustment is too large. If we limit the adjustment to
[-500000, +500000] ppm the statement in question cannot overflow.
In particular, we are concerned with the following bit of code:
scale = (u_int64_t)1 << 63;
scale += \
((th->th_adjustment + th->th_counter->tc_freq_adj) / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
where scale is an int64_t. Overflow when we do:
scale += (...) / 1024 * 2199;
as th->th_counter->tc_freq_adj is currently unbounded.
th->th_adjustment is limited to [-5000ppm, 5000ppm].
To see that overflow is prevented with the new bounds, consider the
new edge case where th->th_counter->tc_freq_adj is 500000ppm and
th->th_adjustment is 5000ppm. Both are of type int64_t. We have:
int64_t th_adjustment = (5000 * 1000) << 32; /* 21474836480000000 */
int64_t tc_freq_adj = 500000000LL << 32; /* 2147483648000000000 */
scale = (u_int64_t)1 << 63; /* 9223372036854775808 */
scale += (th_adjustment + tc_freq_adj) / 1024 * 2199;
/* scale += 2168958484480000000 / 1024 * 2199; */
/* scale += 4657753620480000000; */
9223372036854775808 + 4657753620480000000 = 13881125657334775808,
which less than 18446744073709551616, so we don't have overflow.
On the opposite end, if th->th_counter->tc_freq_adj is -500000ppm and
th->th_adjustment is -5000ppm we would have -4657753620480000000.
9223372036854775808 - 4657753620480000000 = 4565618416374775808.
Again, no overflow.
500000ppm and -500000ppm are extreme adjustments. otto@ says ntpd(8)
would never arrive at them naturally, so we are not at risk of breaking
a working setup by imposing these restrictions.
Documentation input from kettenis@.
No complaints from otto@.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't want resettodr(9) to write the RTC until inittodr(9) has
actually run. Until inittodr(9) calls tc_setclock() the system UTC
clock will contain a meaningless value and there's no sense in
overwriting a good value with a value we know is nonsense.
This is not an uncommon problem if you're debugging a problem in early
boot, e.g. a panic that occurs prior to inittodr(9).
Currently we use the following logic in resettodr(9) to inhibit writes:
if (time_second == 1)
return;
... this is too magical.
A better way to accomplish the same thing is to introduce a dedicated
flag set from inittodr(9). Hence, "inittodr_done".
Suggested by visa@.
ok kettenis@
|
| |
|
|
|
|
|
|
| |
of todr_handle.
OK kettenis@
|
|
|
|
|
| |
ok deraadt@, mpi@, visa@
ok cheloha@ as well (would have preferred in new file for this code)
|
|
|
|
|
|
|
|
|
|
| |
While here, rename the wait channel so the tsleep_nsec(9) call will fit
onto a single line. It isn't a global channel so the name is arbitrary
anyway.
With input from visa@.
ok visa@
|
|
|
|
|
|
|
|
|
| |
I broke adjfreq(2)'s atomic swap in kern_time.c,v1.112. By using the
"f" variable to store both the new and old frequency adjustments, the
new adjustment gets clobbered by the old adjustment if the caller asked
for a swap.
ok visa@ mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we return (1000000000 / hz) from clock_getres(2) as the
resolution for every clock. This is often untrue.
For CPUTIME clocks, if we have a separate statclock interrupt the
resolution is (1000000000 / stathz). Otherwise it is as we currently
claim: (1000000000 / hz).
For the REALTIME/MONOTONIC/UPTIME/BOOTTIME clocks the resolution is
that of the active timecounter. During tc_init() we can compute the
precision of a timecounter by examining its tc_counter_mask and store
it for lookup later in a new member, tc_precision. The resolution of
a clock backed by a timecounter "tc" is then
tc.tc_precision * (2^64 / tc.tc_frequency)
fractional seconds.
While here we can clean up sys_clock_getres() a bit.
Standards input from guenther@. Lots of input, feedback from
kettenis@.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For gettimeofday(2), always copy out an empty timezone struct. For
settimeofday(2), still copyin(9) the struct but ignore the contents.
In gettimeofday(2)'s case we have not changed the original BSD semantics:
the kernel only tracks UTC time without an offset for DST, so a zeroed
timezone struct is the correct thing to return to the caller.
Future work could move these out into libc as stubs for clock_gettime and
clock_settime(2). But, definitely a "later" thing, given that we are in
beta.
Update the manpage to de-emphasize the timezone parameters for these
syscalls.
Discussed with tedu@, deraadt@, millert@, kettenis@, yasuoka@, jca@, and
guenther@. Tested by job@. Ports input from jca@ and sthen@. Manpage
input from jca@.
ok jca@ deraadt@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Loongson runs at 128hz. 128 doesn't divide evenly into a million,
but it does divide evenly into a billion. So if we do the per-process
itimer bookkeeping with itimerspec structs we can have error-free
virtual itimers on loongson just as we do on most other platforms.
This change doesn't fix the virtual itimer error alpha, as 1024 does not
divide evenly into a billion. But this doesn't make the situation any
worse, either.
ok deraadt@
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Basically just make all the bintime routines look and behave more like
the timeradd(3) macros.
Switch to three-argument forms for structure math, introduce and use
bintimecmp(9), and rename the structure conversion routines to resemble
e.g. TIMEVAL_TO_TIMESPEC(3).
Document all of this in a new bintimeadd.9 page.
Code input from mpi@, manpage input from schwarze@.
code ok mpi@, docs ok schwarze@, docs probably still ok jmc@
|
|
|
|
|
|
|
|
|
| |
It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.
Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@
|
|
|
|
|
|
|
| |
Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.
ok visa@, cheloha@
|
|
|
|
| |
ok mlarkin, otto (who both had the same diff)
|
|
|
|
|
|
|
|
|
| |
clock_settime(2)/settimeofday(2) still need KERNEL_LOCK for a moment
when resetting the RTC, as that's done periodically from a task under
KERNEL_LOCK. Not quite sure how to approach that one yet.
ok visa@ mpi@, "good stuff" tedu@,
"please wait until after [tree] unlock" deraadt@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No other (known) BSD-derived adjtime(2) implementation checks for overflow
when converting delta into its final denomination of fractional seconds.
This is peculiar, as the call originates in 4.3BSD.
However, glibc, uclibc, and (to an extent) musl /do/ check the input and set
EINVAL if it exceeds a certain bound, so we'll just use the errno that they
use to be consistent with extant practice.
Prompted by the comment kettenis@ left when we switched to storing the
adjustment in an int64_t like ~5 years ago (kern_time.c,v 1.87).
Positive feedback from deraadt@, manpage bits ok jmc@,
no code complaints from otto@ or tedu@.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tc_lock allows adjfreq(2) and the kern.timecounter.hardware sysctl(2)
to read/write the active timecounter pointer and the .tc_adj_freq
member of the active timecounter safely. This eliminates any possibility
of a torn read/write for the .tc_adj_freq member when we drop the
KERNEL_LOCK from the timecounting layer. It also ensures the active
timecounter does not change in the midst of an adjfreq(2) call.
Because these are not high-traffic paths, we can get away with using
tc_lock in write-mode to ensure combination read/write adjtime(2) calls
are relatively atomic (a) to other writer adjtime(2) calls, and (b) to
settimeofday(2)/clock_settime(2) calls, which cancel ongoing adjtime(2)
adjustment.
When the KERNEL_LOCK is dropped, an unprivileged user will be able to
create some tc_lock contention via adjfreq(2); it is very unlikely to
ever be a problem. If it ever is actually a problem a lockless read
could be added to address it.
While here, reorganize sys_adjfreq()/sys_adjtime() to minimize code
under the lock. Also while here, make tc_adjfreq() void, as it cannot
fail under any circumstance. Also also while here, annotate various
globals/struct members with lock ordering details.
With lots of input from mpi@ and visa@.
ok visa@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This will simplify upcoming MP-safety diffs for the timecounting layer.
adjtimedelta is now accessed nowhere outside of kern_tc.c, so we can
remove its extern declaration from kernel.h. Zeroing adjtimedelta
within timecounter_mtx before we jump the real-time clock is also a
bit safer than what we do now, as we are not racing a simultaneous
tc_windup() call from hardclock(), which itself can modify adjtimedelta
via ntp_update_second().
Discussed with visa@ and mpi@.
ok visa@
|
|
|
|
|
| |
add locking in clock_gettime where needed.
ok cheloha matthew
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
Add documentation for the new EINVAL cases for adjtime(2) and
settimeofday(2).
adjtime.2 docs ok schwarze@,
settimeofday(2)/clock_settime(2) stuff ok tedu@,
"stop waiting" deraadt@
|
|
|
|
| |
ok jca@ visa@ guenther@ deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
| |
tsleep(9)'s maximum timeout shrinks as HZ grows, so this ensures we do
not return early from longer timeouts on alpha or on custom kernels.
POSIX says you cannot return early unless a signal is delivered, so
this makes us more compliant with the standard.
While here, remove the 100 million second upper bound. It is an
artifact from itimerfix() and it serves no discernible purpose.
ok tedu@ visa@
|
| |
|
|
|
|
| |
ok tb@ kettenis@
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of converting timespec -> timeval and truncating the input,
check with timespecfix and use tstohz(9) for the tsleep.
All other contemporary systems check this correctly.
Also add a regression test for this case.
ok tb@
|