| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
| |
Simplify MD code and reduce the amount of recursion into the signal code
which helps when dealing with locks.
ok cheloha@, deraadt@
|
|
|
|
|
|
|
| |
Requires sysctl_bounded_arr branch to support sysctl_rdint.
The read-only variables are marked by an empty range of [1, 0].
OK millert@
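The read-only convention above can be sketched in userland C. This is only an illustration under assumptions: `struct bounded_var`, `bounded_is_rdonly()` and `bounded_access()` are hypothetical names, not the kernel's sysctl_bounded_arr interface, and the errno values are illustrative choices.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of a bounded sysctl table. */
struct bounded_var {
	int	*var;		/* backing variable */
	int	 minimum;	/* lowest accepted value */
	int	 maximum;	/* highest accepted value */
};

/* An empty range such as [1, 0] marks the entry as read-only. */
int
bounded_is_rdonly(const struct bounded_var *bv)
{
	return (bv->minimum > bv->maximum);
}

/*
 * Copy the current value to *oldp (if non-NULL), then store *newp
 * (if non-NULL). Returns 0 on success, EPERM for a write to a
 * read-only entry and EINVAL for an out-of-range value. The errno
 * choices are assumptions made for this sketch.
 */
int
bounded_access(struct bounded_var *bv, int *oldp, const int *newp)
{
	if (oldp != NULL)
		*oldp = *bv->var;
	if (newp == NULL)
		return (0);
	if (bounded_is_rdonly(bv))
		return (EPERM);
	if (*newp < bv->minimum || *newp > bv->maximum)
		return (EINVAL);
	*bv->var = *newp;
	return (0);
}
```

Encoding read-only as an empty range means the table needs no extra flag field: any entry whose minimum exceeds its maximum can never accept a write.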
|
|
|
|
|
|
|
| |
Based/previous work on an idea from deraadt@
Input from claudio@, djm@, deraadt@, sthen@
OK deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This introduces a system-wide mutex that serializes msgbuf operations.
The mutex controls access to all modifiable fields of struct msgbuf.
It also covers logsoftc.sc_state.
To avoid adding extra lock order constraints that would affect use of
printf(9), the code does not take new locks when the log mutex is held.
The code assumes that there is at most one thread using logread(). This
keeps the logic simple. If there was more than one reader, logread()
might return the same data to different readers. Also, log wakeup might
not be reliable with multiple threads.
Tested in snaps for two weeks.
OK mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A kclock timeout is a timeout that expires at an absolute time on one
of the kernel's clocks. A timeout's absolute expiration time is kept
in a new member of the timeout struct, to_abstime. The timeout's
kclock is set at initialization and is kept in another new member of
the timeout struct, to_kclock.
Kclock timeouts are desirable because they have nanosecond
resolution, regardless of the value of hz(9). The timecounter
subsystem is also inherently NTP-sensitive, so timeouts scheduled
against the subsystem are NTP-sensitive. These two qualities
guarantee that a kclock timeout will never expire early.
Currently there is support for one kclock, KCLOCK_UPTIME (the uptime
clock). Support for KCLOCK_RUNTIME (the runtime clock) and KCLOCK_UTC
(the UTC clock) is planned for the future.
Support for these additional kclocks will allow us to implement some
of the POSIX interfaces OpenBSD is missing, e.g. clock_nanosleep() and
timer_create(). We could also use it to provide proper absolute
timeouts for e.g. pthread_mutex_timedlock(3).
Kclock timeouts are initialized with timeout_set_kclock(). They can
be scheduled with either timeout_in_nsec() (relative timeout) or
timeout_at_ts() (absolute timeout). They are incompatible with
timeout_add(9), timeout_add_sec(9), timeout_add_msec(9),
timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9).
They can be cancelled with timeout_del(9) or timeout_del_barrier(9).
Documentation for the new interfaces is a work in progress.
For now, tick-based timeouts remain supported alongside kclock
timeouts. They will remain supported until we are certain we don't
need them anymore. It is possible we will never remove them. I would
rather not keep them around forever, but I cannot predict what
difficulties we will encounter while converting tick-based timeouts to
kclock timeouts. There are a *lot* of timeouts in the kernel.
Kclock timeouts are more costly than tick-based timeouts:
- Calling timeout_in_nsec() incurs a call to nanouptime(9). Reading
the hardware timecounter is too expensive in some contexts, so care
must be taken when converting existing timeouts.
We may add a flag in the future to cause timeout_in_nsec() to use
getnanouptime(9) instead of nanouptime(9), which is much cheaper.
This may be appropriate for certain classes of timeouts. TCP/IP
session timeouts come to mind.
- Kclock timeout expirations are kept in a timespec. Timespec
arithmetic has more overhead than 32-bit tick arithmetic, so
processing kclock timeouts during softclock() is more expensive.
On my machine the overhead for processing a tick-based timeout is
~125 cycles. The overhead for a kclock timeout is ~500 cycles.
The overhead difference on 32-bit platforms is unknown. If it
proves too large we may need to use a 64-bit value to store the
expiration time. More measurement is needed.
Priority targets for conversion are setitimer(2), *sleep_nsec(9), and
the kevent(2) EVFILT_TIMER timers. Others will follow.
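The idea of keeping an absolute expiration time in a timespec can be sketched in userland C. This is a simplified illustration, not the kernel code: `struct mini_timeout`, `ts_add_nsec()`, `mini_timeout_in_nsec()` and `mini_timeout_expired()` are hypothetical names, and the caller passes in the current clock reading instead of calling nanouptime(9).

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical miniature of a timeout that carries an absolute expiry. */
struct mini_timeout {
	struct timespec	abstime;	/* plays the role of to_abstime */
};

/* Add a relative interval in nanoseconds to a timespec. */
struct timespec
ts_add_nsec(struct timespec ts, uint64_t nsecs)
{
	ts.tv_sec += (time_t)(nsecs / 1000000000ULL);
	ts.tv_nsec += (long)(nsecs % 1000000000ULL);
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec++;
		ts.tv_nsec -= 1000000000L;
	}
	return (ts);
}

/* Relative scheduling: turn "now + nsecs" into an absolute expiry. */
void
mini_timeout_in_nsec(struct mini_timeout *to, struct timespec now,
    uint64_t nsecs)
{
	to->abstime = ts_add_nsec(now, nsecs);
}

/* A timeout is due only once the clock has reached its expiry. */
int
mini_timeout_expired(const struct mini_timeout *to, struct timespec now)
{
	if (now.tv_sec != to->abstime.tv_sec)
		return (now.tv_sec > to->abstime.tv_sec);
	return (now.tv_nsec >= to->abstime.tv_nsec);
}
```

Because the comparison is made against the clock itself rather than a count of ticks, the sketch never reports a timeout as due before its expiry, whatever the tick rate. The two-field comparison also hints at why timespec processing costs more than 32-bit tick arithmetic.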
With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@,
deraadt@, probably many others. Older version tested by visa@.
Problems found in older version by bluhm@. Current version tested by
Yuichiro Naito.
"wait until after unlock" deraadt@, ok kettenis@
|
|
|
|
|
|
|
|
| |
If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".
Based on input from millert@ and kettenis@.
|
|
|
|
|
|
|
|
|
|
|
|
| |
miod@ removed several time-related globals from the kernel with the
commit "unifdef -d __HAVE_TIMECOUNTER" (see sys/kern/kern_clock.c v1.76).
He neglected to remove their externs from sys/kernel.h, though.
Remove the externs.
With help from jsg@.
ok jsg@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.
The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.
While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.
ok deraadt@
|
|
|
|
|
|
|
|
|
|
| |
The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.
Extracted from a previous diff from visa@.
ok visa@, anton@
|
| |
|
|
|
|
|
|
|
| |
LIST_END -> SMR_LIST_END
TAILQ_END -> SMR_TAILQ_END
ok visa@
|
|
|
|
| |
in dev/kcov.c; therefore move it to dev/kcov.c.
|
|
|
|
| |
ok claudio@
|
|
|
|
| |
ok claudio@
|
| |
|
|
|
|
| |
struct sigacts since that is the only thing that is modified by siginit.
|
|
|
|
| |
ok claudio@, pirofti@
|
| |
|
| |
|
|
|
|
| |
OK deraadt@, mpi@
|
|
|
|
| |
Prompted by mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.
Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.
sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.
Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.
OK mpi
|
|
|
|
| |
ok mvs@, visa@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.
Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.
Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).
OK gnezdo
|
|
|
|
| |
OK mvs@
|
|
|
|
|
|
| |
Design by deraadt@
ok deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ITIMER_REAL itimerspec (ps_timer[0]) and timeout (ps_realit_to)
are protected by the kernel lock. Annotate them with "K".
The ITIMER_VIRTUAL and ITIMER_PROF itimerspecs (ps_timer[1],
ps_timer[2]) are protected by itimer_mtx. Annotate them with "T",
for "timer".
With input from kettenis@ and anton@.
ok kettenis@, anton@
|
|
|
|
| |
Reminded by, input & OK jca
|
|
|
|
|
|
|
|
|
|
| |
These two interfaces have been entirely unused since introduction.
Remove them and thin the "timeout" namespace a bit.
Discussed with mpi@ and ratchov@ almost a year ago, though I blocked
the change at that time. Also discussed with visa@.
ok visa@, mpi@
|
|
|
|
|
|
|
| |
Range violations are now consistently reported as EOPNOTSUPP.
Previously they were mixed with ENOPROTOOPT.
OK kn@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
from threads other than the one currently having kcov enabled. A thread
with kcov enabled occasionally delegates work to another thread,
collecting coverage from such threads improves the ability of syzkaller
to correlate side effects in the kernel caused by issuing a syscall.
Remote coverage is divided into subsystems. The only supported subsystem
right now collects coverage from scheduled tasks and timeouts on behalf
of a kcov enabled thread. In order to make this work `struct task' and
`struct timeout' must be extended with a new field keeping track of the
process that scheduled the task/timeout. Both aforementioned structures
have therefore increased with the size of a pointer on all
architectures.
The kernel API is documented in a new kcov_remote_register(9) manual.
Remote coverage is also supported by kcov on NetBSD and Linux.
ok mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The TIMEOUT_SCHEDULED flag was added a few months ago to differentiate
between wheel timeouts and new timeouts during softclock(). The
distinction is useful when incrementing the "rescheduled" stat and the
"late" stat.
Now that we have an intermediate queue for new timeouts, timeout_new,
we don't need the flag. The distinction between wheel timeouts and
new timeouts can be made computationally.
Suggested by procter@ several months ago.
|
|
|
|
|
|
|
| |
MD versions of these headers were unhooked. As nothing has hit those
checks we can drop them at this point.
ok visa@ and "makes sense" to millert@
|
|
|
|
|
|
|
|
| |
used by the processor chip. Although we have a SENSOR_WATTHOUR sensor
type its units are not really suitable for this sensor. So add a
SENSOR_ENERGY type that uses micro Joules as its unit.
ok deraadt@
|
|
|
|
|
|
|
|
|
| |
VERASE would perform (sometimes irrelevant) compute in the kernel which
can be heavy (especially with our insufficient tty subsystem locking). Use
tsleep_nsec for 1 tick in such circumstances to yield cpu, and also bring
interruptibility to ptcwrite().
https://syzkaller.appspot.com/bug?extid=462539bc18fef8fc26cc
ok kettenis millert, discussions with greg and anton
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This diff exposes parts of clock_gettime(2) and gettimeofday(2) to
userland via libc, liberating processes from the need for a context
switch every time they want to count the passage of time.
If a timecounter clock can be exposed to userland then it needs to set
its tc_user member to a non-zero value. Tested with one or multiple
counters per architecture.
The timing data is shared through a pointer found in the new ELF
auxiliary vector AUX_openbsd_timekeep containing timehands information
that is frequently updated by the kernel.
Timing differences between the last kernel update and the current time
are adjusted in userland by the tc_get_timecount() function inside the
MD usertc.c file.
This permits a much more responsive environment, quite visible in
browsers, office programs and gaming (apparently one is able to fly
in Minecraft now).
Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others!
OK from at least kettenis@, cheloha@, naddy@, sthen@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
a kstat is an arbitrary chunk of data that a part of the kernel
wants to expose to userland. data could mean just a chunk of raw
bytes, but generally a kernel subsystem will provide a series of
kstat key/value chunks.
this code is loosely modelled on kstat in solaris, but with a bunch
of simplifications (we don't want to provide write support for
example). the named or key/value structure is significantly richer
in this version too. eg, solaris kstat named data supports integer
types, but this version offers differentiation between counters
(like the number of packets transmitted on an interface) and gauges
(like how long the transmit queue is) and lets kernel providers say
what the units are (eg, packets vs bytes vs cycles).
the main motivation for this is to improve the visibility of what
the kernel is doing while it's running. i wrote this as part of the
recent work we've been doing on multiqueue and rss/toeplitz so i
could verify that network load is actually spread across multiple
rings on a single nic. without this we would be wasting memory and
interrupt vectors on multiple rings and still just using the 1st
one, and no one would know cos there's no way to see what rings are
being used.
another thing that can become visible is the different counters
that various network cards provide. i'm particularly interested in
seeing if packets get dropped because the rings aren't filled fully,
which is an effect we've never really observed directly.
a small part of wanting this is cos i spend an annoying amount of
time instrumenting the kernel when hacking code in it. if most of
the scaffolding for the instrumentation is already there, i can
avoid repeatedly writing that code and save time.
iterated a few times with claudio@ and deraadt@
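the counter/gauge/unit distinction described above can be sketched as a small key/value record. this is an illustration only: `struct mini_kv`, its enums and `mini_kv_report()` are hypothetical names made up for the sketch, not the kstat API added by this commit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical kinds: counters only grow, gauges move up and down. */
enum mini_kv_kind { MINI_KV_COUNTER, MINI_KV_GAUGE };

/* Hypothetical units a provider can attach to a value. */
enum mini_kv_unit {
	MINI_KV_U_NONE,
	MINI_KV_U_PACKETS,
	MINI_KV_U_BYTES,
	MINI_KV_U_CYCLES
};

struct mini_kv {
	char			name[16];
	enum mini_kv_kind	kind;
	enum mini_kv_unit	unit;
	uint64_t		value;
};

/* Fill in a key/value entry; the name is truncated to fit. */
void
mini_kv_init(struct mini_kv *kv, const char *name, enum mini_kv_kind kind,
    enum mini_kv_unit unit)
{
	memset(kv, 0, sizeof(*kv));
	strncpy(kv->name, name, sizeof(kv->name) - 1);
	kv->kind = kind;
	kv->unit = unit;
}

/*
 * What a consumer would display between two snapshots: the change
 * for a counter, the latest reading for a gauge.
 */
uint64_t
mini_kv_report(const struct mini_kv *prev, const struct mini_kv *cur)
{
	if (cur->kind == MINI_KV_COUNTER)
		return (cur->value - prev->value);
	return (cur->value);
}
```

knowing the kind lets a generic tool decide how to present a value without understanding the subsystem that exported it.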
|
|
|
|
|
|
|
|
|
|
| |
capital letters in locking annotations. Therefore harmonize the existing
annotations.
Also, if multiple locks are required they should be delimited using
commas.
ok mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
serializing calls to pipe_buffer_free(). Repeating the previous commit
message:
Instead of performing three distinct allocations per created pipe,
reduce it to a single one. Not only should this be more performant, it
also solves a kqueue related issue found by visa@ who also requested
this change: if you attach an EVFILT_WRITE filter to a pipe fd, the
knote gets added to the peer's klist. This is a problem for kqueue
because if you close the peer's fd, the knote is left in the list whose
head is about to be freed. knote_fdclose() is not able to clear the
knote because it is not registered with the peer's fd.
FreeBSD also takes a similar approach to pipe allocations.
once again ok mpi@ visa@
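The single-allocation idea can be sketched in userland C. This is a rough illustration under assumptions: `struct mini_pipe`, `struct mini_pipe_pair`, `mini_pipe_create()` and `mini_pipe_close()` are hypothetical names, and the real kernel structures carry far more state, including locking.

```c
#include <stddef.h>
#include <stdlib.h>

struct mini_pipe_pair;

/* Hypothetical endpoint; each end knows its peer and its container. */
struct mini_pipe {
	struct mini_pipe	*peer;
	struct mini_pipe_pair	*pair;
	int			 open;
};

/* Both endpoints live in one allocation. */
struct mini_pipe_pair {
	struct mini_pipe	rd;
	struct mini_pipe	wr;
};

/* One allocation creates both ends; returns NULL on failure. */
struct mini_pipe_pair *
mini_pipe_create(void)
{
	struct mini_pipe_pair *pp;

	pp = calloc(1, sizeof(*pp));
	if (pp == NULL)
		return (NULL);
	pp->rd.peer = &pp->wr;
	pp->wr.peer = &pp->rd;
	pp->rd.pair = pp->wr.pair = pp;
	pp->rd.open = pp->wr.open = 1;
	return (pp);
}

/*
 * Close one end. The shared memory is released only when both ends
 * are closed, so data hanging off the peer stays valid until then.
 * Returns 1 if the pair was freed, 0 otherwise.
 */
int
mini_pipe_close(struct mini_pipe *p)
{
	struct mini_pipe_pair *pp = p->pair;

	p->open = 0;
	if (pp->rd.open || pp->wr.open)
		return (0);
	free(pp);
	return (1);
}
```

With this layout, closing one end cannot free memory that a list attached to the other end still points into, which is the shape of the kqueue problem described above.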
|
|
|
|
|
| |
to keep the behavior when switching poll(2) to use kqueue filters.
From mpi@
|
|
|
|
|
|
| |
function but actually a 'true' value is needed; use seltrue instead.
Problem reported, kernel bisected and diff tested by Jens A. Griepentrog.
ok deraadt@ mpi@
|
|
|
|
|
|
| |
these days, so inventing our own numbers is fine.
From drahn@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time_second(9) has been replaced in the kernel by gettime(9).
time_uptime(9) has been replaced in the kernel by getuptime(9).
New code should use the replacement interfaces. They do not suffer
from the split-read problem inherent to the time_* variables on 32-bit
platforms.
The variables remain in sys/kern/kern_tc.c for use via kvm(3) when
examining kernel core dumps.
This commit completes the deprecation process:
- Remove the extern'd definitions for time_second and time_uptime
from sys/time.h.
- Replace manpage cross-references to time_second(9)/time_uptime(9)
with references to microtime(9) or a related interface.
- Move the time_second.9 manpage to the attic.
With input from dlg@, kettenis@, visa@, and tedu@.
ok kettenis@
|
|
|
|
| |
discussed with cheloha@
|
|
|
|
|
|
|
| |
it means we can do quick hacks to existing drivers to test interrupts
on multiple cpus. emphasis on quick and hacks.
ok jmatthew@, who will also ok the removal of it at the right time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time_second and time_uptime are used widely in the tree. This is a
problem on 32-bit platforms because time_t is 64-bit, so there is a
potential split-read whenever they are used at or below IPL_CLOCK.
Here are two replacement interfaces: gettime(9) and getuptime(9).
The "get" prefix signifies that they do not read the hardware
timecounter, i.e. they are fast and low-res. The lack of a unit
(e.g. micro, nano) signifies that they yield a plain time_t.
As an optimization on LP64 platforms we can just return time_second or
time_uptime, as a single read is atomic. On 32-bit platforms we need
to do the lockless read loop and get the values from the timecounter.
In a subsequent diff these will be substituted for time_second and
time_uptime almost everywhere in the kernel.
With input from visa@ and dlg@.
ok kettenis@
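The lockless read loop needed on 32-bit platforms can be sketched with C11 atomics. This is an illustration under assumptions, not the kernel implementation: `struct mini_timehands`, `mini_init()`, `mini_settime()` and `mini_gettime()` are hypothetical names, a single writer is assumed, and the "odd generation means update in progress" convention is a choice made for this sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the timecounter state; not the real layout. */
struct mini_timehands {
	atomic_uint		gen;	/* odd while an update is in progress */
	_Atomic uint32_t	sec_lo;	/* low half of the 64-bit time */
	_Atomic uint32_t	sec_hi;	/* high half of the 64-bit time */
};

void
mini_init(struct mini_timehands *th)
{
	atomic_init(&th->gen, 0);
	atomic_init(&th->sec_lo, 0);
	atomic_init(&th->sec_hi, 0);
}

/* Single writer: bump the generation around the two-word update. */
void
mini_settime(struct mini_timehands *th, int64_t secs)
{
	unsigned int g = atomic_load_explicit(&th->gen, memory_order_relaxed);

	atomic_store_explicit(&th->gen, g + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->sec_lo, (uint32_t)secs,
	    memory_order_relaxed);
	atomic_store_explicit(&th->sec_hi, (uint32_t)((uint64_t)secs >> 32),
	    memory_order_relaxed);
	atomic_store_explicit(&th->gen, g + 2, memory_order_release);
}

/* Reader: retry until both halves come from the same generation. */
int64_t
mini_gettime(struct mini_timehands *th)
{
	unsigned int g0, g1;
	uint32_t lo, hi;

	do {
		g0 = atomic_load_explicit(&th->gen, memory_order_acquire);
		lo = atomic_load_explicit(&th->sec_lo, memory_order_relaxed);
		hi = atomic_load_explicit(&th->sec_hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		g1 = atomic_load_explicit(&th->gen, memory_order_relaxed);
	} while ((g0 & 1) != 0 || g0 != g1);

	return ((int64_t)(((uint64_t)hi << 32) | lo));
}
```

A reader that races with an update sees either an odd generation or a changed one and simply tries again, so it never returns a value stitched together from two different updates. Where a single aligned 64-bit read is atomic, as on LP64, the loop is unnecessary.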
|
|
|
|
|
|
|
|
|
|
| |
This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.
The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.
ok millert@ on a previous version, ok visa@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
reading vpd stuff is useful when you're trying to get support
information about a pci device, eg, if you want a serial number,
or firmware versions, or specific part name or number, it's likely
available via vpd. also, i'm sick of having the diff in my tree.
the vpd info is not accessed as bytes read from a capability, but
is read via a register in the capability. the same register also
supports updating or writing vpd info, which sounds like a bad idea
to let userland have raw access to.
this adds an ioctl so that userland can ask the kernel to read via
the vpd register on its behalf. this ensures that the only access
is read access, and it's sanity checked.
tested by hrvoje popovski on many devices.
ok jmatthew@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
gif used its mbuf tag to store its interface index so it could
detect loops. gre also did this, and i cut most of the drivers
(including gif) over to using the gre tag. so the gif tag is unused.
wireguard uses the tag to store peer information between different
contexts the packet is processed in. it also needs a bit more space
to do that.
from Matt Dunwoodie and Jason A. Donenfeld
ok deraadt@
|