Commit messages

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
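
A minimal usage sketch of the two flavours; "sel" and "kn" are
illustrative names, not taken from a particular caller:

/* The _locked variants assume the caller already holds the klist's
 * lock, which today is still the kernel lock for all klists. */
KERNEL_LOCK();
klist_insert_locked(&sel->si_note, kn);
KERNEL_UNLOCK();

/* The unsuffixed variant takes and releases the lock internally. */
klist_insert(&sel->si_note, kn);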
Make the SMR thread maintain an explicit system-wide grace period and
make CPUs observe the current grace period when crossing a quiescent
state. This lets the SMR thread avoid a forced context switch for CPUs
that have already entered the latest grace period.
This slightly improves smr_grace_wait()'s performance by reducing the
number of forced context switches.
OK mpi@, anton@
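
A conceptual sketch of the bookkeeping described above, not the actual
sys/kern/kern_smr.c code; the per-CPU field name is invented for
illustration:

unsigned long smr_grace_period;	/* system-wide, advanced by the SMR thread */

/* Called when a CPU crosses a quiescent state. */
void
smr_observe(struct cpu_info *ci)
{
	ci->ci_smr_observed = READ_ONCE(smr_grace_period);
}

/* The SMR thread only forces a context switch on CPUs that have not
 * yet observed the latest grace period. */
int
smr_cpu_needs_switch(struct cpu_info *ci)
{
	return (READ_ONCE(ci->ci_smr_observed) != smr_grace_period);
}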
It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:
#include <sys/systm.h>
tsleep_nsec(&nowake, ...);
There is now no need to handroll a local dead channel, e.g.
int chan;
tsleep_nsec(&chan, ...);
which needlessly grows the stack frame. Local dead channels will be replaced with
&nowake in later patches.
One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if it is one at all.
NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.
Discussed with mpi@, claudio@, kettenis@, and deraadt@.
Basically designed by kettenis@, who vetoed my other proposals.
Bugs caught by deraadt@, tb@, and patrick@.
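
A complete, hedged example of the intended use; the priority, wmesg and
duration below are arbitrary:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/time.h>

/* Sleep for one second; no wakeup(9) is ever expected on &nowake. */
tsleep_nsec(&nowake, PWAIT, "nowake", SEC_TO_NSEC(1));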
Make it obvious where the thread is blocked. "pause" is ambiguous.
Tweaked by kettenis@.
Probably ok kettenis@.
We only see 8 characters of wmesg in e.g. top(1), so shorten the
string to fit.
Indirectly prompted by kettenis@.
Invoke dead_filtops' f_event callback in klist_invalidate() to ensure
that filt_dead() modifies every invalidated knote. If a knote has
EV_ONESHOT set in its event flags, kqueue_scan() will not call f_event.
OK mpi@
This fixes a regression where kqueue_scan() may incorrectly return
EWOULDBLOCK after a timeout.
OK mpi@
The given set of fds is converted to equivalent kevents using EV_SET(2)
and passed to the scanning internals of kevent(2): kqueue_scan().
ktrace(1) will now output the converted kevents on top of the usual set
of bits, to make it possible to spot errors in the conversion.
This switch implies that select(2) and pselect(2) will now query the
underlying kqfilters instead of the *_poll() routines.
Based on similar work done on DragonFlyBSD with input from visa@,
millert@, anton@, cheloha@, thanks!
ok visa@
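
Roughly how one readable descriptor from select(2) maps onto a kevent.
The exact flag combination used internally, including the kernel-only
__EV_POLL flag mentioned in a commit above, is an assumption here:

#include <sys/event.h>

struct kevent kev;

/* fd was set in the readfds bitmask handed to select(2). */
EV_SET(&kev, fd, EVFILT_READ, EV_ADD | EV_ENABLE | __EV_POLL, 0, 0, NULL);
/* kev is then passed, with the other converted fds, to kqueue_scan(). */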
This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().
Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.
There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.
The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.
For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.
Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.
OK mpi@
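
A sketch of the ready-made mutex interface mentioned above; the object
names are illustrative and the initializer is assumed to be called
klist_init_mutex():

struct mutex	obj_mtx = MUTEX_INITIALIZER(IPL_MPFLOOR);
struct klist	obj_note;

/* Associate the klist with the object's mutex. */
klist_init_mutex(&obj_note, &obj_mtx);

/* Later, from code that already holds obj_mtx (e.g. the f_event path): */
mtx_enter(&obj_mtx);
klist_insert_locked(&obj_note, kn);
mtx_leave(&obj_mtx);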
When the file descriptor of an __EV_POLL-flagged knote is closed,
post EBADF through the kqueue instance to the caller of kqueue_scan().
This lets kqueue-based poll() and select() preserve their current
behaviour of returning EBADF when a polled file descriptor is closed
concurrently.
OK mpi@
OK mpi@
Because kqpoll instances are now linked to the file descriptor table,
the freeing of kqpoll and ordinary kqueues is similar.
Suggested by mpi@
This lets the system remove kqpoll-related event registrations when
a file descriptor is closed.
OK mpi@
OK cheloha@, mpi@, mvs@
OK dlg@, bluhm@
No Opinion mpi@
Not against it claudio@
By storing the pipe pointer in kn_hook, filt_pipedetach() does not need
extra logic to find the correct pipe instance. This also lets the kernel
clear the knote lists fully.
OK anton@, mpi@
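
A hedged sketch of the detach side after this change; locking is elided
and the snippet is illustrative rather than the committed code:

void
filt_pipedetach(struct knote *kn)
{
	/* kn_hook now points straight at the pipe that owns the knote. */
	struct pipe *cpipe = kn->kn_hook;

	klist_remove(&cpipe->pipe_sel.si_note, kn);
}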
Removed some trailing whitespace while there.
ok gkoehler@
This will soon be used by select(2) and poll(2).
ok anton@, visa@
Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.
From and ok claudio@
Stop iterating in the function and instead copy the returned events to
userland after every call.
ok visa@
srp_finalize(9) spins until the refcount hits zero. Blocking for at
least 1ms each iteration instead of blocking for at most 1 tick is
sufficient.
Discussed with mpi@.
ok claudio@ jmatthew@
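
The wait loop then looks roughly like the sketch below; srp_referenced()
stands in for whatever internal check kern_srp.c really uses and is an
assumption here:

void
srp_finalize(void *v, const char *wmesg)
{
	/* Poll about once per millisecond until no CPU still references v. */
	while (srp_referenced(v))
		tsleep_nsec(v, PWAIT, wmesg, MSEC_TO_NSEC(1));
}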
ok gkoehler@
Make sure `ps_single' is set only once by checking then updating it without
releasing the lock.
Analyzed by and ok claudio@
Panic reported by dhill@
Make sure `ps_single' is set only once by checking then updating it without
releasing the lock.
Analyzed by and ok claudio@
printf(9) already lacked documentation and needs no change.
It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.
This is required to implement select(2) and poll(2) on top of kqueue_scan().
Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.
ok visa@, anton@
Performed a minor refactoring and removed some trailing whitespace.
ok anton@
mbuf is encountered in a seqpacket socket.
This diff uses the fact that setting orig_resid to 0 causes soreceive()
to return instead of looping back with the intent of sleeping for more data.
orig_resid is now always set to 0 in the control message case (instead of
only if controlp is defined). This is the same behaviour as for the PR_NAME
case. Additionally orig_resid is set to 0 in the data reader when MSG_PEEK
is used.
Tested in snaps for a while and by anton@
Reported-by: syzbot+4b0e9698b344b0028b14@syzkaller.appspotmail.com
Used sysctl_int_bounded in many places to shrink code.
Extracted a new function to make the case tidy.
Removed some superfluous fluff.
OK millert@
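
A sketch of what such a conversion looks like, assuming the signature
sysctl_int_bounded(oldp, oldlenp, newp, newlen, valp, min, max); the mib
and variable names are hypothetical:

case KERN_SOMELIMIT:	/* hypothetical mib */
	return (sysctl_int_bounded(oldp, oldlenp, newp, newlen,
	    &somelimit, 0, INT_MAX));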
Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.
It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.
ok kettenis@ visa@
This one is, surprisingly, a minor loss if one were to simply add up the bytes
on amd64:
.text+.data+.bss+.rodata
before 0x64b0+0x40+0x14+0x338 = 0x683c
after 0x6440+0x48+0x14+0x3b8 = 0x6854
objdump -h changes in Size of kern_sysctl.o on amd64
before after
.text 7140 64b0
.data 24 40
.bss 10 14
.rodata 50 338
ok visa@
To unlock getitimer(2) and setitimer(2) we need to protect the
per-process ITIMER_REAL state with something other than the kernel
lock. As the ITIMER_REAL timeout callback realitexpire() runs at
IPL_SOFTCLOCK, the per-process mutex ps_mtx is appropriate.
In setitimer() we need to use ps_mtx instead of the global itimer_mtx
if the given timer is ITIMER_REAL. Easy.
The ITIMER_REAL timeout callback routine realitexpire() is trickier.
When we enter ps_mtx during the callback we need to check if the timer
was cancelled or rescheduled. A thread from the process can call
setitimer(2) at the exact moment the callback is about to run from
timeout_run() (see kern_timeout.c).
Update the locking annotation in sys/proc.h accordingly.
ok anton@
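
The re-check in the callback then looks roughly like this sketch; the
exact condition tested under ps_mtx is an assumption:

void
realitexpire(void *arg)
{
	struct process *pr = arg;

	mtx_enter(&pr->ps_mtx);
	/* setitimer(2) may have cancelled or rescheduled the timer between
	 * timeout_run() deciding to run us and this point. */
	if (!timespecisset(&pr->ps_timer[ITIMER_REAL].it_value)) {
		mtx_leave(&pr->ps_mtx);
		return;
	}
	/* ... deliver SIGALRM and rearm the timeout as needed ... */
	mtx_leave(&pr->ps_mtx);
}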
Simplify MD code and reduce the amount of recursion into the signal code,
which helps when dealing with locks.
ok cheloha@, deraadt@
Requires sysctl_bounded_arr branch to support sysctl_rdint.
The read-only variables are marked by an empty range of [1, 0].
OK millert@
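
For illustration, two hypothetical entries in such a bounded array; the
struct layout (mib, variable, low bound, high bound) is assumed:

const struct sysctl_bounded_args vars[] = {
	{ KERN_SOMELIMIT, &somelimit, 1, INT_MAX },	/* writable within [1, INT_MAX] */
	{ KERN_SOMECOUNT, &somecount, 1, 0 },		/* empty range [1, 0]: read-only */
};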
Based on previous work and an idea from deraadt@
Input from claudio@, djm@, deraadt@, sthen@
OK deraadt@
As deraadt@ has pointed out, tracing timevals and timespecs before
validating them makes debugging easier.
This introduces a system-wide mutex that serializes msgbuf operations.
The mutex controls access to all modifiable fields of struct msgbuf.
It also covers logsoftc.sc_state.
To avoid adding extra lock order constraints that would affect use of
printf(9), the code does not take new locks when the log mutex is held.
The code assumes that there is at most one thread using logread(). This
keeps the logic simple. If there was more than one reader, logread()
might return the same data to different readers. Also, log wakeup might
not be reliable with multiple threads.
Tested in snaps for two weeks.
OK mpi@
Reimplement the ITIMER_REAL interval timer with a kclock timeout.
A couple of things of note:
- We need to use the high-res nanouptime(9) call, not the low-res
getnanouptime(9).
- The code is simpler now that we aren't working with ticks.
Misc. thoughts:
- Still unsure if "kclock" is the right name for these things.
- MP-safely cancelling a periodic timeout is very difficult.
Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.
ok visa@
The underlying vm_space lock is used as a substitute for the KERNEL_LOCK()
in uvm_grow() to make sure `vm_ssize' is not corrupted.
ok anton@, kettenis@
A kclock timeout is a timeout that expires at an absolute time on one
of the kernel's clocks. A timeout's absolute expiration time is kept
in a new member of the timeout struct, to_abstime. The timeout's
kclock is set at initialization and is kept in another new member of
the timeout struct, to_kclock.
Kclock timeouts are desirable because they have nanosecond
resolution, regardless of the value of hz(9). The timecounter
subsystem is also inherently NTP-sensitive, so timeouts scheduled
against the subsystem are NTP-sensitive. These two qualities
guarantee that a kclock timeout will never expire early.
Currently there is support for one kclock, KCLOCK_UPTIME (the uptime
clock). Support for KCLOCK_RUNTIME (the runtime clock) and KCLOCK_UTC
(the UTC clock) is planned for the future.
Support for these additional kclocks will allow us to implement some
of the POSIX interfaces OpenBSD is missing, e.g. clock_nanosleep() and
timer_create(). We could also use it to provide proper absolute
timeouts for e.g. pthread_mutex_timedlock(3).
Kclock timeouts are initialized with timeout_set_kclock(). They can
be scheduled with either timeout_in_nsec() (relative timeout) or
timeout_at_ts() (absolute timeout). They are incompatible with
timeout_add(9), timeout_add_sec(9), timeout_add_msec(9),
timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9).
They can be cancelled with timeout_del(9) or timeout_del_barrier(9).
Documentation for the new interfaces is a work in progress.
For now, tick-based timeouts remain supported alongside kclock
timeouts. They will remain supported until we are certain we don't
need them anymore. It is possible we will never remove them. I would
rather not keep them around forever, but I cannot predict what
difficulties we will encounter while converting tick-based timeouts to
kclock timeouts. There are a *lot* of timeouts in the kernel.
Kclock timeouts are more costly than tick-based timeouts:
- Calling timeout_in_nsec() incurs a call to nanouptime(9). Reading
the hardware timecounter is too expensive in some contexts, so care
must be taken when converting existing timeouts.
We may add a flag in the future to cause timeout_in_nsec() to use
getnanouptime(9) instead of nanouptime(9), which is much cheaper.
This may be appropriate for certain classes of timeouts. tcp/ip
session timeouts come to mind.
- Kclock timeout expirations are kept in a timespec. Timespec
arithmetic has more overhead than 32-bit tick arithmetic, so
processing kclock timeouts during softclock() is more expensive.
On my machine the overhead for processing a tick-based timeout is
~125 cycles. The overhead for a kclock timeout is ~500 cycles.
The overhead difference on 32-bit platforms is unknown. If it
proves too large we may need to use a 64-bit value to store the
expiration time. More measurement is needed.
Priority targets for conversion are setitimer(2), *sleep_nsec(9), and
the kevent(2) EVFILT_TIMER timers. Others will follow.
With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@,
deraadt@, probably many others. Older version tested by visa@.
Problems found in older version by bluhm@. Current version tested by
Yuichiro Naito.
"wait until after unlock" deraadt@, ok kettenis@
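A hedged usage sketch of the kclock timeout interfaces described in the
commit above; the argument order of timeout_set_kclock() and the zero
flags value are assumptions, and my_callback/my_softc are placeholders:

struct timeout my_to;

timeout_set_kclock(&my_to, my_callback, my_softc, 0, KCLOCK_UPTIME);

/* Fire roughly 250ms from now, with nanosecond resolution... */
timeout_in_nsec(&my_to, MSEC_TO_NSEC(250));

/* ...and cancel it like any other timeout. */
timeout_del(&my_to);
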
If we fold the for-loop iterating over each interval timer into the
helper function, the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".
Based on input from millert@ and kettenis@.
This creates too many false positives when setting pool_debug=2.
Prodded by deraadt@, ok mvs@
During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.
The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.
While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.
ok deraadt@
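
A sketch of the helper under the assumption stated above, namely that it
simply wraps the in-kernel setitimer() with a zeroed itimerval:

void
cancelitimer(int which)
{
	struct itimerval itv;

	timerclear(&itv.it_value);
	timerclear(&itv.it_interval);
	setitimer(which, &itv, NULL);
}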
ok millert
If itv.it_value is zero we cancel the timer. When we cancel the timer
we don't care about itv.it_interval because the timer is not running:
we don't use it, we don't look at it, etc.
To be on the paranoid side, I think we should zero itv.it_interval
when itv.it_value is zero. No need to write arbitrary values into the
process struct if we aren't required to. The standard is ambiguous
about what should happen in this case, i.e. the value of olditv after
the following code executes is unspecified:
struct itimerval newitv, olditv;
newitv.it_value.tv_sec = newitv.it_value.tv_usec = 0;
newitv.it_interval.tv_sec = newitv.it_interval.tv_usec = 1;
setitimer(ITIMER_REAL, &newitv, NULL);
getitimer(ITIMER_REAL, &olditv);
This change should not break any real code.