| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
| |
The dummy counter should be deterministic with respect to interrupts
and multiple threads of execution.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
serializing calls to pipe_buffer_free(). Repeating the previous commit
message:
Instead of performing three distinct allocations per created pipe,
reduce it to a single one. Not only should this be more performant, it
also solves a kqueue related issue found by visa@ who also requested
this change: if you attach an EVFILT_WRITE filter to a pipe fd, the
knote gets added to the peer's klist. This is a problem for kqueue
because if you close the peer's fd, the knote is left in the list whose
head is about to be freed. knote_fdclose() is not able to clear the
knote because it is not registered with the peer's fd.
FreeBSD also takes a similar approach to pipe allocations.
once again ok mpi@ visa@
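To make the shape of the change concrete, here is a minimal sketch of folding the per-pipe allocations into one; the struct, member, and pool names are assumptions for illustration, not necessarily the ones used in the tree.

/*
 * Sketch only: a single allocation holds both pipe endpoints and the
 * lock they share.  Names (pipe_pair, pp_*, pipe_pair_pool) are
 * illustrative assumptions.
 */
struct pipe_pair {
	struct pipe	pp_wpipe;	/* write end */
	struct pipe	pp_rpipe;	/* read end */
	struct rwlock	pp_lock;	/* serializes both endpoints */
};

struct pool pipe_pair_pool;		/* assumed initialized at boot */

struct pipe_pair *
pipe_pair_create(void)
{
	struct pipe_pair *pp;

	pp = pool_get(&pipe_pair_pool, PR_WAITOK | PR_ZERO);
	rw_init(&pp->pp_lock, "pipelk");
	/* freeing the pair later frees both ends and the lock at once */
	return (pp);
}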
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time_second(9) has been replaced in the kernel by gettime(9).
time_uptime(9) has been replaced in the kernel by getuptime(9).
New code should use the replacement interfaces. They do not suffer
from the split-read problem inherent to the time_* variables on 32-bit
platforms.
The variables remain in sys/kern/kern_tc.c for use via kvm(3) when
examining kernel core dumps.
This commit completes the deprecation process:
- Remove the extern'd definitions for time_second and time_uptime
from sys/time.h.
- Replace manpage cross-references to time_second(9)/time_uptime(9)
with references to microtime(9) or a related interface.
- Move the time_second.9 manpage to the attic.
With input from dlg@, kettenis@, visa@, and tedu@.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time_second(9) and time_uptime(9) are widely used in the kernel to
quickly get the system UTC or system uptime as a time_t. However,
time_t is 64-bit everywhere, so it is not generally safe to use them
on 32-bit platforms: you have a split-read problem if your hardware
cannot perform atomic 64-bit reads.
This patch replaces time_second(9) with gettime(9), a safer successor
interface, throughout the kernel. Similarly, time_uptime(9) is replaced
with getuptime(9).
There is a performance cost on 32-bit platforms in exchange for
eliminating the split-read problem: instead of two register reads you
now have a lockless read loop to pull the values from the timehands.
This is really not *too* bad in the grand scheme of things, but
compared to what we were doing before it is several times slower.
There is no performance cost on 64-bit (__LP64__) platforms.
With input from visa@, dlg@, and tedu@.
Several bugs squashed by visa@.
ok kettenis@
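As a hedged illustration of what the substitution looks like at a typical call site (the softc, interval, and refresh helper here are made up for the example):

/* Hypothetical driver helper, purely for illustration. */
void
my_maybe_refresh(struct my_softc *sc)
{
	/*
	 * Before: a bare read of the global, subject to a split read
	 * on 32-bit platforms:
	 *	if (time_uptime - sc->sc_last_refresh > MY_REFRESH_INTERVAL)
	 */
	if (getuptime() - sc->sc_last_refresh > MY_REFRESH_INTERVAL) {
		my_refresh(sc);
		sc->sc_last_refresh = getuptime();
	}
}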
|
|
|
|
|
|
|
| |
it means we can do quick hacks to existing drivers to test interrupts
on multiple cpus. emphasis on quick and hacks.
ok jmatthew@, who will also ok the removal of it at the right time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
time_second and time_uptime are used widely in the tree. This is a
problem on 32-bit platforms because time_t is 64-bit, so there is a
potential split-read whenever they are used at or below IPL_CLOCK.
Here are two replacement interfaces: gettime(9) and getuptime(9).
The "get" prefix signifies that they do not read the hardware
timecounter, i.e. they are fast and low-res. The lack of a unit
(e.g. micro, nano) signifies that they yield a plain time_t.
As an optimization on LP64 platforms we can just return time_second or
time_uptime, as a single read is atomic. On 32-bit platforms we need
to do the lockless read loop and get the values from the timecounter.
In a subsequent diff these will be substituted for time_second and
time_uptime almost everywhere in the kernel.
With input from visa@ and dlg@.
ok kettenis@
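A rough sketch of the approach; the timehands member names below are assumptions and may not match the kernel exactly.

time_t
getuptime(void)
{
#if defined(__LP64__)
	return time_uptime;		/* a single 64-bit read is atomic */
#else
	struct timehands *th;
	time_t sec;
	u_int gen;

	/* lockless read loop: retry if the timehands changed underneath us */
	do {
		th = timehands;
		gen = th->th_generation;
		membar_consumer();
		sec = th->th_offset.sec;	/* illustrative member name */
		membar_consumer();
	} while (gen == 0 || gen != th->th_generation);

	return sec;
#endif
}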
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't want resettodr(9) to write the RTC until inittodr(9) has
actually run. Until inittodr(9) calls tc_setclock() the system UTC
clock will contain a meaningless value and there's no sense in
overwriting a good value with a value we know is nonsense.
This is not uncommon if you're debugging a problem in early
boot, e.g. a panic that occurs prior to inittodr(9).
Currently we use the following logic in resettodr(9) to inhibit writes:
if (time_second == 1)
return;
... this is too magical.
A better way to accomplish the same thing is to introduce a dedicated
flag set from inittodr(9). Hence, "inittodr_done".
Suggested by visa@.
ok kettenis@
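A minimal sketch of the guard, with the function bodies elided:

int inittodr_done;			/* set once inittodr(9) has run */

void
inittodr(time_t base)
{
	/* ... read the RTC and call tc_setclock() ... */
	inittodr_done = 1;
}

void
resettodr(void)
{
	if (!inittodr_done)
		return;			/* clock still holds a nonsense value */
	/* ... write the system UTC clock back to the RTC ... */
}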
|
|
|
|
|
|
|
|
|
|
| |
This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.
The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.
ok millert@ on a previous version, ok visa@
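A hedged userland sketch of registering such a filter on a socket; NOTE_OOB is the out-of-band flag used on the systems mentioned, but check the local kevent(2) before relying on it.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>

/* Sketch: watch sockfd for exceptional conditions such as OOB data. */
int
watch_except(int sockfd)
{
	struct kevent kev;
	int kq;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");
	EV_SET(&kev, sockfd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent");
	return kq;			/* caller waits on it with kevent(2) */
}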
|
| |
|
|
|
|
| |
from Matt Dunwoodie and Jason A. Donenfeld
|
| |
|
|
|
|
|
|
|
|
| |
sockets from different domains, so there is no reason to have locking and memory
allocation in this error path. Also, in this case only `so' will be locked by
solock(), so we should avoid modifying `sosp'.
ok mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
reduce it to a single one. Not only should this be more performant, it
also solves a kqueue related issue found by visa@ who also requested
this change: if you attach an EVFILT_WRITE filter to a pipe fd, the
knote gets added to the peer's klist. This is a problem for kqueue
because if you close the peer's fd, the knote is left in the list whose
head is about to be freed. knote_fdclose() is not able to clear the
knote because it is not registered with the peer's fd.
FreeBSD also takes a similar approach to pipe allocations.
ok mpi@ visa@
|
|
|
|
|
| |
requested by kettenis@
discussed with jmatthew@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
there's been discussions for years (and even some diffs!) about how we
should let drivers establish interrupts on multiple cpus.
the simple approach is to let every driver look at the number of
cpus in a box and just pin an interrupt on it, which is what pretty
much everyone else started with, but we have never seemed to get
past bikeshedding about. from what i can tell, the principal
objections to this are:
1. interrupts will tend to land on low numbered cpus.
ie, if drivers try to establish n interrupts on m cpus, they'll
start at cpu 0 and go to cpu n, which means cpu 0 will end up with more
interrupts than cpu m-1.
2. some cpus shouldn't be used for interrupts.
why a cpu should or shouldn't be used for interrupts can be pretty
arbitrary, but in practical terms i'm going to borrow from the
scheduler and say that we shouldn't run work on hyperthreads.
3. making all the drivers make the same decisions about the above is
a lot of maintenance overhead.
either we will have a bunch of inconsistencies, or we'll have a lot
of untested commits to keep everything the same.
my proposed solution to the above is this diff to provide the intrmap
api. drivers that want to establish multiple interrupts ask the api for
a set of cpus it can use, and the api considers the above issues when
generating a set of cpus for the driver to use. drivers then establish
interrupts on cpus with the info provided by the map.
it is based on the if_ringmap api in dragonflybsd, but generalised so it
could be used by something like nvme(4) in the future.
this version provides numeric ids for CPUs to drivers, but as
kettenis@ has been pointing out for a very long time, it makes more
sense to use cpu_info pointers. i'll be updating the code to address
that shortly.
discussed with deraadt@ and jmatthew@
ok claudio@ patrick@ kettenis@
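A hypothetical driver-side sketch of the api as described; the intrmap_* names and signatures, and the driver structures, are assumptions based on the description above rather than a reference.

/*
 * Sketch: a driver asks the intrmap api for a set of cpus and pins one
 * queue interrupt per returned cpu.  Everything named my_* is made up.
 */
int
my_setup_queues(struct my_softc *sc)
{
	struct intrmap *im;
	unsigned int i, nqueues;

	/* request up to MY_MAXQ interrupts; the api skips e.g. hyperthreads */
	im = intrmap_create(&sc->sc_dev, MY_MAXQ, MY_MAXQ, 0);
	if (im == NULL)
		return (ENOMEM);

	nqueues = intrmap_count(im);
	for (i = 0; i < nqueues; i++) {
		/* the map tells us which cpu queue i should interrupt on */
		sc->sc_queues[i].q_cpuid = intrmap_cpu(im, i);
		/* ... establish the interrupt for queue i on that cpu ... */
	}
	return (0);
}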
|
| |
|
|
|
|
| |
ok visa@, millert@
|
|
|
|
|
|
| |
This is only done in poll-compatibility mode, when __EV_POLL is set.
ok visa@, millert@
|
|
|
|
|
|
|
| |
The list can be accessed from interrupt context if a signal is sent
from an interrupt handler.
OK anton@ cheloha@ mpi@
|
|
|
|
| |
subsystem and ps_klist handling still run under the kernel lock.
|
|
|
|
| |
Port breakages reported by naddy@
|
|
|
|
|
|
|
|
|
|
|
|
| |
lock order issue with the file close path.
The FRELE() can trigger the close path during dup*(2) if another thread
manages to close the file descriptor simultaneously. This race is
possible because the file reference is taken before the file descriptor
table is locked for write access.
Vitaliy Makkoveev agrees
OK anton@ mpi@
|
|
|
|
|
|
| |
While here prefix kernel-only EV flags with two underbars.
Suggested by kettenis@, ok visa@
|
|
|
|
|
|
| |
poll handler if the EV_OLDAPI flag is set.
ok visa@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I wrote taskq_barrier with the behaviour described in the manpage:
taskq_barrier() guarantees that any task that was running on the tq taskq
when the barrier was called has finished by the time the barrier returns.
Note that it talks about the currently running task, not pending tasks.
It just so happens that the original implementation just used task_add
to put a condvar on the list and waited for it to run. Because task_add
uses TAILQ_INSERT_TAIL, you ended up waiting for all pending work to
run too, not just the currently running task.
The new implementation took advantage of already holding the lock and
used TAILQ_INSERT_HEAD to put the barrier work at the front of the queue
so it would run next, which is closer to the stated behaviour.
Using the tail insert here restores the previous accidental behaviour.
jsg@ points out the following:
> The linux functions like flush_workqueue() we use this for in drm want
> to wait on all scheduled work not just currently running.
>
> ie a comment from one of the linux functions:
>
> /**
> * flush_workqueue - ensure that any scheduled work has run to completion.
> * @wq: workqueue to flush
> *
> * This function sleeps until all work items which were queued on entry
> * have finished execution, but it is not livelocked by new incoming ones.
> */
>
> our implementation of this in drm is
>
> void
> flush_workqueue(struct workqueue_struct *wq)
> {
> if (cold)
> return;
>
> taskq_barrier((struct taskq *)wq);
> }
I don't think it's worth complicating the taskq API, so I'm just
going to make taskq_barrier wait for pending work too.
tested by tb@
ok jsg@
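For context, a small hedged example of the pattern that depends on the restored behaviour (waiting for pending work, not only the running task); the work function and globals are made up.

struct taskq *example_tq;
struct task example_task;

void
example_work(void *arg)
{
	/* ... deferred work ... */
}

void
example_flush(void *arg)
{
	task_set(&example_task, example_work, arg);
	task_add(example_tq, &example_task);

	/*
	 * With the tail insert restored, this returns only once
	 * example_work() above (and any other pending task) has run.
	 */
	taskq_barrier(example_tq);
}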
|
|
|
|
|
| |
i should have removed the sbartq pointer in r1.47 when i removed
the sbartq.
|
|
|
|
|
|
|
|
|
| |
a lock order problem with altered locking of UNIX domain sockets.
closef() does not need the file descriptor table lock.
From Vitaliy Makkoveev
OK mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
this is required for an upcoming drm update, where the linux workqueue
api that supports this is mapped to our taskq api.
the main way taskqs support that is to have the taskq worker threads
record their curproc on the taskq, so taskq_barrier calls can iterate
over that list looking for their own curproc. if a barrier's curproc
is in the list, it must be running inside the taskq, and should
pretend that it's a barrier task.
this also supports concurrent barrier calls by having the taskq
recognise the situation and have the barriers work together rather
than deadlocking. they end up trying to share the work of getting
the barrier tasks onto the workers. once all the workers (or in-taskq
barriers) have rendezvoused, the barrier calls unwind, and the last
one out lets the other barriers and barrier tasks return.
all this barrier logic is implemented in the barrier code; it takes
the existing multiworker handling out of the actual taskq loop.
thanks to jsg@ for testing this and previous versions of the diff.
ok visa@ kettenis@
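A conceptual sketch of the curproc bookkeeping described above; the structures and names are invented for illustration, and the real barrier code is considerably more involved.

struct tq_worker {
	SLIST_ENTRY(tq_worker)	 w_entry;
	struct proc		*w_proc;	/* recorded when the worker starts */
};
SLIST_HEAD(tq_worker_list, tq_worker);

/* Does the calling thread belong to this taskq's worker list? */
int
tq_caller_is_worker(struct tq_worker_list *workers)
{
	struct tq_worker *w;

	SLIST_FOREACH(w, workers, w_entry)
		if (w->w_proc == curproc)
			return (1);	/* barrier must act as a barrier task */
	return (0);
}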
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mitigation the algorithm was still accounting for the offline CPUs, leading to
a code path that would never be reached.
This should allow better frequency scaling on systems with many CPUs.
The frequency should scale up if one of two conditions is true:
- at least one CPU has less than 25% idle CPU time
- the average idle time across all CPUs is under 33%
The second condition was never met because offline CPUs are always accounted as
100% idle.
A bit more explanation about the auto scaling, in case someone wants to improve
this later: when one condition is met, the CPU frequency is set to maximum and a
counter is set to 5. The function then runs again 100ms later and decrements
the counter if neither condition is met anymore. Once the counter reaches 0,
the frequency is set to minimum. This means that it can take up to 100ms to
scale up and up to 500ms to scale down.
ok brynet@
looks good tedu@
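A sketch of the decision logic described above; the sensor helpers and the setperf() stand-in (my_*) are made up for illustration.

#define HOLD_TICKS	5		/* 5 * 100ms before scaling back down */

void
auto_perf_tick(void)
{
	static int hold;
	int i, idle, min_idle = 100, idle_sum = 0, n = 0;

	for (i = 0; i < ncpus; i++) {
		if (!my_cpu_is_online(i))
			continue;	/* offline CPUs are no longer counted */
		idle = my_cpu_idle_percent(i);
		idle_sum += idle;
		if (idle < min_idle)
			min_idle = idle;
		n++;
	}
	if (n == 0)
		return;

	if (min_idle < 25 || idle_sum / n < 33) {
		my_setperf(100);	/* scale up right away */
		hold = HOLD_TICKS;
	} else if (hold > 0 && --hold == 0)
		my_setperf(0);		/* scale down after ~500ms of calm */
}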
|
|
|
|
|
|
|
| |
These functions will be used to manage per-thread kqueues that are not
associated with a file descriptor.
ok visa@
|
|
|
|
|
|
| |
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which is already included
by all enqueue_randomness() users.
|
| |
|
|
|
|
|
|
|
|
| |
Bring the two syscalls in sync with recent MP changes in the file layer.
Inconsistency pointed out by haesbaert@.
ok anton@, visa@
|
|
|
|
| |
sthen@ has reported that the patch might be causing hangs with X.
|
|
|
|
|
| |
will frantically compensate.
ok kettenis
|
|
|
|
| |
ok millert@, visa@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we resume from a suspend we use the time from the RTC to advance
the system offset. This changes the UTC to match what the RTC has given
us while increasing the system uptime to account for the time we were
suspended.
Currently we decide whether to change to the RTC time in tc_setclock()
by comparing the new offset with the th_offset member. This is wrong.
th_offset is the *minimum* possible value for the offset, not the "real
offset". We need to perform the comparison within tc_windup() after
updating th_offset, otherwise we might rewind said offset.
Because we're now doing the comparison within tc_windup() we ought to
move naptime into the timehands. This means we now need a way to safely
read the naptime to compute the value of CLOCK_UPTIME for userspace.
Enter nanoruntime(9); it increases monotonically from boot but does not
jump forward after a resume like nanouptime(9).
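A conceptual sketch of the relationship, with member names assumed for illustration: runtime is uptime minus the accumulated naptime, read consistently from the timehands.

void
nanoruntime(struct timespec *ts)
{
	struct timehands *th;
	struct timespec naptime;
	u_int gen;

	do {
		th = timehands;
		gen = th->th_generation;
		membar_consumer();
		nanouptime(ts);			/* jumps forward at resume */
		naptime = th->th_naptime;	/* assumed member: total suspend time */
		membar_consumer();
	} while (gen == 0 || gen != th->th_generation);

	timespecsub(ts, &naptime, ts);		/* runtime excludes suspend */
}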
|
|
|
|
|
|
| |
of todr_handle.
OK kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
| |
The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.
OK mpi@
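A hedged sketch of what such a struct could look like; the field names are illustrative, not necessarily the kernel's.

struct kqueue_scan_state {
	struct kqueue	*kqs_kq;	/* kqueue being scanned */
	struct knote	 kqs_start;	/* start marker for this call */
	struct knote	 kqs_end;	/* end marker, persists across calls */
};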
|
|
|
|
|
| |
ok deraadt@, mpi@, visa@
ok cheloha@ as well (would have preferred a new file for this code)
|
|
|
|
|
|
|
|
|
|
|
| |
This prevents exiting processes from hanging when a slave pseudo terminal
is close(2)d before its master.
From NetBSD via anton@.
Reported-by: syzbot+2ed25b5c40d11e4c3beb@syzkaller.appspotmail.com
ok anton@, kettenis@
|
|
|
|
|
|
|
|
| |
clear the NOCACHE flag, since if we are doing a delayed write the buffer
must be cached or it is thrown away when the "write" is done.
fixes vnd on mfs regress tests.
ok kettenis@ deraadt@
|
|
|
|
| |
ok millert@, deraadt@
|
|
|
|
|
|
|
|
|
|
|
| |
FRELE() as the last reference could be dropped, which in turn will cause
soclose() to be called where the socket lock is unconditionally
acquired. Note that this is only a problem for sockets protected by the
non-recursive NET_LOCK() right now.
ok mpi@ visa@
Reported-by: syzbot+7c805a09545d997b924d@syzkaller.appspotmail.com
|
|
|
|
|
| |
them in line with sbappendstream() and sbappendrecord().
Agreed by mpi@
|
|
|
|
|
|
|
| |
Prevent generating events that do not correspond to how the fifo has been
opened.
ok visa@, millert@
|
|
|
|
|
|
| |
for example, with locking assertions.
OK mpi@, anton@
|
|
|
|
|
|
|
|
| |
deep recursion. This also helps make kqueue_wakeup() free of the
kernel lock because the current implementation of selwakeup()
requires the lock.
OK mpi@
|