path: root/sys/kern
Commit message  (author, date, files changed, lines -/+)
...
* timecounting: make the dummy counter interrupt- and MP-safe  (cheloha, 2020-07-02, 1 file, -2/+2)
  The dummy counter should be deterministic with respect to interrupts and
  multiple threads of execution.
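
  One way to make a counter like this deterministic under interrupts and on
  multiple CPUs is an atomic increment; the following is only a sketch of that
  idea (names and structure are assumed, not the committed diff):

	#include <sys/types.h>
	#include <sys/atomic.h>
	#include <sys/timetc.h>

	/*
	 * Sketch: a dummy timecounter that simply counts calls.  The atomic
	 * increment keeps the sequence consistent even when the function is
	 * entered concurrently from interrupt context or from several CPUs.
	 */
	static u_int dummy_count;

	static u_int
	dummy_get_timecount(struct timecounter *tc)
	{
		return atomic_inc_int_nv(&dummy_count);
	}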
* Bring back revision 1.122 with a fix preventing a use-after-free by  (anton, 2020-06-29, 1 file, -52/+55)
  serializing calls to pipe_buffer_free(). Repeating the previous commit
  message:

  Instead of performing three distinct allocations per created pipe, reduce it
  to a single one. Not only should this be more performant, it also solves a
  kqueue related issue found by visa@ who also requested this change: if you
  attach an EVFILT_WRITE filter to a pipe fd, the knote gets added to the
  peer's klist. This is a problem for kqueue because if you close the peer's
  fd, the knote is left in the list whose head is about to be freed.
  knote_fdclose() is not able to clear the knote because it is not registered
  with the peer's fd.

  FreeBSD also takes a similar approach to pipe allocations.

  once again ok mpi@ visa@
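
  The shape of the single-allocation approach is a structure that holds both
  pipe ends plus whatever state they share, so one allocation (and one free)
  covers the whole pair; a sketch of the idea, with field names assumed rather
  than taken from the committed layout:

	#include <sys/param.h>
	#include <sys/rwlock.h>
	#include <sys/pipe.h>

	/*
	 * Sketch: both pipe ends live in one allocation, so closing one end
	 * cannot free a klist head that the peer's knotes still point into.
	 */
	struct pipe_pair {
		struct pipe	pp_wpipe;	/* write end */
		struct pipe	pp_rpipe;	/* read end */
		struct rwlock	pp_lock;	/* state shared by both ends */
	};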
* timecounting: deprecate time_second(9), time_uptime(9)  (cheloha, 2020-06-26, 1 file, -1/+5)
  time_second(9) has been replaced in the kernel by gettime(9).
  time_uptime(9) has been replaced in the kernel by getuptime(9).

  New code should use the replacement interfaces. They do not suffer from the
  split-read problem inherent to the time_* variables on 32-bit platforms.

  The variables remain in sys/kern/kern_tc.c for use via kvm(3) when examining
  kernel core dumps.

  This commit completes the deprecation process:

  - Remove the extern'd definitions for time_second and time_uptime from
    sys/time.h.
  - Replace manpage cross-references to time_second(9)/time_uptime(9) with
    references to microtime(9) or a related interface.
  - Move the time_second.9 manpage to the attic.

  With input from dlg@, kettenis@, visa@, and tedu@.

  ok kettenis@
* kernel: use gettime(9)/getuptime(9) in lieu of time_second(9)/time_uptime(9)  (cheloha, 2020-06-24, 5 files, -19/+19)
  time_second(9) and time_uptime(9) are widely used in the kernel to quickly
  get the system UTC or system uptime as a time_t. However, time_t is 64-bit
  everywhere, so it is not generally safe to use them on 32-bit platforms: you
  have a split-read problem if your hardware cannot perform atomic 64-bit
  reads.

  This patch replaces time_second(9) with gettime(9), a safer successor
  interface, throughout the kernel. Similarly, time_uptime(9) is replaced with
  getuptime(9).

  There is a performance cost on 32-bit platforms in exchange for eliminating
  the split-read problem: instead of two register reads you now have a
  lockless read loop to pull the values from the timehands. This is really not
  *too* bad in the grand scheme of things, but compared to what we were doing
  before it is several times slower. There is no performance cost on 64-bit
  (__LP64__) platforms.

  With input from visa@, dlg@, and tedu@. Several bugs squashed by visa@.

  ok kettenis@
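
  The substitution itself is mechanical wherever only whole-second resolution
  is needed; a sketch of the before/after pattern (the helper and its
  arguments are made up for illustration):

	#include <sys/time.h>		/* gettime(9), getuptime(9) */

	/* Hypothetical helper: have "sec" seconds of uptime passed since "then"? */
	int
	elapsed(time_t then, time_t sec)
	{
		/* was: return (time_uptime - then >= sec);  split-read on 32-bit */
		return (getuptime() - then >= sec);
	}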
* add intrmap_one, some temp code to help us write pci_intr_establish_cpu.  (dlg, 2020-06-23, 1 file, -1/+15)
  it means we can do quick hacks to existing drivers to test interrupts on
  multiple cpus. emphasis on quick and hacks.

  ok jmatthew@, who will also ok the removal of it at the right time.
* timecounting: add gettime(9), getuptime(9)  (cheloha, 2020-06-22, 1 file, -1/+45)
  time_second and time_uptime are used widely in the tree. This is a problem
  on 32-bit platforms because time_t is 64-bit, so there is a potential
  split-read whenever they are used at or below IPL_CLOCK.

  Here are two replacement interfaces: gettime(9) and getuptime(9). The "get"
  prefix signifies that they do not read the hardware timecounter, i.e. they
  are fast and low-res. The lack of a unit (e.g. micro, nano) signifies that
  they yield a plain time_t.

  As an optimization on LP64 platforms we can just return time_second or
  time_uptime, as a single read is atomic. On 32-bit platforms we need to do
  the lockless read loop and get the values from the timehands.

  In a subsequent diff these will be substituted for time_second and
  time_uptime almost everywhere in the kernel.

  With input from visa@ and dlg@.

  ok kettenis@
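
  A sketch of the shape such a function can take, using the generation-count
  pattern the timecounter code already relies on; the timehands fields named
  here (th_generation, th_microtime) are assumptions, and the snippet only
  makes sense inside kern_tc.c where those internals are visible:

	#include <sys/atomic.h>		/* membar_consumer() */

	time_t
	gettime(void)
	{
	#if defined(__LP64__)
		return time_second;	/* a single aligned 64-bit read is atomic */
	#else
		struct timehands *th;
		time_t now;
		u_int gen;

		/* lockless read loop: retry if tc_windup() changed the hands */
		do {
			th = timehands;
			gen = th->th_generation;
			membar_consumer();
			now = th->th_microtime.tv_sec;
			membar_consumer();
		} while (gen == 0 || gen != th->th_generation);

		return now;
	#endif
	}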
* inittodr(9): introduce dedicated flag to enable writes from resettodr(9)  (cheloha, 2020-06-22, 1 file, -2/+9)
  We don't want resettodr(9) to write the RTC until inittodr(9) has actually
  run. Until inittodr(9) calls tc_setclock() the system UTC clock will contain
  a meaningless value and there's no sense in overwriting a good value with a
  value we know is nonsense. This is not an uncommon problem if you're
  debugging a problem in early boot, e.g. a panic that occurs prior to
  inittodr(9).

  Currently we use the following logic in resettodr(9) to inhibit writes:

	if (time_second == 1)
		return;

  ... this is too magical. A better way to accomplish the same thing is to
  introduce a dedicated flag set from inittodr(9). Hence, "inittodr_done".

  Suggested by visa@.

  ok kettenis@
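
  A minimal sketch of the flag approach (the surrounding RTC plumbing is
  elided and the exact placement is assumed):

	/* Sketch: resettodr(9) refuses to touch the RTC until inittodr(9) ran. */
	int inittodr_done;

	void
	inittodr(time_t base)
	{
		/* ... read the RTC and tc_setclock() the result ... */
		inittodr_done = 1;
	}

	void
	resettodr(void)
	{
		if (!inittodr_done)
			return;	/* clock never initialized; don't clobber the RTC */
		/* ... write the system time back to the RTC ... */
	}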
* Extend kqueue interface with EVFILT_EXCEPT filter.  (mpi, 2020-06-22, 3 files, -5/+43)
  This filter, already implemented in macOS and Dragonfly BSD, returns
  exceptional conditions like the reception of out-of-band data.

  The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and it can be
  used by the kqfilter-based poll & select implementation.

  ok millert@ on a previous version, ok visa@
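
  From userland the filter is registered like any other; a sketch of watching
  a connected socket for out-of-band data, assuming the NOTE_OOB fflag used by
  the macOS/Dragonfly implementations:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <err.h>

	void
	watch_oob(int kq, int fd)
	{
		struct kevent kev;

		/* register interest in exceptional conditions on fd */
		EV_SET(&kev, fd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
		if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
			err(1, "kevent");
	}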
* there's not going to be any whole kernel wide network livelocks soon.  (dlg, 2020-06-22, 1 file, -3/+2)
* add mq_push. it's like mq_enqueue, but drops from the head, not the tail.  (dlg, 2020-06-21, 1 file, -1/+20)
  from Matt Dunwoodie and Jason A. Donenfeld
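
  A sketch of how a head-dropping enqueue on an mbuf_queue can look, following
  the existing mq_enqueue() pattern; the details here are assumptions rather
  than the committed code:

	#include <sys/mbuf.h>

	/*
	 * Sketch: like mq_enqueue(), but when the queue is full the oldest
	 * mbuf (the head) is dropped to make room for the new one.  Returns
	 * nonzero if something was dropped.
	 */
	int
	mq_push(struct mbuf_queue *mq, struct mbuf *m)
	{
		struct mbuf *dropped = NULL;

		mtx_enter(&mq->mq_mtx);
		if (ml_len(&mq->mq_list) >= mq->mq_maxlen) {
			mq->mq_drops++;
			dropped = ml_dequeue(&mq->mq_list);	/* drop the head */
		}
		ml_enqueue(&mq->mq_list, m);
		mtx_leave(&mq->mq_mtx);

		if (dropped != NULL)
			m_freem(dropped);

		return (dropped != NULL);
	}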
* backout pipe change, it crashes some arch  (deraadt, 2020-06-19, 1 file, -53/+51)
* Compare `so' and `sosp' types just after obtaining `sosp'. We can't splice  (mvs, 2020-06-18, 1 file, -5/+5)
  sockets from different domains, so there is no reason to have locking and
  memory allocation in this error path. Also in this case only `so' will be
  locked by solock(), so we should avoid `sosp' modification.

  ok mpi@
* Instead of performing three distinct allocations per created pipe,  (anton, 2020-06-17, 1 file, -51/+53)
  reduce it to a single one. Not only should this be more performant, it also
  solves a kqueue related issue found by visa@ who also requested this change:
  if you attach an EVFILT_WRITE filter to a pipe fd, the knote gets added to
  the peer's klist. This is a problem for kqueue because if you close the
  peer's fd, the knote is left in the list whose head is about to be freed.
  knote_fdclose() is not able to clear the knote because it is not registered
  with the peer's fd.

  FreeBSD also takes a similar approach to pipe allocations.

  ok mpi@ visa@
* make intrmap_cpu return a struct cpu_info *, not a "cpuid number" thing.  (dlg, 2020-06-17, 1 file, -8/+8)
  requested by kettenis@
  discussed with jmatthew@
* add intrmap, an api that picks cpus for devices to attach interrupts to.  (dlg, 2020-06-17, 1 file, -0/+347)
  there's been discussions for years (and even some diffs!) about how we
  should let drivers establish interrupts on multiple cpus.

  the simple approach is to let every driver look at the number of cpus in a
  box and just pin an interrupt on it, which is what pretty much everyone else
  started with, but we have never seemed to get past bikeshedding about. from
  what i can tell, the principal objections to this are:

  1. interrupts will tend to land on low numbered cpus. ie, if drivers try to
     establish n interrupts on m cpus, they'll start at cpu 0 and go to cpu n,
     which means cpu 0 will end up with more interrupts than cpu m-1.

  2. some cpus shouldn't be used for interrupts. why a cpu should or shouldn't
     be used for interrupts can be pretty arbitrary, but in practical terms
     i'm going to borrow from the scheduler and say that we shouldn't run work
     on hyperthreads.

  3. making all the drivers make the same decisions about the above is a lot
     of maintenance overhead. either we will have a bunch of inconsistencies,
     or we'll have a lot of untested commits to keep everything the same.

  my proposed solution to the above is this diff to provide the intrmap api.
  drivers that want to establish multiple interrupts ask the api for a set of
  cpus it can use, and the api considers the above issues when generating a
  set of cpus for the driver to use. drivers then establish interrupts on cpus
  with the info provided by the map.

  it is based on the if_ringmap api in dragonflybsd, but generalised so it
  could be used by something like nvme(4) in the future.

  this version provides numeric ids for CPUs to drivers, but as kettenis@ has
  been pointing out for a very long time, it makes more sense to use cpu_info
  pointers. i'll be updating the code to address that shortly.

  discussed with deraadt@ and jmatthew@
  ok claudio@ patrick@ kettenis@
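
  A sketch of the intended driver-side usage; intrmap_cpu() and intrmap_one()
  appear in the entries above, while intrmap_create()/intrmap_count(), the
  flags argument, and the driver glue are assumptions for illustration:

	#include <sys/intrmap.h>

	void
	drv_map_queues(struct device *dv, struct intrmap **imp, unsigned int nqueues)
	{
		struct intrmap *im;
		unsigned int i;

		/* ask for up to nqueues interrupts spread over suitable CPUs */
		im = intrmap_create(dv, nqueues, 16 /* made-up hw limit */, 0);

		for (i = 0; i < intrmap_count(im); i++) {
			/*
			 * Establish queue i's interrupt on intrmap_cpu(im, i),
			 * e.g. via pci_intr_establish_cpu() (see the
			 * intrmap_one entry above); that glue is driver
			 * specific and omitted here.
			 */
		}

		*imp = im;
	}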
* wire stoeplitz code into the tree.  (dlg, 2020-06-16, 1 file, -1/+10)
* Implement a simple kqfilter for deadfs matching its poll handler.  (mpi, 2020-06-15, 1 file, -2/+4)
  ok visa@, millert@
* Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.  (mpi, 2020-06-15, 4 files, -4/+20)
  This is only done in poll-compatibility mode, when __EV_POLL is set.

  ok visa@, millert@
* Raise SPL when modifying ps_klist to prevent a race with interrupts.  (visa, 2020-06-15, 2 files, -3/+15)
  The list can be accessed from interrupt context if a signal is sent from an
  interrupt handler.

  OK anton@ cheloha@ mpi@
* Remove misleading XXX about locking of ps_klist. All of the kqueue  (visa, 2020-06-14, 2 files, -5/+2)
  subsystem and ps_klist handling still run under the kernel lock.
* Revert addition of double underbars for filter-specific flag.  (mpi, 2020-06-12, 1 file, -4/+4)
  Port breakages reported by naddy@
* Move FRELE() outside fdplock in dup*(2) code. This avoids a potential  (visa, 2020-06-11, 1 file, -4/+7)
  lock order issue with the file close path. The FRELE() can trigger the close
  path during dup*(2) if another thread manages to close the file descriptor
  simultaneously. This race is possible because the file reference is taken
  before the file descriptor table is locked for write access.

  Vitaliy Makkoveev agrees

  OK anton@ mpi@
* Rename poll-compatibility flag to better reflect what it is.  (mpi, 2020-06-11, 3 files, -8/+8)
  While here prefix kernel-only EV flags with two underbars.

  Suggested by kettenis@, ok visa@
* Make spec_kqfilter() and cttykqfilter() behave like their corresponding  (mpi, 2020-06-11, 2 files, -8/+16)
  poll handler if the EV_OLDAPI flag is set.

  ok visa@
* whitespace and speeling fix in a comment. no functional change.  (dlg, 2020-06-11, 1 file, -4/+4)
* make taskq_barrier wait for pending tasks, not just the running tasks.  (dlg, 2020-06-11, 1 file, -2/+2)
  I wrote taskq_barrier with the behaviour described in the manpage:

	taskq_barrier() guarantees that any task that was running on the tq
	taskq when the barrier was called has finished by the time the
	barrier returns.

  Note that it talks about the currently running task, not pending tasks. It
  just so happens that the original implementation just used task_add to put
  a condvar on the list and waited for it to run. Because task_add uses
  TAILQ_INSERT_TAIL, you ended up waiting for all pending work to run too,
  not just the currently running task.

  The new implementation took advantage of already holding the lock and used
  TAILQ_INSERT_HEAD to put the barrier work at the front of the queue so it
  would run next, which is closer to the stated behaviour. Using the tail
  insert here restores the previous accidental behaviour.

  jsg@ points out the following:

  > The linux functions like flush_workqueue() we use this for in drm want
  > to wait on all scheduled work not just currently running.
  >
  > ie a comment from one of the linux functions:
  >
  > /**
  >  * flush_workqueue - ensure that any scheduled work has run to completion.
  >  * @wq: workqueue to flush
  >  *
  >  * This function sleeps until all work items which were queued on entry
  >  * have finished execution, but it is not livelocked by new incoming ones.
  >  */
  >
  > our implementation of this in drm is
  >
  > void
  > flush_workqueue(struct workqueue_struct *wq)
  > {
  >	if (cold)
  >		return;
  >
  >	taskq_barrier((struct taskq *)wq);
  > }

  I don't think it's worth complicating the taskq API, so I'm just going to
  make taskq_barrier wait for pending work too.

  tested by tb@
  ok jsg@
* get rid of a vestigial bit of the sbartq.  (dlg, 2020-06-11, 1 file, -5/+1)
  i should have removed the sbartq pointer in r1.47 when i removed the sbartq.
* Move closef() outside fdplock() in sys_socketpair(). This prevents  (visa, 2020-06-10, 1 file, -6/+11)
  a lock order problem with altered locking of UNIX domain sockets. closef()
  does not need the file descriptor table lock.

  From Vitaliy Makkoveev

  OK mpi@
* add support for running taskq_barrier from a task inside the taskq.  (dlg, 2020-06-07, 1 file, -62/+131)
  this is required for an upcoming drm update, where the linux workqueue api
  that supports this is mapped to our taskq api.

  the main way taskqs support that is to have the taskq worker threads record
  their curproc on the taskq, so taskq_barrier calls can iterate over that
  list looking for their own curproc. if a barrier's curproc is in the list,
  it must be running inside the taskq, and should pretend that it's a barrier
  task.

  this also supports concurrent barrier calls by having the taskq recognise
  the situation and have the barriers work together rather than deadlocking.
  they end up trying to share the work of getting the barrier tasks onto the
  workers. once all the workers (or in-tq barriers) have rendezvoused, the
  barrier calls unwind, and the last one out lets the other barriers and
  barrier tasks return.

  all this barrier logic is implemented in the barrier code; it takes the
  existing multiworker handling out of the actual taskq loop.

  thanks to jsg@ for testing this and previous versions of the diff.

  ok visa@ kettenis@
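
  The "am I running inside this taskq?" test described above boils down to
  comparing curproc against the workers the taskq has recorded; a sketch of
  that check, with the tq_nthreads/tq_workers field names assumed:

	#include <sys/param.h>
	#include <sys/proc.h>
	#include <sys/task.h>

	/*
	 * Sketch: return nonzero if "p" is one of the worker threads this
	 * taskq recorded at start-up.  A barrier called from such a thread
	 * must not wait for itself; it acts as a barrier task instead.
	 */
	static int
	taskq_member(struct taskq *tq, struct proc *p)
	{
		unsigned int i;

		for (i = 0; i < tq->tq_nthreads; i++) {
			if (tq->tq_workers[i] == p)
				return (1);
		}
		return (0);
	}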
* In automatic performance mode on systems with offline CPUs because of SMT  (solene, 2020-05-30, 1 file, -1/+3)
  mitigation the algorithm was still accounting the offline CPUs, leading to a
  code path that would never be reached. This should allow better frequency
  scaling on systems with many CPUs.

  The frequency should scale up if one of two conditions is true:

  - at least one CPU has less than 25% of idle cpu time
  - the average of all idle time is under 33%

  The second condition was never met because offline CPUs are always accounted
  as 100% idle.

  A bit more explanation about the auto scaling in case someone wants to
  improve this later: when one condition is met, CPU frequency is set to
  maximum and a counter set to 5, then the function will be run again 100ms
  later and decrement the counter if neither condition is met anymore. Once
  the counter reaches 0 the frequency is set to minimum. This means it can
  take up to 100ms to scale up and up to 500ms to scale down.

  ok brynet@
  looks good tedu@
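
  A sketch of the scale-up test described above, restricted to online CPUs;
  the 25%/33% thresholds come from the commit text, while the data layout and
  names are assumptions (see the sketch below):

	/*
	 * Sketch: should automatic performance mode jump to the maximum
	 * frequency?  Only online CPUs are considered, so SMT siblings
	 * disabled by the mitigation no longer drag the average idle time up.
	 */
	static int
	perfpolicy_wants_max(const int *idle_pct, int ncpu_online)
	{
		int i, total_idle = 0;

		for (i = 0; i < ncpu_online; i++) {
			if (idle_pct[i] < 25)	/* one busy CPU is enough */
				return (1);
			total_idle += idle_pct[i];
		}
		/* ...or the online CPUs are, on average, less than 33% idle */
		return (total_idle / ncpu_online < 33);
	}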
* Introduce kqueue_terminate() & kqueue_free(), no functional changes.  (mpi, 2020-05-30, 1 file, -12/+38)
  These functions will be used to manage per-thread kqueues that are not
  associated to a file descriptor.

  ok visa@
* dev/rndvar.h no longer has statistical interfaces (removed during various  (deraadt, 2020-05-29, 4 files, -9/+4)
  conversion steps). it only contains kernel prototypes for 4 interfaces, all
  of which legitimately belong in sys/systm.h, which is already included by
  all enqueue_randomness() users.
* rndvar.h not needed here  (deraadt, 2020-05-29, 1 file, -2/+1)
* File allocation in socket(2) & socketpair(2) no longer needs the KERNEL_LOCK().  (mpi, 2020-05-28, 1 file, -6/+1)
  Bring the two syscalls in sync with recent MP changes in the file layer.

  Inconsistency pointed out by haesbaert@.

  ok anton@, visa@
* Revert "Add kqueue_scan_state struct"  (visa, 2020-05-25, 1 file, -72/+26)
  sthen@ has reported that the patch might be causing hangs with X.
* Pass bootblock indicator RB_GOODRANDOM to random_start(). Future work  (deraadt, 2020-05-25, 1 file, -2/+2)
  will frantically compensate.

  ok kettenis
* Add missing ICANON check in filt_ptcwrite().  (mpi, 2020-05-21, 1 file, -2/+3)
  ok millert@, visa@
* clock_gettime(2): use nanoruntime(9) to get value for CLOCK_UPTIME  (cheloha, 2020-05-20, 1 file, -5/+2)
* timecounting: decide whether to advance offset within tc_windup()  (cheloha, 2020-05-20, 1 file, -25/+44)
  When we resume from a suspend we use the time from the RTC to advance the
  system offset. This changes the UTC to match what the RTC has given us while
  increasing the system uptime to account for the time we were suspended.

  Currently we decide whether to change to the RTC time in tc_setclock() by
  comparing the new offset with the th_offset member. This is wrong. th_offset
  is the *minimum* possible value for the offset, not the "real offset". We
  need to perform the comparison within tc_windup() after updating th_offset,
  otherwise we might rewind said offset.

  Because we're now doing the comparison within tc_windup() we ought to move
  naptime into the timehands. This means we now need a way to safely read the
  naptime to compute the value of CLOCK_UPTIME for userspace. Enter
  nanoruntime(9); it increases monotonically from boot but does not jump
  forward after a resume like nanouptime(9).
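
  From that description, runtime is simply uptime with the accumulated suspend
  time (the naptime) subtracted; a sketch of the relationship, where
  read_naptime() is a hypothetical accessor standing in for however the value
  is safely read out of the timehands:

	#include <sys/time.h>

	void
	nanoruntime(struct timespec *ts)
	{
		struct timespec napped;

		nanouptime(ts);			/* monotonic, includes suspend */
		read_naptime(&napped);		/* hypothetical safe accessor */
		timespecsub(ts, &napped, ts);	/* runtime = uptime - naptime */
	}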
* Add function for attaching RTC drivers, to reduce direct use  (visa, 2020-05-17, 1 file, -1/+7)
  of todr_handle.

  OK kettenis@
* Add kqueue_scan_state struct  (visa, 2020-05-17, 1 file, -26/+72)
  The struct keeps track of the end point of an event queue scan by persisting
  the end marker. This will be needed when kqueue_scan() is called repeatedly
  to complete a scan in a piecewise fashion. The end marker has to be
  preserved between calls because otherwise the scan might collect an event
  more than once. If a collected event gets reactivated during scanning, it
  will be added at the tail of the queue, out of reach because of the end
  marker.

  OK mpi@
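
  A sketch of what such a scan state might carry in order to persist the
  markers across piecewise kqueue_scan() calls; the field names are assumed:

	#include <sys/event.h>

	/*
	 * Sketch: state carried across repeated kqueue_scan() calls.  The end
	 * marker knote stays on the queue between calls, so events that get
	 * reactivated mid-scan are queued behind it and not collected twice.
	 */
	struct kqueue_scan_state {
		struct kqueue	*kqs_kq;	/* kqueue being scanned */
		struct knote	 kqs_start;	/* start marker for this pass */
		struct knote	 kqs_end;	/* persistent end-of-scan marker */
		int		 kqs_queued;	/* is the end marker on the queue? */
	};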
* Make inittodr() and resettodr() MI.  (kettenis, 2020-05-16, 1 file, -1/+91)
  ok deraadt@, mpi@, visa@
  ok cheloha@ as well (would have preferred a new file for this code)
* Do not wait indefinitely for flushing when closing a tty.  (mpi, 2020-05-08, 1 file, -6/+24)
  This prevents exiting processes from hanging when a slave pseudo terminal
  is close(2)d before its master.

  From NetBSD via anton@.

  Reported-by: syzbot+2ed25b5c40d11e4c3beb@syzkaller.appspotmail.com

  ok anton@, kettenis@
* Ensure that if we are doing a delayed write with a NOCACHE buffer, we  (beck, 2020-04-29, 1 file, -1/+2)
  clear the NOCACHE flag, since if we are doing a delayed write the buffer
  must be cached or it is thrown away when the "write" is done. fixes vnd on
  mfs regress tests.

  ok kettenis@ deraadt@
* Fix panic message.  (kettenis, 2020-04-15, 1 file, -2/+2)
  ok millert@, deraadt@
* In sosplice(), temporarily release the socket lock before calling  (anton, 2020-04-12, 1 file, -1/+9)
  FRELE() as the last reference could be dropped which in turn will cause
  soclose() to be called where the socket lock is unconditionally acquired.
  Note that this is only a problem for sockets protected by the non-recursive
  NET_LOCK() right now.

  ok mpi@ visa@

  Reported-by: syzbot+7c805a09545d997b924d@syzkaller.appspotmail.com
* Add soassertlocked() checks to sbappend() and sbappendaddr(). This brings  (claudio, 2020-04-11, 1 file, -1/+4)
  them in line with sbappendstream() and sbappendrecord().

  Agreed by mpi@
* Make fifo_kqfilter() honor FREAD|FWRITE just like fifo_poll() does.  (mpi, 2020-04-08, 3 files, -6/+7)
  Prevent generating events that do not correspond to how the fifo has been
  opened.

  ok visa@, millert@
* Abstract the head of knote lists. This allows extending the lists,  (visa, 2020-04-07, 7 files, -45/+64)
  for example, with locking assertions.

  OK mpi@, anton@
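
  The abstraction amounts to wrapping the bare SLIST head in its own type so
  helpers, and later lock assertions, have somewhere to live; a sketch of the
  idea, with the wrapper contents assumed:

	#include <sys/queue.h>

	struct knote;

	/*
	 * Sketch: instead of embedding SLIST_HEAD(, knote) directly in other
	 * structures, wrap it so the representation can grow (a lock pointer,
	 * assertions) without touching every user.
	 */
	struct klist {
		SLIST_HEAD(, knote) kl_list;
	};

	/* hypothetical helper built on the wrapper */
	#define klist_empty(kl)		SLIST_EMPTY(&(kl)->kl_list)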
* Defer selwakeup() from kqueue_wakeup() to kqueue_task() to prevent  (visa, 2020-04-07, 1 file, -6/+8)
  deep recursion. This also helps making kqueue_wakeup() free of the kernel
  lock because the current implementation of selwakeup() requires the lock.

  OK mpi@
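
  The deferral pattern is to hand the wakeup to a task that runs in process
  context instead of calling selwakeup() directly; a sketch of the shape, with
  the kqueue fields (kq_state, kq_task) treated as illustrative names:

	#include <sys/param.h>
	#include <sys/task.h>

	static void
	kqueue_wakeup(struct kqueue *kq)
	{
		if (kq->kq_state & KQ_SLEEP) {
			kq->kq_state &= ~KQ_SLEEP;
			wakeup(kq);
		}
		if (kq->kq_state & KQ_SEL) {
			/* selwakeup() wants the kernel lock and can recurse,
			 * so run it later from kqueue_task() instead */
			task_add(systq, &kq->kq_task);
		}
	}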