path: root/sys/kern
...
* Refactor klist insertion and removal (visa, 2020-12-25, 7 files, -28/+48)
  Rename klist_{insert,remove}() to klist_{insert,remove}_locked(). These functions assume that the caller has locked the klist. The current state of locking remains intact because the kernel lock is still used with all klists.
  Add new functions klist_insert() and klist_remove() that lock the klist internally. This allows some code simplification.
  OK mpi@
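  As an illustration (not part of the commit), a minimal sketch of how the two variants divide the work; the softc and helper names below are hypothetical:

      #include <sys/event.h>

      /* Hypothetical monitored object with an embedded klist. */
      struct foo_softc {
              struct klist    sc_klist;
      };

      void
      foo_attach_knote(struct foo_softc *sc, struct knote *kn)
      {
              /* Caller does not hold the klist's lock: use the self-locking form. */
              klist_insert(&sc->sc_klist, kn);
      }

      void
      foo_detach_knote_locked(struct foo_softc *sc, struct knote *kn)
      {
              /* Caller already holds the lock covering the klist. */
              klist_remove_locked(&sc->sc_klist, kn);
      }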
* Small smr_grace_wait() optimization (visa, 2020-12-25, 1 file, -6/+26)
  Make the SMR thread maintain an explicit system-wide grace period and make CPUs observe the current grace period when crossing a quiescent state. This lets the SMR thread avoid a forced context switch for CPUs that have already entered the latest grace period.
  This change provides a small improvement in smr_grace_wait()'s performance in terms of context switching.
  OK mpi@, anton@
* tsleep(9): add global "nowake" channel for threads avoiding wakeup(9) (cheloha, 2020-12-24, 1 file, -1/+9)
  It would be convenient if there were a channel a thread could sleep on to indicate they do not want any wakeup(9) broadcasts. The easiest way to do this is to add an "int nowake" to kern_synch.c and extern it in sys/systm.h. You use it like this:

      #include <sys/systm.h>

      tsleep_nsec(&nowake, ...);

  There is now no need to handroll a local dead channel, e.g.

      int chan;

      tsleep_nsec(&chan, ...);

  which expands the stack. Local dead channels will be replaced with &nowake in later patches.
  One possible problem with this "one global channel" approach is sleep queue congestion. If you have lots of threads sleeping on &nowake you might slow down a wakeup(9) on a different channel that hashes into the same queue. Unsure how much of a problem this actually is, if at all.
  NetBSD and FreeBSD have a "pause" interface in the kernel that chooses a suitable channel automatically. To keep things simple and avoid adding a new interface we will start with this global channel.
  Discussed with mpi@, claudio@, kettenis@, and deraadt@. Basically designed by kettenis@, who vetoed my other proposals. Bugs caught by deraadt@, tb@, and patrick@.
* sigsuspend(2): change wmesg from "pause" to "sigsusp" (cheloha, 2020-12-23, 1 file, -2/+2)
  Make it obvious where the thread is blocked. "pause" is ambiguous.
  Tweaked by kettenis@.
  Probably ok kettenis@.
* nanosleep(2): shorten wmesg from "nanosleep" to "nanoslp" (cheloha, 2020-12-23, 1 file, -2/+2)
  We only see 8 characters of wmesg in e.g. top(1), so shorten the string to fit.
  Indirectly prompted by kettenis@.
* Ensure that filt_dead() takes effect (visa, 2020-12-23, 1 file, -1/+2)
  Invoke dead_filtops' f_event callback in klist_invalidate() to ensure that filt_dead() modifies every invalidated knote. If a knote has EV_ONESHOT set in its event flags, kqueue_scan() will not call f_event.
  OK mpi@
* Clear error before each iteration in kqueue_scan() (visa, 2020-12-23, 1 file, -1/+3)
  This fixes a regression where kqueue_scan() may incorrectly return EWOULDBLOCK after a timeout.
  OK mpi@
* Implement select(2) and pselect(2) on top of kqueue. (mpi, 2020-12-22, 1 file, -55/+148)
  The given set of fds is converted to equivalent kevents using EV_SET(2) and passed to the scanning internals of kevent(2): kqueue_scan(). ktrace(1) will now output the converted kevents on top of the usual set bits to make it possible to spot errors in the conversion.
  This switch implies that select(2) and pselect(2) will now query the underlying kqfilters instead of the *_poll() routines.
  Based on similar work done on DragonFlyBSD with inputs from visa@, millert@, anton@, cheloha@, thanks!
  ok visa@
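  As an illustration (not part of the commit), a sketch of the per-fd conversion step described above, assuming a readable fd from the read set maps to an EVFILT_READ kevent; the function name is hypothetical:

      #include <sys/event.h>

      /* Translate one fd from a select(2) read set into a kevent. */
      void
      fd_to_kevent(int fd, struct kevent *kev)
      {
              EV_SET(kev, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
      }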
* Introduce klistops (visa, 2020-12-20, 1 file, -8/+156)
  This patch extends struct klist with a callback descriptor and an argument. The main purpose of this is to let the kqueue subsystem assert when a klist should be locked, and operate the klist lock in klist_invalidate().
  Access to a knote list of a kqueue-monitored object has to be serialized somehow. Because the object often has a lock for protecting its state, and because the object often acquires this lock at the latest in its f_event callback function, it makes sense to use this lock also for the knote lists. The existing uses of NOTE_SUBMIT already show a pattern that is likely to become more prevalent.
  There could be an embedded lock in klist. However, such a lock would be redundant in many cases. The code cannot rely on a single lock type (mutex, rwlock, something else) because the needs of monitored objects vary. In addition, an embedded lock would introduce new lock order constraints. Note that the patch does not rule out use of dedicated klist locks.
  The patch introduces a way to associate lock operations with a klist. The caller can provide a custom implementation, or use a ready-made interface with a mutex or rwlock. For compatibility with old code, the new code falls back to using the kernel lock if no specific klist initialization has been done. The existing code already relies on implicit initialization of klist.
  Sadly, this change increases the size of struct klist. dlg@ thinks this is not fatal, though.
  OK mpi@
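  As an illustration (not part of the commit), a sketch of reusing an object's mutex as its klist lock via the ready-made mutex interface; the initializer name klist_init_mutex() and the softc below are assumptions:

      #include <sys/event.h>
      #include <sys/mutex.h>

      struct foo_softc {
              struct mutex    sc_mtx;
              struct klist    sc_klist;
      };

      void
      foo_init(struct foo_softc *sc)
      {
              mtx_init(&sc->sc_mtx, IPL_MPFLOOR);
              /* The object's mutex now also serializes its knote list. */
              klist_init_mutex(&sc->sc_klist, &sc->sc_mtx);
      }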
* Add fd close notification for kqueue-based poll() and select() (visa, 2020-12-18, 1 file, -7/+38)
  When the file descriptor of an __EV_POLL-flagged knote is closed, post EBADF through the kqueue instance to the caller of kqueue_scan(). This lets kqueue-based poll() and select() preserve their current behaviour of returning EBADF when a polled file descriptor is closed concurrently.
  OK mpi@
* Make knote_{activate,remove}() internal to kern_event.c. (visa, 2020-12-18, 1 file, -1/+3)
  OK mpi@
* Remove kqueue_free() and use KQRELE() in kqpoll_exit(). (visa, 2020-12-16, 1 file, -11/+6)
  Because kqpoll instances are now linked to the file descriptor table, the freeing of kqpoll and ordinary kqueues is similar.
  Suggested by mpi@
* Link kqpoll instances to fd_kqlist. (visa, 2020-12-16, 1 file, -10/+14)
  This lets the system remove kqpoll-related event registrations when a file descriptor is closed.
  OK mpi@
* Use nkev in place of count in kqueue_scan(). (visa, 2020-12-15, 1 file, -7/+4)
  OK cheloha@, mpi@, mvs@
* Rename the macro MCLGETI to MCLGETL and remove the dead parameter ifp. (jan, 2020-12-12, 3 files, -13/+13)
  OK dlg@, bluhm@
  No Opinion mpi@
  Not against it claudio@
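  As an illustration (not part of the commit), the shape of the change at a typical call site, assuming the old macro took (m, how, ifp, len) and the new one simply drops the unused ifp; the surrounding helper is hypothetical:

      #include <sys/mbuf.h>

      struct mbuf *
      foo_rxbuf_alloc(void)
      {
              /* before: return MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES); */
              return MCLGETL(NULL, M_DONTWAIT, MCLBYTES);
      }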
* Simplify filt_pipedetach() (visa, 2020-12-11, 1 file, -18/+7)
  By storing the pipe pointer in kn_hook, filt_pipedetach() does not need extra logic to find the correct pipe instance. This also lets the kernel clear the knote lists fully.
  OK anton@, mpi@
* Use sysctl_int_bounded for sysctl_hwsetperf (gnezdo, 2020-12-10, 1 file, -13/+7)
  Removed some trailing whitespace while there.
  ok gkoehler@
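  As an illustration (not part of the commit), the general shape of such a conversion; the wrapper below and the exact sysctl_int_bounded() argument list (oldp, oldlenp, newp, newlen, variable, min, max) are assumptions:

      #include <sys/sysctl.h>

      extern int perflevel;   /* hw.setperf, assumed to live in [0, 100] */

      int
      hw_sysctl_setperf(void *oldp, size_t *oldlenp, void *newp, size_t newlen)
      {
              /* before (paraphrased): sysctl_int() plus hand-rolled range checks */
              return sysctl_int_bounded(oldp, oldlenp, newp, newlen,
                  &perflevel, 0, 100);
      }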
* Add kernel-only per-thread kqueue & helpers to initialize and free it. (mpi, 2020-12-09, 2 files, -3/+37)
  This will soon be used by select(2) and poll(2).
  ok anton@, visa@
* Convert the per-process thread list into a SMR_TAILQ. (mpi, 2020-12-07, 12 files, -39/+43)
  Currently all iterations are done under KERNEL_LOCK() and therefore use the *_LOCKED() variant.
  From and ok claudio@
* Refactor kqueue_scan() so it can be used by other syscalls. (mpi, 2020-12-07, 1 file, -48/+48)
  Stop iterating in the function and instead copy the returned events to userland after every call.
  ok visa@
* srp_finalize(9): tsleep(9) -> tsleep_nsec(9) (cheloha, 2020-12-06, 1 file, -2/+2)
  srp_finalize(9) spins until the refcount hits zero. Blocking for at least 1ms each iteration instead of blocking for at most 1 tick is sufficient.
  Discussed with mpi@.
  ok claudio@ jmatthew@
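  As an illustration (not part of the commit), the shape of a spin-until-zero loop after such a conversion; the structure, refcount accessor, and wmesg below are placeholders, not the real srp code:

      #include <sys/systm.h>
      #include <sys/time.h>

      void
      foo_finalize(struct foo *f)
      {
              while (foo_refcnt(f) != 0) {
                      /* before: tsleep(f, PWAIT, "foofini", 1);  -- at most one tick */
                      tsleep_nsec(f, PWAIT, "foofini", MSEC_TO_NSEC(1));
              }
      }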
* Convert sysctl_tc to sysctl_bounded_arr (gnezdo, 2020-12-05, 1 file, -7/+8)
  ok gkoehler@
* Prevent a TOCTOU race in single_thread_set() by extending the scope of the lock. (mpi, 2020-12-04, 2 files, -13/+27)
  Make sure `ps_single' is set only once by checking then updating it without releasing the lock.
  Analyzed by and ok claudio@
* Revert previous extension of the SCHED_LOCK(), the state isn't passed down. (mpi, 2020-12-02, 2 files, -27/+11)
  Panic reported by dhill@
* Prevent a TOCTOU race in single_thread_set() by extending the scope of the lock. (mpi, 2020-12-02, 2 files, -11/+27)
  Make sure `ps_single' is set only once by checking then updating it without releasing the lock.
  Analyzed by and ok claudio@
* Rather than skipping %[sizearg]n in the kernel, panic when it is encountered. (deraadt, 2020-11-28, 1 file, -13/+3)
  printf(9) already lacked documentation and needs no change.
* Change kqueue_scan() to keep track of collected events in the given context. (mpi, 2020-11-25, 1 file, -10/+31)
  It is now possible to call the function multiple times to collect events. For that, the end marker has to be preserved between calls because otherwise the scan might collect an event more than once. If a collected event gets reactivated during scanning, it will be added at the tail of the queue, out of reach because of the end marker.
  This is required to implement select(2) and poll(2) on top of kqueue_scan().
  Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.
  ok visa@, anton@
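  As an illustration (not part of the commit), a hypothetical caller loop showing why per-context state matters: each pass collects a batch and copies it out before scanning again. The context type and helper names below are placeholders, not the real kern_event.c identifiers:

      int
      collect_all(struct kqueue *kq, int maxevents)
      {
              struct scan_ctx ctx;            /* placeholder type */
              struct kevent kev[8];
              int nkev, total = 0;

              scan_setup(&ctx, kq);           /* placeholder setup */
              do {
                      nkev = scan_collect(&ctx, kev, nitems(kev));  /* placeholder */
                      if (nkev > 0) {
                              copy_events_out(kev, nkev);           /* placeholder */
                              total += nkev;
                      }
              } while (nkev > 0 && total < maxevents);
              scan_finish(&ctx);              /* placeholder teardown */
              return total;
      }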
* Convert sysctl_sysvsem to sysctl_int_bounded (gnezdo, 2020-11-19, 1 file, -65/+44)
  Performed a minor refactoring and removed a few trailing whitespaces.
  ok anton@
* Fix handling of MSG_PEEK in soreceive() for the case where an empty mbuf is encountered in a seqpacket socket. (claudio, 2020-11-17, 1 file, -6/+7)
  This diff uses the fact that setting orig_resid to 0 causes soreceive() to return instead of looping back with the intent to sleep for more data. orig_resid is now always set to 0 in the control message case (instead of only if controlp is defined). This is the same behaviour as for the PR_NAME case. Additionally orig_resid is set to 0 in the data reader when MSG_PEEK is used.
  Tested in snaps for a while and by anton@
  Reported-by: syzbot+4b0e9698b344b0028b14@syzkaller.appspotmail.com
* Convert sysctl_sysvsem to sysctl_bounded_args (gnezdo, 2020-11-17, 1 file, -81/+48)
  Used sysctl_int_bounded in many places to shrink code. Extracted a new function to make the case tidy. Removed some superfluous fluff.
  OK millert@
* Prevent exit status from being clobbered on thread exit. (jsing, 2020-11-16, 1 file, -2/+2)
  Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.
  It was previously possible for EXIT_NORMAL to be run twice, depending on which thread called exit() and the order in which the threads were torn down. This is due to the P_HASSIBLING() check triggering the last thread to run EXIT_NORMAL, even though it may have already been run via an exit() call.
  ok kettenis@ visa@
* Convert hw_sysctl to sysctl_bounded_args (gnezdo, 2020-11-16, 1 file, -12/+15)
  This one is surprisingly a minor loss if one were to simply add bytes on amd64 (.text + .data + .bss + .rodata):
      before: 0x64b0 + 0x40 + 0x14 + 0x338 = 0x683c
      after:  0x6440 + 0x48 + 0x14 + 0x3b8 = 0x6854
* Convert kern_sysctl to sysctl_bounded_args (gnezdo, 2020-11-16, 1 file, -103/+74)
  objdump -h changes in size of kern_sysctl.o on amd64:
                 before   after
      .text        7140    64b0
      .data          24      40
      .bss           10      14
      .rodata        50     338
* witness: detect and report uninitialized (or zeroed) lock usage (semarie, 2020-11-12, 1 file, -3/+15)
  ok visa@
* setitimer(2): ITIMER_REAL: protect state with per-process mutex ps_mtx (cheloha, 2020-11-10, 1 file, -9/+32)
  To unlock getitimer(2) and setitimer(2) we need to protect the per-process ITIMER_REAL state with something other than the kernel lock. As the ITIMER_REAL timeout callback realitexpire() runs at IPL_SOFTCLOCK the per-process mutex ps_mtx is appropriate.
  In setitimer() we need to use ps_mtx instead of the global itimer_mtx if the given timer is ITIMER_REAL. Easy.
  The ITIMER_REAL timeout callback routine realitexpire() is trickier. When we enter ps_mtx during the callback we need to check if the timer was cancelled or rescheduled. A thread from the process can call setitimer(2) at the exact moment the callback is about to run from timeout_run() (see kern_timeout.c).
  Update the locking annotation in sys/proc.h accordingly.
  ok anton@
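  As an illustration (not part of the commit), a simplified sketch of the recheck-under-mutex pattern described above; the disarm check is a placeholder and the signal delivery step is elided, so this is not the actual kern_time.c code:

      void
      realitexpire(void *arg)
      {
              struct process *pr = arg;

              mtx_enter(&pr->ps_mtx);
              /*
               * Recheck under ps_mtx: a thread of the process may have
               * cancelled or rescheduled the timer via setitimer(2) while
               * this callback was already queued by timeout_run().
               */
              if (itimer_is_disarmed(pr)) {   /* placeholder check */
                      mtx_leave(&pr->ps_mtx);
                      return;
              }
              /* ... deliver SIGALRM and rearm from it_interval ... */
              mtx_leave(&pr->ps_mtx);
      }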
* In case of failure, call sigexit() from trapsignal instead of sendsig(). (mpi, 2020-11-08, 1 file, -3/+11)
  Simplify MD code and reduce the amount of recursion into the signal code, which helps when dealing with locks.
  ok cheloha@, deraadt@
* Convert ffs_sysctl to sysctl_bounded_args (gnezdo, 2020-11-07, 1 file, -3/+9)
  Requires sysctl_bounded_arr branch to support sysctl_rdint. The read-only variables are marked by an empty range of [1, 0].
  OK millert@
* Add feature to force the selection of source IP address (denis, 2020-10-29, 1 file, -2/+2)
  Based on previous work and an idea from deraadt@
  Input from claudio@, djm@, deraadt@, sthen@
  OK deraadt@
* kevent(2): ktrace the timeout before validating it (cheloha, 2020-10-26, 1 file, -5/+5)
  As deraadt@ has pointed out, tracing timevals and timespecs before validating them makes debugging easier.
* Serialize msgbuf access with a mutex. (visa, 2020-10-25, 1 file, -28/+62)
  This introduces a system-wide mutex that serializes msgbuf operations. The mutex controls access to all modifiable fields of struct msgbuf. It also covers logsoftc.sc_state.
  To avoid adding extra lock order constraints that would affect use of printf(9), the code does not take new locks when the log mutex is held.
  The code assumes that there is at most one thread using logread(). This keeps the logic simple. If there was more than one reader, logread() might return the same data to different readers. Also, log wakeup might not be reliable with multiple threads.
  Tested in snaps for two weeks.
  OK mpi@
* setitimer(2): ITIMER_REAL: use kclock timeouts (cheloha, 2020-10-25, 2 files, -16/+9)
  Reimplement the ITIMER_REAL interval timer with a kclock timeout. Couple things of note:
  - We need to use the high-res nanouptime(9) call, not the low-res getnanouptime(9).
  - The code is simpler now that we aren't working with ticks.
  Misc. thoughts:
  - Still unsure if "kclock" is the right name for these things.
  - MP-safely cancelling a periodic timeout is very difficult.
* sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset (cheloha, 2020-10-23, 1 file, -2/+3)
  Even if we aren't setting a timeout, P_TIMEOUT should not be set at this point in the sleep.
  ok visa@
* timeout(9): fix compilation under NKCOV (cheloha, 2020-10-20, 1 file, -2/+2)
* Serialize accesses to "struct vmspace" and document its refcounting. (mpi, 2020-10-19, 2 files, -7/+6)
  The underlying vm_space lock is used as a substitute for the KERNEL_LOCK() in uvm_grow() to make sure `vm_ssize' is not corrupted.
  ok anton@, kettenis@
* timeout(9): basic support for kclock timeouts (cheloha, 2020-10-15, 1 file, -60/+338)
  A kclock timeout is a timeout that expires at an absolute time on one of the kernel's clocks. A timeout's absolute expiration time is kept in a new member of the timeout struct, to_abstime. The timeout's kclock is set at initialization and is kept in another new member of the timeout struct, to_kclock.
  Kclock timeouts are desirable because they have nanosecond resolution, regardless of the value of hz(9). The timecounter subsystem is also inherently NTP-sensitive, so timeouts scheduled against the subsystem are NTP-sensitive. These two qualities guarantee that a kclock timeout will never expire early.
  Currently there is support for one kclock, KCLOCK_UPTIME (the uptime clock). Support for KCLOCK_RUNTIME (the runtime clock) and KCLOCK_UTC (the UTC clock) is planned for the future.
  Support for these additional kclocks will allow us to implement some of the POSIX interfaces OpenBSD is missing, e.g. clock_nanosleep() and timer_create(). We could also use it to provide proper absolute timeouts for e.g. pthread_mutex_timedlock(3).
  Kclock timeouts are initialized with timeout_set_kclock(). They can be scheduled with either timeout_in_nsec() (relative timeout) or timeout_at_ts() (absolute timeout). They are incompatible with timeout_add(9), timeout_add_sec(9), timeout_add_msec(9), timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9). They can be cancelled with timeout_del(9) or timeout_del_barrier(9).
  Documentation for the new interfaces is a work in progress.
  For now, tick-based timeouts remain supported alongside kclock timeouts. They will remain supported until we are certain we don't need them anymore. It is possible we will never remove them. I would rather not keep them around forever, but I cannot predict what difficulties we will encounter while converting tick-based timeouts to kclock timeouts. There are a *lot* of timeouts in the kernel.
  Kclock timeouts are more costly than tick-based timeouts:
  - Calling timeout_in_nsec() incurs a call to nanouptime(9). Reading the hardware timecounter is too expensive in some contexts, so care must be taken when converting existing timeouts. We may add a flag in the future to cause timeout_in_nsec() to use getnanouptime(9) instead of nanouptime(9), which is much cheaper. This may be appropriate for certain classes of timeouts. tcp/ip session timeouts come to mind.
  - Kclock timeout expirations are kept in a timespec. Timespec arithmetic has more overhead than 32-bit tick arithmetic, so processing kclock timeouts during softclock() is more expensive. On my machine the overhead for processing a tick-based timeout is ~125 cycles. The overhead for a kclock timeout is ~500 cycles. The overhead difference on 32-bit platforms is unknown. If it proves too large we may need to use a 64-bit value to store the expiration time. More measurement is needed.
  Priority targets for conversion are setitimer(2), *sleep_nsec(9), and the kevent(2) EVFILT_TIMER timers. Others will follow.
  With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@, deraadt@, probably many others. Older version tested by visa@. Problems found in older version by bluhm@. Current version tested by Yuichiro Naito.
  "wait until after unlock" deraadt@, ok kettenis@
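  As an illustration (not part of the commit), a sketch of arming a kclock timeout with the interfaces named above; the exact parameter lists of timeout_set_kclock() and timeout_in_nsec() are assumptions:

      #include <sys/timeout.h>
      #include <sys/time.h>

      struct timeout foo_to;

      void
      foo_expire(void *arg)
      {
              /* runs from softclock() once the uptime clock passes to_abstime */
      }

      void
      foo_arm(void)
      {
              /* assumed argument order: (timeout, fn, arg, kclock, flags) */
              timeout_set_kclock(&foo_to, foo_expire, NULL, KCLOCK_UPTIME, 0);
              /* relative timeout: roughly 150 ms from now, never early */
              timeout_in_nsec(&foo_to, MSEC_TO_NSEC(150));
      }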
* _exit(2), execve(2): tweak per-process interval timer cancellation (cheloha, 2020-10-15, 3 files, -12/+10)
  If we fold the for-loop iterating over each interval timer into the helper function, the result is slightly tidier than what we have now. Rename the helper function "cancel_all_itimers".
  Based on input from millert@ and kettenis@.
* Stop asserting that the NET_LOCK() shouldn't be held in yield(). (mpi, 2020-10-15, 1 file, -3/+1)
  This creates too many false positives when setting pool_debug=2.
  Prodded by deraadt@, ok mvs@
* _exit(2), execve(2): cancel per-process interval timers safely (cheloha, 2020-10-15, 3 files, -12/+21)
  During _exit(2) and sometimes during execve(2) we need to cancel any active per-process interval timers. We don't currently do this in an MP-safe way. Both syscalls ignore the locking assumptions documented in proc.h.
  The easiest way to make them MP-safe is to use setitimer(), just like the getitimer(2) and setitimer(2) syscalls do. To make things a bit cleaner I have added a helper function, cancelitimer(), so the callers don't need to fuss with an itimerval struct.
  While we're here we can remove the splclock/splx dance from execve(2). It is no longer necessary.
  ok deraadt@
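  As an illustration (not part of the commit), what such a helper boils down to: cancel one timer by going through the same path as the setitimer(2) syscall with a zeroed value. The in-kernel setitimer() signature used here is an assumption:

      #include <sys/time.h>

      void
      cancelitimer(int which)
      {
              struct itimerval itv;

              timerclear(&itv.it_value);
              timerclear(&itv.it_interval);
              /* a zero it_value disarms the timer; it_interval is ignored */
              setitimer(which, &itv, NULL);
      }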
* delete strange historical FFS_SOFTUPDATES ifdef... (deraadt, 2020-10-14, 1 file, -4/+4)
  ok millert
* setitimer(2): zero itv.it_interval if itv.it_value is zero (cheloha, 2020-10-13, 1 file, -1/+3)
  If itv.it_value is zero we cancel the timer. When we cancel the timer we don't care about itv.it_interval because the timer is not running: we don't use it, we don't look at it, etc. To be on the paranoid side, I think we should zero itv.it_interval when itv.it_value is zero. No need to write arbitrary values into the process struct if we aren't required to.
  The standard is ambiguous about what should happen in this case, i.e. the value of olditv after the following code executes is unspecified:

      struct itimerval newitv, olditv;

      newitv.it_value.tv_sec = newitv.it_value.tv_usec = 0;
      newitv.it_interval.tv_sec = newitv.it_interval.tv_usec = 1;
      setitimer(ITIMER_REAL, &newitv, NULL);
      getitimer(ITIMER_REAL, &olditv);

  This change should not break any real code.