path: root/sys/kern
Commit message (author, date, files changed, lines -/+)
...
* Revert previous extension of the SCHED_LOCK(); the state isn't passed down. (mpi, 2020-12-02, 2 files, -27/+11)
| | | | Panic reported by dhill@
* Prevent a TOCTOU race in single_thread_set() by extending the scope of the lock. (mpi, 2020-12-02, 2 files, -11/+27)
| | | | | | | Make sure `ps_single' is set only once by checking then updating it without releasing the lock. Analyzed by and ok claudio@
* Rather than skipping %[sizearg]n in the kernel, panic when it is encountered. (deraadt, 2020-11-28, 1 file, -13/+3)
| | | | printf(9) already lacked documentation and needs no change.
* Change kqueue_scan() to keep track of collected events in the given context. (mpi, 2020-11-25, 1 file, -10/+31)
| It is now possible to call the function multiple times to collect events.
| For that, the end marker has to be preserved between calls because
| otherwise the scan might collect an event more than once. If a collected
| event gets reactivated during scanning, it will be added at the tail of
| the queue, out of reach because of the end marker.
| This is required to implement select(2) and poll(2) on top of
| kqueue_scan().
| Done & originally committed by visa@ in r1.143, in snap for more than
| 2 weeks.
| ok visa@, anton@
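A rough sketch of the piecewise call pattern this enables. The argument list and the setup/finish helpers are assumptions based on the message, not necessarily the committed API:

    /*
     * Hypothetical caller (e.g. a future select(2)/poll(2) backend)
     * collecting up to `maxevents' events in chunks.  The scan state
     * preserves the end marker across calls, so an event reactivated
     * behind the marker is not collected twice.  `kq', `kev',
     * `maxevents' and `timeout' come from the surrounding context.
     */
    struct kqueue_scan_state scan;
    int error = 0, n, total = 0;

    kqueue_scan_setup(&scan, kq);                   /* assumed helper */
    do {
            n = kqueue_scan(&scan, maxevents - total, kev + total,
                timeout, curproc, &error);
            total += n;
    } while (n > 0 && error == 0 && total < maxevents);
    kqueue_scan_finish(&scan);                      /* assumed helper */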
* Convert sysctl_sysvsem to sysctl_int_bounded (gnezdo, 2020-11-19, 1 file, -65/+44)
| Performed a minor refactoring and removed some trailing whitespace.
| ok anton@
* Fix handling of MSG_PEEK in soreceive() for the case where an empty mbuf is encountered in a seqpacket socket. (claudio, 2020-11-17, 1 file, -6/+7)
| This diff uses the fact that setting orig_resid to 0 causes soreceive()
| to return instead of looping back with the intent to sleep for more
| data. orig_resid is now always set to 0 in the control message case
| (instead of only if controlp is defined). This is the same behaviour as
| for the PR_NAME case. Additionally orig_resid is set to 0 in the data
| reader when MSG_PEEK is used.
| Tested in snaps for a while and by anton@
| Reported-by: syzbot+4b0e9698b344b0028b14@syzkaller.appspotmail.com
* Convert sysctl_sysvsem to sysctl_bounded_args (gnezdo, 2020-11-17, 1 file, -81/+48)
| Used sysctl_int_bounded in many places to shrink code. Extracted a new
| function to make the case tidy. Removed some superfluous fluff.
| OK millert@
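For readers unfamiliar with the interface, a minimal sketch of the sysctl_bounded_args pattern these conversions move to. The struct layout and dispatcher signature are assumed from the messages, not copied from the tree:

    /* Inside a handler such as sysctl_sysvsem(): one table entry per
     * integer mib, pairing a variable with its allowed write range. */
    const struct sysctl_bounded_args sem_vars[] = {
            { KERN_SEMINFO_SEMMNI, &seminfo.semmni, 1, INT_MAX },
            { KERN_SEMINFO_SEMMNS, &seminfo.semmns, 1, INT_MAX },
            /* read-only values are marked with the empty range [1, 0] */
            { KERN_SEMINFO_SEMVMX, &seminfo.semvmx, 1, 0 },
    };

    /* The dispatcher rejects writes outside [min, max], replacing the
     * hand-rolled bounds checks deleted by these commits. */
    return (sysctl_bounded_arr(sem_vars, nitems(sem_vars), name, namelen,
        oldp, oldlenp, newp, newlen));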
* Prevent exit status from being clobbered on thread exit. (jsing, 2020-11-16, 1 file, -2/+2)
| Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.
| It was previously possible for EXIT_NORMAL to be run twice, depending
| on which thread called exit() and the order in which the threads were
| torn down. This is due to the P_HASSIBLING() check triggering the last
| thread to run EXIT_NORMAL, even though it may have already been run
| via an exit() call.
| ok kettenis@ visa@
* Convert hw_sysctl to sysctl_bounded_args (gnezdo, 2020-11-16, 1 file, -12/+15)
| Surprisingly, this one is a minor size loss if one simply adds up the
| section bytes on amd64 (.text + .data + .bss + .rodata):
|   before: 0x64b0 + 0x40 + 0x14 + 0x338 = 0x683c
|   after:  0x6440 + 0x48 + 0x14 + 0x3b8 = 0x6854
* Convert kern_sysctl to sysctl_bounded_args (gnezdo, 2020-11-16, 1 file, -103/+74)
| objdump -h changes in size of kern_sysctl.o on amd64:
|             before  after
|   .text     7140    64b0
|   .data     24      40
|   .bss      10      14
|   .rodata   50      338
* witness: detect and report uninitialized (or zeroed) lock usage (semarie, 2020-11-12, 1 file, -3/+15)
| | | | ok visa@
* setitimer(2): ITIMER_REAL: protect state with per-process mutex ps_mtx (cheloha, 2020-11-10, 1 file, -9/+32)
| To unlock getitimer(2) and setitimer(2) we need to protect the
| per-process ITIMER_REAL state with something other than the kernel
| lock. As the ITIMER_REAL timeout callback realitexpire() runs at
| IPL_SOFTCLOCK, the per-process mutex ps_mtx is appropriate.
| In setitimer() we need to use ps_mtx instead of the global itimer_mtx
| if the given timer is ITIMER_REAL. Easy.
| The ITIMER_REAL timeout callback routine realitexpire() is trickier.
| When we enter ps_mtx during the callback we need to check if the timer
| was cancelled or rescheduled. A thread from the process can call
| setitimer(2) at the exact moment the callback is about to run from
| timeout_run() (see kern_timeout.c).
| Update the locking annotation in sys/proc.h accordingly.
| ok anton@
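A sketch of the re-check the message describes. Field names here are assumptions; the point is only that the callback takes ps_mtx and verifies the timer is still armed before acting:

    void
    realitexpire(void *arg)
    {
            struct process *pr = arg;

            mtx_enter(&pr->ps_mtx);
            if (!timespecisset(&pr->ps_timer[ITIMER_REAL].it_value)) {
                    /* Lost the race: setitimer(2) cancelled us while
                     * the timeout was already committed to run. */
                    mtx_leave(&pr->ps_mtx);
                    return;
            }
            /* ...deliver SIGALRM and, if periodic, rearm the timeout... */
            mtx_leave(&pr->ps_mtx);
    }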
* In case of failure, call sigexit() from trapsignal instead of sendsig(). (mpi, 2020-11-08, 1 file, -3/+11)
| Simplify MD code and reduce the amount of recursion into the signal
| code, which helps when dealing with locks.
| ok cheloha@, deraadt@
* Convert ffs_sysctl to sysctl_bounded_args (gnezdo, 2020-11-07, 1 file, -3/+9)
| | | | | | | Requires sysctl_bounded_arr branch to support sysctl_rdint. The read-only variables are marked by an empty range of [1, 0]. OK millert@
* Add feature to force the selection of source IP address (denis, 2020-10-29, 1 file, -2/+2)
| Based on previous work on an idea from deraadt@
| Input from claudio@, djm@, deraadt@, sthen@
| OK deraadt@
* kevent(2): ktrace the timeout before validating it (cheloha, 2020-10-26, 1 file, -5/+5)
| | | | | As deraadt@ has pointed out, tracing timevals and timespecs before validating them makes debugging easier.
* Serialize msgbuf access with a mutex. (visa, 2020-10-25, 1 file, -28/+62)
| This introduces a system-wide mutex that serializes msgbuf operations.
| The mutex controls access to all modifiable fields of struct msgbuf.
| It also covers logsoftc.sc_state.
| To avoid adding extra lock order constraints that would affect use of
| printf(9), the code does not take new locks when the log mutex is held.
| The code assumes that there is at most one thread using logread().
| This keeps the logic simple. If there was more than one reader,
| logread() might return the same data to different readers. Also, log
| wakeup might not be reliable with multiple threads.
| Tested in snaps for two weeks.
| OK mpi@
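A condensed sketch of the serialization, with the lock name and placement assumed; the real diff covers every modifiable field of struct msgbuf plus logsoftc.sc_state:

    struct mutex log_mtx = MUTEX_INITIALIZER(IPL_HIGH);

    void
    msgbuf_putchar(struct msgbuf *mbp, const char c)
    {
            mtx_enter(&log_mtx);
            mbp->msg_bufc[mbp->msg_bufx++] = c;     /* append to the ring */
            if (mbp->msg_bufx < 0 || mbp->msg_bufx >= mbp->msg_bufs)
                    mbp->msg_bufx = 0;              /* wrap around */
            /* no other locks taken here, to keep printf(9) safe */
            mtx_leave(&log_mtx);
    }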
* setitimer(2): ITIMER_REAL: use kclock timeouts (cheloha, 2020-10-25, 2 files, -16/+9)
| Reimplement the ITIMER_REAL interval timer with a kclock timeout.
| A couple of things of note:
| - We need to use the high-res nanouptime(9) call, not the low-res
|   getnanouptime(9).
| - The code is simpler now that we aren't working with ticks.
| Misc. thoughts:
| - Still unsure if "kclock" is the right name for these things.
| - MP-safely cancelling a periodic timeout is very difficult.
* sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset (cheloha, 2020-10-23, 1 file, -2/+3)
| | | | | | | Even if we aren't setting a timeout, P_TIMEOUT should not be set at this point in the sleep. ok visa@
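Presumably the assertion now fires unconditionally, something like this (placement within sleep_setup_timeout() assumed):

    /* P_TIMEOUT must never be set this early in the sleep,
     * whether or not a timeout is being armed. */
    KASSERT((p->p_flag & P_TIMEOUT) == 0);
    if (timo)
            timeout_add(&p->p_sleep_to, timo);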
* timeout(9): fix compilation under NKCOV (cheloha, 2020-10-20, 1 file, -2/+2)
|
* Serialize accesses to "struct vmspace" and document its refcounting. (mpi, 2020-10-19, 2 files, -7/+6)
| | | | | | | The underlying vm_space lock is used as a substitute to the KERNEL_LOCK() in uvm_grow() to make sure `vm_ssize' is not corrupted. ok anton@, kettenis@
* timeout(9): basic support for kclock timeouts (cheloha, 2020-10-15, 1 file, -60/+338)
| A kclock timeout is a timeout that expires at an absolute time on one
| of the kernel's clocks. A timeout's absolute expiration time is kept
| in a new member of the timeout struct, to_abstime. The timeout's
| kclock is set at initialization and is kept in another new member of
| the timeout struct, to_kclock.
| Kclock timeouts are desirable because they have nanosecond resolution,
| regardless of the value of hz(9). The timecounter subsystem is also
| inherently NTP-sensitive, so timeouts scheduled against the subsystem
| are NTP-sensitive. These two qualities guarantee that a kclock timeout
| will never expire early.
| Currently there is support for one kclock, KCLOCK_UPTIME (the uptime
| clock). Support for KCLOCK_RUNTIME (the runtime clock) and KCLOCK_UTC
| (the UTC clock) is planned for the future. Support for these
| additional kclocks will allow us to implement some of the POSIX
| interfaces OpenBSD is missing, e.g. clock_nanosleep() and
| timer_create(). We could also use it to provide proper absolute
| timeouts for e.g. pthread_mutex_timedlock(3).
| Kclock timeouts are initialized with timeout_set_kclock(). They can be
| scheduled with either timeout_in_nsec() (relative timeout) or
| timeout_at_ts() (absolute timeout). They are incompatible with
| timeout_add(9), timeout_add_sec(9), timeout_add_msec(9),
| timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9). They
| can be cancelled with timeout_del(9) or timeout_del_barrier(9).
| Documentation for the new interfaces is a work in progress.
| For now, tick-based timeouts remain supported alongside kclock
| timeouts. They will remain supported until we are certain we don't
| need them anymore. It is possible we will never remove them. I would
| rather not keep them around forever, but I cannot predict what
| difficulties we will encounter while converting tick-based timeouts to
| kclock timeouts. There are a *lot* of timeouts in the kernel.
| Kclock timeouts are more costly than tick-based timeouts:
| - Calling timeout_in_nsec() incurs a call to nanouptime(9). Reading
|   the hardware timecounter is too expensive in some contexts, so care
|   must be taken when converting existing timeouts. We may add a flag
|   in the future to cause timeout_in_nsec() to use getnanouptime(9)
|   instead of nanouptime(9), which is much cheaper. This may be
|   appropriate for certain classes of timeouts. tcp/ip session
|   timeouts come to mind.
| - Kclock timeout expirations are kept in a timespec. Timespec
|   arithmetic has more overhead than 32-bit tick arithmetic, so
|   processing kclock timeouts during softclock() is more expensive.
|   On my machine the overhead for processing a tick-based timeout is
|   ~125 cycles. The overhead for a kclock timeout is ~500 cycles. The
|   overhead difference on 32-bit platforms is unknown. If it proves too
|   large we may need to use a 64-bit value to store the expiration
|   time. More measurement is needed.
| Priority targets for conversion are setitimer(2), *sleep_nsec(9), and
| the kevent(2) EVFILT_TIMER timers. Others will follow.
| With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@,
| deraadt@, probably many others. Older version tested by visa@.
| Problems found in older version by bluhm@. Current version tested by
| Yuichiro Naito.
| "wait until after unlock" deraadt@, ok kettenis@
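Since the documentation for these interfaces is still a work in progress, here is a hedged usage sketch built from the names above; the exact parameter order of timeout_set_kclock() is an assumption:

    struct timeout example_to;

    void
    example_expired(void *arg)
    {
            /* Runs from softclock() once the uptime clock reaches
             * to_abstime; never early, regardless of hz(9) or NTP. */
    }

    void
    example_arm(void)
    {
            timeout_set_kclock(&example_to, example_expired, NULL,
                KCLOCK_UPTIME, 0);
            /* Relative: expire ~250ms from now on the uptime clock. */
            timeout_in_nsec(&example_to, 250000000ULL);
    }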
* _exit(2), execve(2): tweak per-process interval timer cancellation (cheloha, 2020-10-15, 3 files, -12/+10)
| If we fold the for-loop iterating over each interval timer into the
| helper function, the result is slightly tidier than what we have now.
| Rename the helper function "cancel_all_itimers".
| Based on input from millert@ and kettenis@.
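The folded helper presumably ends up shaped roughly like this (a sketch; it leans on the cancelitimer() helper from the commit below):

    void
    cancel_all_itimers(void)
    {
            int i;

            /* ps_timer holds ITIMER_REAL/VIRTUAL/PROF state. */
            for (i = 0; i < nitems(curproc->p_p->ps_timer); i++)
                    cancelitimer(i);
    }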
* Stop asserting that the NET_LOCK() shouldn't be held in yield(). (mpi, 2020-10-15, 1 file, -3/+1)
| This creates too many false positives when setting pool_debug=2.
| Prodded by deraadt@, ok mvs@
* _exit(2), execve(2): cancel per-process interval timers safely (cheloha, 2020-10-15, 3 files, -12/+21)
| During _exit(2) and sometimes during execve(2) we need to cancel any
| active per-process interval timers. We don't currently do this in an
| MP-safe way. Both syscalls ignore the locking assumptions documented
| in proc.h.
| The easiest way to make them MP-safe is to use setitimer(), just like
| the getitimer(2) and setitimer(2) syscalls do. To make things a bit
| cleaner I have added a helper function, cancelitimer(), so the callers
| don't need to fuss with an itimerval struct.
| While we're here we can remove the splclock/splx dance from execve(2).
| It is no longer necessary.
| ok deraadt@
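A sketch of what cancelitimer() plausibly looks like given that description (the signature is an assumption):

    void
    cancelitimer(int which)
    {
            struct itimerval itv;

            timerclear(&itv.it_value);
            timerclear(&itv.it_interval);
            /* Reuse the kernel setitimer() subroutine: a zeroed
             * it_value disarms the timer under the proper locks. */
            setitimer(which, &itv, NULL);
    }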
* delete strange historical FFS_SOFTUPDATES ifdef... (deraadt, 2020-10-14, 1 file, -4/+4)
| | | | ok millert
* setitimer(2): zero itv.it_interval if itv.it_value is zero (cheloha, 2020-10-13, 1 file, -1/+3)
| If itv.it_value is zero we cancel the timer. When we cancel the timer
| we don't care about itv.it_interval because the timer is not running:
| we don't use it, we don't look at it, etc.
| To be on the paranoid side, I think we should zero itv.it_interval
| when itv.it_value is zero. No need to write arbitrary values into the
| process struct if we aren't required to.
| The standard is ambiguous about what should happen in this case, i.e.
| the value of olditv after the following code executes is unspecified:
|
|     struct itimerval newitv, olditv;
|
|     newitv.it_value.tv_sec = newitv.it_value.tv_usec = 0;
|     newitv.it_interval.tv_sec = newitv.it_interval.tv_usec = 1;
|     setitimer(ITIMER_REAL, &newitv, NULL);
|     getitimer(ITIMER_REAL, &olditv);
|
| This change should not break any real code.
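The kernel-side paranoia amounts to a couple of lines, roughly (a sketch; the variable name is illustrative):

    /* Cancelling: don't let a stale interval linger in the
     * process struct. */
    if (!timerisset(&aitv.it_value))
            timerclear(&aitv.it_interval);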
* setitimer(2): realitexpire(): call getnanouptime(9) once (cheloha, 2020-10-13, 1 file, -16/+15)
| timespecadd(3) is fast. There is no need to call getnanouptime(9)
| repeatedly when searching for the next expiration point. Given that
| it_interval is at least 1/hz, we expect to run through the loop maybe
| hz times at most. Even at HZ=10000 that's pretty brief.
| While we're here, pull *all* of the other logic out of the loop.
| The only thing we need to do in the loop is timespecadd(3).
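A sketch of the reworked search loop (variable names assumed): one timestamp up front, then pure timespec arithmetic until the next expiration lies in the future:

    struct timespec now;

    getnanouptime(&now);
    while (timespeccmp(&tp->it_value, &now, <=))
            timespecadd(&tp->it_value, &tp->it_interval, &tp->it_value);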
* Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct". (mpi, 2020-10-11, 1 file, -27/+51)
| The struct keeps track of the end point of an event queue scan by
| persisting the end marker. This will be needed when kqueue_scan() is
| called repeatedly to complete a scan in a piecewise fashion.
| Extracted from a previous diff from visa@.
| ok visa@, anton@
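The context struct presumably carries at least the kqueue and the persisted end marker, along these lines (field names are guesses, not the committed layout):

    struct kqueue_scan_state {
            struct kqueue   *kqs_kq;        /* kqueue being scanned */
            struct knote     kqs_end;       /* end marker, kept across
                                               piecewise calls */
            int              kqs_queued;    /* nonzero while the marker
                                               sits on the queue */
    };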
* sys_getitimer(), sys_setitimer(): style(9), misc. cleanup (cheloha, 2020-10-07, 1 file, -21/+15)
| - Consolidate variable declarations.
| - Remove superfluous parentheses from return statements.
| - Prefer sizeof(variable) to sizeof(type) for copyin(9)/copyout(9).
| - Remove some intermediate pointers from sys_setitimer(). Using
|   SCARG() directly here makes it more obvious to the reader what
|   you're copying.
* getitimer(2), setitimer(2): ITIMER_REAL: call getnanouptime(9) once (cheloha, 2020-10-07, 1 file, -6/+5)
| | | | | | | Now that the critical sections are merged we should call getnanouptime(9) once. This makes an ITIMER_REAL timer swap atomic with respect to the clock: the time remaining on the old timer is computed with the same timestamp used to schedule the new timer.
* getitimer(2), setitimer(2): merge critical sections (cheloha, 2020-10-07, 1 file, -59/+69)
| Merge the common code from sys_getitimer() and sys_setitimer() into a
| new kernel subroutine, setitimer(). setitimer() performs all of the
| error-free work for both system calls within a single critical
| section. We need a single critical section to make the setitimer(2)
| timer swap operation atomic relative to realitexpire() and
| hardclock(9).
| The downside of the new atomicity is that the behavior of setitimer(2)
| must change. With a single critical section we can no longer
| copyout(9) the old timer before installing the new timer. So if
| SCARG(uap, oitv) points to invalid memory, setitimer(2) now fails with
| EFAULT but the new timer will be left running. You can see this in
| action with code like the following:
|
|     struct itimerval itv, olditv;
|
|     itv.it_value.tv_sec = 1;
|     itv.it_value.tv_usec = 0;
|     itv.it_interval = itv.it_value;
|     /* This should EFAULT. 0x1 is probably an invalid address. */
|     if (setitimer(ITIMER_REAL, &itv, (void *)0x1) == -1)
|             warn("setitimer");
|     /* The timer will be running anyway. */
|     getitimer(ITIMER_REAL, &olditv);
|     printf("time left: %lld.%06ld\n", olditv.it_value.tv_sec,
|         olditv.it_value.tv_usec);
|
| There is no easy way to work around this. Both FreeBSD's and Linux's
| setitimer(2) implementations have a single critical section and they
| too fail with EFAULT in this case and leave the new timer running.
| I imagine their developers decided that fixing this error case was a
| waste of effort. Without permitting copyout(9) from within a mutex
| I'm not sure it is even possible to avoid it on OpenBSD without
| sacrificing atomicity during a setitimer(2) timer swap.
| Given the rarity of this error case I would rather have an atomic
| swap.
| Behavior change discussed with deraadt@.
* Document that `a_p' is always curproc by using a KASSERT(). (mpi, 2020-10-07, 1 file, -1/+12)
| One exception to this rule is VOP_CLOSE(), where NULL is used instead
| of curproc when the garbage collector of unix sockets, which runs in a
| kernel thread, drops the last reference of a file.
| This will allow for future simplifications of the VFS interfaces.
| Previous version ok visa@, anton@.
| ok kn@
* Fix write hang-up on file system on vnd. (asou, 2020-10-05, 1 file, -1/+6)
| | | | ok beck@
* expose timeval/timespec from system calls into ktrace, before determining if they are out of range (deraadt, 2020-10-02, 3 files, -9/+36)
| This makes it easier to isolate the reason for EINVAL.
| ok cheloha
* Move the solock() call outside of solisten(). (claudio, 2020-09-29, 2 files, -9/+9)
| The reason is that the so_state and splice checks were done without
| the proper lock, which is incorrect. This is similar to sobind() and
| soconnect(), which must also be called with the socket lock held.
| Found by, with and OK mvs@, OK mpi@
* Remove the PR_WAITOK flag from the ucred_pool. (kettenis, 2020-09-26, 1 file, -2/+2)
| The pool items are small enough that this pool uses the single page
| allocator, for which PR_WAITOK is a no-op. However, its presence
| suggests that pool_put(9) may sleep. The single page allocator will
| never actually do that. This makes it obvious that refreshcreds()
| will not sleep.
| ok deraadt@, visa@
* setpriority(2): don't treat booleans as scalars (cheloha, 2020-09-25, 1 file, -5/+5)
| The variable "found" in sys_setpriority() is used as a boolean. We
| should set it to 1 to indicate that we found the object we were
| looking for instead of incrementing it.
| deraadt@ notes that the current code is not buggy, because OpenBSD
| cannot support anywhere near 2^32 processes, but agrees that
| incrementing the variable signals the wrong thing to the reader.
| ok millert@ deraadt@
* timeout(9): timeout_run(): read to_process before leaving timeout_mutex (cheloha, 2020-09-22, 1 file, -2/+4)
| to_process is assigned during timeout_add(9) within timeout_mutex.
| In timeout_run() we need to read to_process before leaving
| timeout_mutex to ensure that the process pointer given to
| kcov_remote_enter(9) is the same as the one we set from timeout_add(9)
| when the candidate timeout was originally scheduled to run.
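A simplified sketch of the ordering fix inside timeout_run(), with surrounding bookkeeping elided:

    /* Still holding timeout_mutex here. */
    void (*fn)(void *) = to->to_func;
    void *arg = to->to_arg;
    struct process *kcov_pr = to->to_process;   /* read before unlock */

    mtx_leave(&timeout_mutex);
    kcov_remote_enter(KCOV_REMOTE_COMMON, kcov_pr);
    fn(arg);
    kcov_remote_leave(KCOV_REMOTE_COMMON, kcov_pr);
    mtx_enter(&timeout_mutex);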
* Move duplicated code to send an uncatchable SIGABRT into a function. (mpi, 2020-09-16, 3 files, -15/+22)
| | | | ok claudio@
* put HW_PHYSMEM64 case under CTL_HW not CTL_KERN (jsg, 2020-09-16, 1 file, -2/+2)
| | | | Fixes previous. Problem spotted by kettenis@
* As discovered by kettenis, recent mesa wants sysctl hw.physmem64, and in pledged programs that is unfortunate. (deraadt, 2020-09-16, 1 file, -4/+2)
| My snark levels are a bit drained, but I must say I'm always
| disappointed when programs operating on virtual resources enquire
| about total physical resource availability; the only reason to ask is
| so they can act unfairly relative to others in the shared environment.
| SIGH.
* timecounting: provide a naptime variable for userspace via kvm_read(3) (cheloha, 2020-09-16, 1 file, -5/+7)
| vmstat(8) uses kvm_read(3) to extract the naptime from the kernel.
| Problem is, I deleted `naptime' from the global namespace when I
| moved it into the timehands. This patch restores it. It gets updated
| from tc_windup(). Only userspace should use it, and only when the
| kernel is dead.
| We need to tweak a variable in tc_setclock() to avoid shadowing the
| (once again) global naptime.
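A userspace sketch of the kvm_read(3) path vmstat(8) takes on a dead kernel; the symbol name and the time_t type of the restored global are assumptions:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <nlist.h>
    #include <stdio.h>

    int
    main(void)
    {
            char errbuf[_POSIX2_LINE_MAX];
            struct nlist nl[] = { { "_naptime" }, { NULL } };
            time_t naptime;
            kvm_t *kd;

            kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
            if (kd == NULL || kvm_nlist(kd, nl) != 0)
                    return 1;
            /* Pull the global straight out of the kernel image. */
            if (kvm_read(kd, nl[0].n_value, &naptime,
                sizeof(naptime)) != sizeof(naptime))
                    return 1;
            printf("naptime: %lld\n", (long long)naptime);
            kvm_close(kd);
            return 0;
    }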
* add three static probes for vfs: cleaner, bufcache_take and bufcache_rel. (jasper, 2020-09-14, 1 file, -2/+12)
| while here, swap two lines in bufcache_release() to put a KASSERT()
| first, following the pattern in bufcache_take()
| ok beck@ mpi@
* Unbreak tree. Instead of passing struct process to siginit(), just pass the struct sigacts, since that is the only thing siginit() modifies. (claudio, 2020-09-13, 1 file, -3/+2)
* Grab the KERNEL_LOCK in ktrpsig() before calling ktrwrite(). (claudio, 2020-09-13, 1 file, -1/+3)
| Another little step towards moving signal delivery outside of
| KERNEL_LOCK.
| OK mpi@
* Initialize sigacts0 before making it visible by setting ps->ps_sigacts. (claudio, 2020-09-13, 1 file, -2/+2)
| | | | OK mpi@
* Add a NULL check in bufbackoff so we don't die when passed a NULL pmem range. (beck, 2020-09-12, 1 file, -2/+2)
| | | | Noticed by, and based on a diff from Mike Small <smallm@sdf.org>.
* Introduce a helper to check if a signal is ignored or masked by a thread. (mpi, 2020-09-09, 3 files, -11/+23)
| | | | ok claudio@, pirofti@
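The helper is presumably along these lines, combining the per-process ignore list with the per-thread mask (the name and exact shape are assumptions; the message doesn't spell them out):

    int
    sigismasked(struct proc *p, int signum)
    {
            struct process *pr = p->p_p;

            return ((pr->ps_sigacts->ps_sigignore & sigmask(signum)) ||
                (p->p_sigmask & sigmask(signum)));
    }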
* Remove unused sysctl_int_arr(9) (gnezdo, 2020-09-01, 1 file, -15/+1)
|