path: root/sys/sys
Commit message | Author | Age | Files | Lines
...
* In case of failure, call sigexit() from trapsignal instead of sendsig(). | mpi | 2020-11-08 | 1 | -2/+2
| Simplify MD code and reduce the amount of recursion into the signal code, which helps when dealing with locks.
| ok cheloha@, deraadt@
* Convert ffs_sysctl to sysctl_bounded_args | gnezdo | 2020-11-07 | 1 | -2/+2
| Requires sysctl_bounded_arr branch to support sysctl_rdint. The read-only variables are marked by an empty range of [1, 0].
| OK millert@
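The empty-range convention mentioned in the sysctl_bounded_args commit can be illustrated with a small userland sketch. Everything here is hypothetical: `struct bounded_var`, `bounded_update()`, and the choice of EPERM and EINVAL as error codes are assumptions made for the example, not the kernel's sysctl_bounded_arr() interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical descriptor: one integer variable and its allowed range. */
struct bounded_var {
	int *var;
	int minimum;
	int maximum;
};

/*
 * Hypothetical helper, not the kernel API.
 * Copies the current value to *oldp when oldp is non-NULL, then stores
 * *newp when newp is non-NULL and the write is permitted.
 * An empty range (minimum > maximum), such as [1, 0], can never contain
 * a value, so it is used to mark the variable read-only.
 */
int
bounded_update(const struct bounded_var *bv, int *oldp, const int *newp)
{
	assert(bv != NULL && bv->var != NULL);

	if (oldp != NULL)
		*oldp = *bv->var;
	if (newp == NULL)
		return 0;			/* read-only request */
	if (bv->minimum > bv->maximum)
		return EPERM;			/* empty range: writes refused */
	if (*newp < bv->minimum || *newp > bv->maximum)
		return EINVAL;			/* outside [minimum, maximum] */
	*bv->var = *newp;
	return 0;
}
```

With this layout a table of variables needs no separate read-only flag: an entry declared with the range [1, 0] rejects every write while still answering reads.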
* Add feature to force the selection of source IP address | denis | 2020-10-29 | 1 | -2/+4
| Based/previous work on an idea from deraadt@
| Input from claudio@, djm@, deraadt@, sthen@
| OK deraadt@
* Serialize msgbuf access with a mutex. | visa | 2020-10-25 | 1 | -7/+13
| This introduces a system-wide mutex that serializes msgbuf operations. The mutex controls access to all modifiable fields of struct msgbuf. It also covers logsoftc.sc_state.
| To avoid adding extra lock order constraints that would affect use of printf(9), the code does not take new locks when the log mutex is held.
| The code assumes that there is at most one thread using logread(). This keeps the logic simple. If there was more than one reader, logread() might return the same data to different readers. Also, log wakeup might not be reliable with multiple threads.
| Tested in snaps for two weeks.
| OK mpi@
* timeout(9): basic support for kclock timeouts | cheloha | 2020-10-15 | 1 | -6/+30
| A kclock timeout is a timeout that expires at an absolute time on one of the kernel's clocks. A timeout's absolute expiration time is kept in a new member of the timeout struct, to_abstime. The timeout's kclock is set at initialization and is kept in another new member of the timeout struct, to_kclock.
| Kclock timeouts are desirable because they have nanosecond resolution, regardless of the value of hz(9). The timecounter subsystem is also inherently NTP-sensitive, so timeouts scheduled against the subsystem are NTP-sensitive. These two qualities guarantee that a kclock timeout will never expire early.
| Currently there is support for one kclock, KCLOCK_UPTIME (the uptime clock). Support for KCLOCK_RUNTIME (the runtime clock) and KCLOCK_UTC (the UTC clock) is planned for the future.
| Support for these additional kclocks will allow us to implement some of the POSIX interfaces OpenBSD is missing, e.g. clock_nanosleep() and timer_create(). We could also use it to provide proper absolute timeouts for e.g. pthread_mutex_timedlock(3).
| Kclock timeouts are initialized with timeout_set_kclock(). They can be scheduled with either timeout_in_nsec() (relative timeout) or timeout_at_ts() (absolute timeout). They are incompatible with timeout_add(9), timeout_add_sec(9), timeout_add_msec(9), timeout_add_usec(9), timeout_add_nsec(9), and timeout_add_tv(9). They can be cancelled with timeout_del(9) or timeout_del_barrier(9).
| Documentation for the new interfaces is a work in progress.
| For now, tick-based timeouts remain supported alongside kclock timeouts. They will remain supported until we are certain we don't need them anymore. It is possible we will never remove them. I would rather not keep them around forever, but I cannot predict what difficulties we will encounter while converting tick-based timeouts to kclock timeouts. There are a *lot* of timeouts in the kernel.
| Kclock timeouts are more costly than tick-based timeouts:
| - Calling timeout_in_nsec() incurs a call to nanouptime(9). Reading the hardware timecounter is too expensive in some contexts, so care must be taken when converting existing timeouts. We may add a flag in the future to cause timeout_in_nsec() to use getnanouptime(9) instead of nanouptime(9), which is much cheaper. This may be appropriate for certain classes of timeouts. tcp/ip session timeouts come to mind.
| - Kclock timeout expirations are kept in a timespec. Timespec arithmetic has more overhead than 32-bit tick arithmetic, so processing kclock timeouts during softclock() is more expensive. On my machine the overhead for processing a tick-based timeout is ~125 cycles. The overhead for a kclock timeout is ~500 cycles. The overhead difference on 32-bit platforms is unknown. If it proves too large we may need to use a 64-bit value to store the expiration time. More measurement is needed.
| Priority targets for conversion are setitimer(2), *sleep_nsec(9), and the kevent(2) EVFILT_TIMER timers. Others will follow.
| With input from mpi@, visa@, kettenis@, dlg@, guenther@, claudio@, deraadt@, probably many others.
| Older version tested by visa@. Problems found in older version by bluhm@. Current version tested by Yuichiro Naito.
| "wait until after unlock" deraadt@, ok kettenis@
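The remark that timespec arithmetic costs more than 32-bit tick arithmetic can be made concrete with a userland sketch. The helper names `abstime_after()`, `abstime_expired()`, and `ticks_expired()` are hypothetical and are not the kernel's timeout(9) code; they only show why an absolute timespec expiry needs a normalizing add and a two-field comparison, while a tick expiry needs a single wrap-safe subtraction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical helper: absolute expiry = now + nsecs, kept normalized. */
struct timespec
abstime_after(struct timespec now, uint64_t nsecs)
{
	struct timespec abs;

	assert(now.tv_nsec >= 0 && now.tv_nsec < NSEC_PER_SEC);
	abs.tv_sec = now.tv_sec + (time_t)(nsecs / NSEC_PER_SEC);
	abs.tv_nsec = now.tv_nsec + (long)(nsecs % NSEC_PER_SEC);
	if (abs.tv_nsec >= NSEC_PER_SEC) {
		/* Carry the overflowing nanoseconds into the seconds field. */
		abs.tv_sec++;
		abs.tv_nsec -= NSEC_PER_SEC;
	}
	return abs;
}

/* Absolute expiry check: two fields to compare, expired once now >= abs. */
bool
abstime_expired(struct timespec now, struct timespec abs)
{
	if (now.tv_sec != abs.tv_sec)
		return now.tv_sec > abs.tv_sec;
	return now.tv_nsec >= abs.tv_nsec;
}

/* Tick-based analogue: one subtraction that tolerates counter wraparound. */
bool
ticks_expired(uint32_t now, uint32_t expiry)
{
	return (int32_t)(now - expiry) >= 0;
}
```

Because the expiry is stored as an absolute time, a late wakeup never moves the deadline; the check simply compares the clock against the stored value again.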
* _exit(2), execve(2): tweak per-process interval timer cancellation | cheloha | 2020-10-15 | 1 | -2/+2
| If we fold the for-loop iterating over each interval timer into the helper function the result is slightly tidier than what we have now. Rename the helper function "cancel_all_itimers".
| Based on input from millert@ and kettenis@.
* sys/kernel.h: remove dead externs: tickfix, tickfixinterval, tickdelta, ... | cheloha | 2020-10-15 | 1 | -5/+1
| miod@ removed several time-related globals from the kernel with the commit "unifdef -d __HAVE_TIMECOUNTER" (see sys/kern/kern_clock.c v1.76). He neglected to remove their externs from sys/kernel.h, though. Remove the externs.
| With help from jsg@.
| ok jsg@
* _exit(2), execve(2): cancel per-process interval timers safely | cheloha | 2020-10-15 | 1 | -1/+2
| During _exit(2) and sometimes during execve(2) we need to cancel any active per-process interval timers. We don't currently do this in an MP-safe way. Both syscalls ignore the locking assumptions documented in proc.h.
| The easiest way to make them MP-safe is to use setitimer(), just like the getitimer(2) and setitimer(2) syscalls do. To make things a bit cleaner I have added a helper function, cancelitimer(), so the callers don't need to fuss with an itimerval struct.
| While we're here we can remove the splclock/splx dance from execve(2). It is no longer necessary.
| ok deraadt@
* Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct". | mpi | 2020-10-11 | 1 | -1/+12
| The struct keeps track of the end point of an event queue scan by persisting the end marker. This will be needed when kqueue_scan() is called repeatedly to complete a scan in a piecewise fashion.
| Extracted from a previous diff from visa@.
| ok visa@, anton@
* Returning a void expression is weird; ok kettenis@ daniel@ | otto | 2020-10-10 | 1 | -5/+5
|
* Fix mistypes within sys/smr.h | mvs | 2020-09-29 | 1 | -3/+3
| LIST_END -> SMR_LIST_END
| TAILQ_END -> SMR_TAILQ_END
| ok visa@
* KCOV_BUF_MAX_NMEMB is defined under _KERNEL in sys/kcov.h but only used | anton | 2020-09-26 | 1 | -3/+1
| in dev/kcov.c; therefore move it to dev/kcov.c.
* Move duplicated code to send an uncatchable SIGABRT into a function. | mpi | 2020-09-16 | 1 | -2/+2
| ok claudio@
* Document that `p_siglist' and `p_sigmask' are updated via atomics. | mpi | 2020-09-16 | 1 | -3/+3
| ok claudio@
* Fix comment, ktrace flags are per-process. | mpi | 2020-09-14 | 1 | -3/+3
|
* Unbreak tree. Instead of passing struct process to siginit() just pass the | claudio | 2020-09-13 | 1 | -2/+2
| struct sigacts since that is the only thing that is modified by siginit.
* Introduce a helper to check if a signal is ignored or masked by a thread. | mpi | 2020-09-09 | 1 | -1/+2
| ok claudio@, pirofti@
* Remove unused sysctl_int_arr(9) | gnezdo | 2020-09-01 | 1 | -2/+1
|
* crank to 6.8-beta | deraadt | 2020-08-31 | 1 | -3/+3
|
* Declare hw_{prod,serial,uuid,vendor,ver} in <sys/systm.h>. | visa | 2020-08-26 | 1 | -1/+7
| OK deraadt@, mpi@
* Annotate locking of ps_single. | visa | 2020-08-26 | 1 | -2/+3
| Prompted by mpi@
* Remove unused debug_syncprt, improve debug sysctl handling | kn | 2020-08-23 | 1 | -5/+1
| "syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.
| Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c the only visible difference between used and stub ctldebug structs in the debugvars[] array is their extern keyword, indicating that it is defined elsewhere.
| sys/sysctl.h declares all debugN members as extern upfront, but these declarations are not needed.
| Remove the unused debug sysctl, rename the only remaining one to something meaningful and remove forward declarations from /sys/sysctl.h; this way, adding new debug sysctls is a matter of adding extern and coming up with a name, which is nicer to read on its own and better to grep for.
| OK mpi
* Allow userland to use EVFILT_EXCEPT. | mpi | 2020-08-23 | 1 | -2/+2
| ok mvs@, visa@
* Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL | kn | 2020-08-22 | 1 | -4/+4
| Adding "debug.my-knob" sysctls is really helpful to select different code paths and/or log on demand during runtime without recompile, but as this code is under DEBUG, lots of other noise comes with it which is often undesired, at least when looking at specific subsystems only.
| Adding globals to the kernel and breaking into DDB to change them helps, but that does not work over SSH, hence the need for debug sysctls.
| Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general option for all of sysctl(2).
| OK gnezdo
* Remove an unnecessary field from struct msgbuf. | visa | 2020-08-18 | 1 | -2/+1
| OK mvs@
* Add sysctl_bounded_arr as a replacement for sysctl_int_arr | gnezdo | 2020-08-18 | 1 | -1/+15
| Design by deraadt@
| ok deraadt@
* struct process: annotate locking for getitimer(2), setitimer(2) | cheloha | 2020-08-11 | 1 | -3/+6
| The ITIMER_REAL itimerspec (ps_timer[0]) and timeout (ps_realit_to) are protected by the kernel lock. Annotate them with "K".
| The ITIMER_VIRTUAL and ITIMER_PROF itimerspecs (ps_timer[1], ps_timer[2]) are protected by itimer_mtx. Annotate them with "T", for "timer".
| With input from kettenis@ and anton@.
| ok kettenis@, anton@
* Remove now unused M_ACAST flag. | florian | 2020-08-08 | 1 | -5/+3
| Reminded by, input & OK jca
* timeout(9): remove unused interfaces: timeout_add_ts(9), timeout_add_bt(9) | cheloha | 2020-08-07 | 1 | -5/+1
| These two interfaces have been entirely unused since introduction. Remove them and thin the "timeout" namespace a bit.
| Discussed with mpi@ and ratchov@ almost a year ago, though I blocked the change at that time. Also discussed with visa@.
| ok visa@, mpi@
* Move range check inside sysctl_int_arr | gnezdo | 2020-08-01 | 1 | -2/+2
| Range violations are now consistently reported as EOPNOTSUPP. Previously they were mixed with ENOPROTOOPT.
| OK kn@
* Add support for remote coverage to kcov. Remote coverage is collected | anton | 2020-08-01 | 3 | -3/+21
| from threads other than the one currently having kcov enabled. A thread with kcov enabled occasionally delegates work to another thread, collecting coverage from such threads improves the ability of syzkaller to correlate side effects in the kernel caused by issuing a syscall.
| Remote coverage is divided into subsystems. The only supported subsystem right now collects coverage from scheduled tasks and timeouts on behalf of a kcov enabled thread. In order to make this work `struct task' and `struct timeout' must be extended with a new field keeping track of the process that scheduled the task/timeout. Both aforementioned structures have therefore increased with the size of a pointer on all architectures.
| The kernel API is documented in a new kcov_remote_register(9) manual.
| Remote coverage is also supported by kcov on NetBSD and Linux.
| ok mpi@
* timeout(9): remove TIMEOUT_SCHEDULED flag | cheloha | 2020-07-25 | 1 | -2/+1
| The TIMEOUT_SCHEDULED flag was added a few months ago to differentiate between wheel timeouts and new timeouts during softclock(). The distinction is useful when incrementing the "rescheduled" stat and the "late" stat.
| Now that we have an intermediate queue for new timeouts, timeout_new, we don't need the flag. The distinction between wheel timeouts and new timeouts can be made computationally.
| Suggested by procter@ several months ago.
* The "unsupported compiler" checks were added back in December when | daniel | 2020-07-21 | 2 | -13/+2
| MD versions of these headers were unhooked. As nothing has hit those checks we can drop them at this point.
| ok visa@ and "makes sense" to millert@
* POWER9 CPUs provide an energy sensor that accumulates the amount of energy | kettenis | 2020-07-15 | 1 | -1/+3
| used by the processor chip. Although we have a SENSOR_WATTHOUR sensor type its units are not really suitable for this sensor. So add a SENSOR_ENERGY type that uses micro Joules as its unit.
| ok deraadt@
* A pty write containing VDISCARD, VREPRINT, or various retyping cases of | deraadt | 2020-07-14 | 1 | -3/+3
| VERASE would perform (sometimes irrelevant) compute in the kernel which can be heavy (especially with our insufficient tty subsystem locking). Use tsleep_nsec for 1 tick in such circumstances to yield cpu, and also bring interruptibility to ptcwrite()
| https://syzkaller.appspot.com/bug?extid=462539bc18fef8fc26cc
| ok kettenis millert, discussions with greg and anton
* Add support for timecounting in userland. | pirofti | 2020-07-06 | 4 | -7/+39
| This diff exposes parts of clock_gettime(2) and gettimeofday(2) to userland via libc, freeing processes from the need for a context switch every time they want to count the passage of time.
| If a timecounter clock can be exposed to userland then it needs to set its tc_user member to a non-zero value. Tested with one or multiple counters per architecture.
| The timing data is shared through a pointer found in the new ELF auxiliary vector AUX_openbsd_timekeep containing timehands information that is frequently updated by the kernel.
| Timing differences between the last kernel update and the current time are adjusted in userland by the tc_get_timecount() function inside the MD usertc.c file.
| This permits a much more responsive environment, quite visible in browsers, office programs and gaming (apparently one is able to fly in Minecraft now).
| Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others!
| OK from at least kettenis@, cheloha@, naddy@, sthen@
* kstat does open, close, and ioctl. | dlg | 2020-07-06 | 1 | -1/+9
|
* add kstat(4), a subsystem to let the kernel expose statistics to userland. | dlg | 2020-07-06 | 1 | -0/+193
| a kstat is an arbitrary chunk of data that a part of the kernel wants to expose to userland. data could mean just a chunk of raw bytes, but generally a kernel subsystem will provide a series of kstat key/value chunks.
| this code is loosely modelled on kstat in solaris, but with a bunch of simplifications (we don't want to provide write support for example). the named or key/value structure is significantly richer in this version too. eg, solaris kstat named data supports integer types, but this version offers differentiation between counters (like the number of packets transmitted on an interface) and gauges (like how long the transmit queue is) and lets kernel providers say what the units are (eg, packets vs bytes vs cycles).
| the main motivation for this is to improve the visibility of what the kernel is doing while it's running. i wrote this as part of the recent work we've been doing on multiqueue and rss/toeplitz so i could verify that network load is actually spread across multiple rings on a single nic. without this we would be wasting memory and interrupt vectors on multiple rings and still just using the 1st one, and no one would know cos there's no way to see what rings are being used.
| another thing that can become visible is the different counters that various network cards provide. i'm particularly interested in seeing if packets get dropped because the rings aren't filled fully, which is an effect we've never really observed directly.
| a small part of wanting this is cos i spend an annoying amount of time instrumenting the kernel when hacking code in it. if most of the scaffolding for the instrumentation is already there, i can avoid repeatedly writing that code and save time.
| iterated a few times with claudio@ and deraadt@
* It's been agreed upon that global locks should be expressed using | anton | 2020-07-04 | 3 | -24/+24
| capital letters in locking annotations. Therefore harmonize the existing annotations.
| Also, if multiple locks are required they should be delimited using commas.
| ok mpi@
* Bring back revision 1.122 with a fix preventing a use-after-free by | anton | 2020-06-29 | 1 | -1/+4
| serializing calls to pipe_buffer_free(). Repeating the previous commit message:
| Instead of performing three distinct allocations per created pipe, reduce it to a single one. Not only should this be more performant, it also solves a kqueue related issue found by visa@ who also requested this change: if you attach an EVFILT_WRITE filter to a pipe fd, the knote gets added to the peer's klist. This is a problem for kqueue because if you close the peer's fd, the knote is left in the list whose head is about to be freed. knote_fdclose() is not able to clear the knote because it is not registered with the peer's fd.
| FreeBSD also takes a similar approach to pipe allocations.
| once again ok mpi@ visa@
* ipmi: add a matching kqfilter filter for `seltrue' as well, allowing us | sthen | 2020-06-29 | 1 | -3/+3
| to keep the behavior when switching poll(2) to use kqueue filters. From mpi@
* fix /dev/ipmi. conf.h r1.150 changed from enodev->selfalse for the poll | sthen | 2020-06-29 | 1 | -2/+2
| function but actually a 'true' value is needed; use seltrue instead.
| Problem reported, kernel bisected and diff tested by Jens A. Griepentrog.
| ok deraadt@ mpi@
* Add MID_POWERPC64. These identifiers are only used for kernel core dumps | kettenis | 2020-06-28 | 1 | -1/+2
| these days, so inventing our own numbers is fine.
| From drahn@
* timecounting: deprecate time_second(9), time_uptime(9) | cheloha | 2020-06-26 | 1 | -4/+1
| time_second(9) has been replaced in the kernel by gettime(9). time_uptime(9) has been replaced in the kernel by getuptime(9).
| New code should use the replacement interfaces. They do not suffer from the split-read problem inherent to the time_* variables on 32-bit platforms.
| The variables remain in sys/kern/kern_tc.c for use via kvm(3) when examining kernel core dumps.
| This commit completes the deprecation process:
| - Remove the extern'd definitions for time_second and time_uptime from sys/time.h.
| - Replace manpage cross-references to time_second(9)/time_uptime(9) with references to microtime(9) or a related interface.
| - Move the time_second.9 manpage to the attic.
| With input from dlg@, kettenis@, visa@, and tedu@.
| ok kettenis@
* add USEC_TO_TIMEVAL() | jsg | 2020-06-26 | 1 | -1/+8
| discussed with cheloha@
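The commit message does not show the new helper, so the following is only a guess at the kind of conversion its name implies: splitting a microsecond count into the seconds and microseconds fields of a struct timeval. The function name `usec_to_timeval()` and its signature are hypothetical; the real definition in sys/time.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Hypothetical stand-in for a microseconds-to-timeval conversion.
 * Whole seconds go to tv_sec, the remainder (always < 1000000) to tv_usec.
 */
static inline void
usec_to_timeval(uint64_t usecs, struct timeval *tv)
{
	assert(tv != NULL);
	tv->tv_sec = usecs / 1000000;
	tv->tv_usec = usecs % 1000000;
}
```

For instance, passing 2500000 fills the timeval with 2 seconds and 500000 microseconds.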
* add intrmap_one, some temp code to help us write pci_intr_establish_cpu. | dlg | 2020-06-23 | 1 | -1/+2
| it means we can do quick hacks to existing drivers to test interrupts on multiple cpus. emphasis on quick and hacks.
| ok jmatthew@, who will also ok the removal of it at the right time.
* timecounting: add gettime(9), getuptime(9) | cheloha | 2020-06-22 | 1 | -1/+4
| time_second and time_uptime are used widely in the tree. This is a problem on 32-bit platforms because time_t is 64-bit, so there is a potential split-read whenever they are used at or below IPL_CLOCK.
| Here are two replacement interfaces: gettime(9) and getuptime(9). The "get" prefix signifies that they do not read the hardware timecounter, i.e. they are fast and low-res. The lack of a unit (e.g. micro, nano) signifies that they yield a plain time_t.
| As an optimization on LP64 platforms we can just return time_second or time_uptime, as a single read is atomic. On 32-bit platforms we need to do the lockless read loop and get the values from the timecounter.
| In a subsequent diff these will be substituted for time_second and time_uptime almost everywhere in the kernel.
| With input from visa@ and dlg@.
| ok kettenis@
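The split-read problem and the lockless read loop mentioned in the gettime(9) commit can be modelled in userland with C11 atomics. This is a sketch under stated assumptions, not the kernel's implementation: `struct timesnap`, `snap_init()`, `snap_store()`, and `snap_load()` are hypothetical names, a single writer is assumed, and the 64-bit value is deliberately stored as two 32-bit halves to mimic a platform where it cannot be read in one access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot: a 64-bit time value kept as two 32-bit halves. */
struct timesnap {
	_Atomic uint32_t gen;	/* generation; 0 means an update is in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

void
snap_init(struct timesnap *ts, int64_t secs)
{
	assert(ts != NULL);
	atomic_init(&ts->lo, (uint32_t)secs);
	atomic_init(&ts->hi, (uint32_t)((uint64_t)secs >> 32));
	atomic_init(&ts->gen, 1);	/* nonzero: contents are valid */
}

/* Single writer: invalidate, update both halves, publish a new generation. */
void
snap_store(struct timesnap *ts, int64_t secs)
{
	uint32_t next = atomic_load(&ts->gen) + 1;

	if (next == 0)
		next = 1;		/* skip the reserved "in progress" value */
	atomic_store(&ts->gen, 0);
	atomic_store(&ts->lo, (uint32_t)secs);
	atomic_store(&ts->hi, (uint32_t)((uint64_t)secs >> 32));
	atomic_store(&ts->gen, next);
}

/* Lockless reader: retry until both halves come from the same generation. */
int64_t
snap_load(struct timesnap *ts)
{
	uint32_t gen, lo, hi;

	do {
		gen = atomic_load(&ts->gen);
		lo = atomic_load(&ts->lo);
		hi = atomic_load(&ts->hi);
	} while (gen == 0 || gen != atomic_load(&ts->gen));
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

The reader never blocks the writer; it only repeats its reads when an update overlapped them, which is why the approach suits values that are read far more often than they change.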
* Extend kqueue interface with EVFILT_EXCEPT filter. | mpi | 2020-06-22 | 1 | -1/+8
| This filter, already implemented in macOS and DragonFly BSD, returns exceptional conditions like the reception of out-of-band data.
| The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and it can be used by the kqfilter-based poll & select implementation.
| ok millert@ on a previous version, ok visa@
* let userland read vpd info from a pci device. | dlg | 2020-06-22 | 1 | -1/+9
| reading vpd stuff is useful when you're trying to get support information about a pci device, eg, if you want a serial number, or firmware versions, or specific part name or number, it's likely available via vpd. also, i'm sick of having the diff in my tree.
| the vpd info is not accessed as bytes read from a capability, but is read via a register in the capability. the same register also supports updating or writing vpd info, which sounds like a bad idea to let userland have raw access to. this adds an ioctl so that userland can ask the kernel to read via the vpd register on its behalf. this ensures that the only access is read access, and it's sanity checked.
| tested by hrvoje popovski on many devices.
| ok jmatthew@
* wireguard is taking over the gif mbuf tag. | dlg | 2020-06-21 | 1 | -4/+4
| gif used its mbuf tag to store its interface index so it could detect loops. gre also did this, and i cut most of the drivers (including gif) over to using the gre tag. so the gif tag is unused.
| wireguard uses the tag to store peer information between different contexts the packet is processed in. it also needs a bit more space to do that.
| from Matt Dunwoodie and Jason A. Donenfeld
| ok deraadt@