path: root/sys/kern
Commit message (Collapse)AuthorAgeFilesLines
...
* if waitok flag is set, have the interrupt multipage allocator redirecttedu2019-02-101-1/+9
| | | | to the not interrupt allocator.
* make it possible to reduce kmem pressure by letting some pools use a moretedu2019-02-102-28/+18
| | | | | | | | | | accommodating allocator. an interrupt safe pool may also be used in process context, as indicated by waitok flags. thanks to the garbage collector, we can always free pages in process context. the only complication is where to put the pages. solve this by saving the allocation flags in the pool page header so the free function can examine them. not actually used in this diff. (coming soon.) arm testing and compile fixes from phessler
* Fix stack info leak in execve(2). There are 2x4 bytes of paddingbluhm2019-02-081-1/+3
| | | | | in struct ps_strings. from NetBSD; OK deraadt@ guenther@ visa@
* Add lock stack trace saving for witness(4).visa2019-02-071-3/+157
| This lets witness(4) save a stack trace on each lock acquisition.
| The saved traces can be viewed in ddb(4) when showing the currently
| held locks, which may help when debugging incorrect locking.
|
| Sample output:
|
| ddb{0}> show all locks
| Process 63836 (rm) thread 0xffff8000221e52c8 (435004)
| exclusive rrwlock inode r = 0 (0xfffffd8119a092c0) locked @ /usr/src/sys/ufs/ufs/ufs_vnops.c:1547
| #0  witness_lock+0x419
| #1  _rw_enter+0x2bb
| #2  _rrw_enter+0x42
| #3  VOP_LOCK+0x3f
| #4  vn_lock+0x36
| #5  vfs_lookup+0xa1
| #6  namei+0x2b3
| #7  dounlinkat+0x85
| #8  syscall+0x338
| #9  Xsyscall+0x128
| exclusive kernel_lock &kernel_lock r = 1 (0xffffffff81e6a5f0) locked @ /usr/src/sys/arch/amd64/amd64/intr.c:525
| #0  witness_lock+0x419
| #1  syscall+0x2b6
| #2  Xsyscall+0x128
|
| The saving adds overhead, so it is not enabled by default. It can be
| taken into use by setting sysctl kern.witness.locktrace=1 at runtime
| or by defining WITNESS_LOCKTRACE in the kernel configuration.
|
| Feedback and OK anton@
* Use ktrreltimespec() as the timeout is relative, pointed by matthew@.mpi2019-02-061-2/+2
| | | | ok cheloha@
* Avoid an mbuf double free in the oob soreceive() path. In thebluhm2019-02-041-6/+6
| | | | | | | | | | usrreq functions move the mbuf m_freem() logic to the release block instead of distributing it over the switch statement. Then the goto release in the initial check, whether the pcb still exists, will not free the mbuf for the PRU_RCVD, PRU_RCVOOB, PRU_SENSE commands. OK claudio@ mpi@ visa@ Reported-by: syzbot+8e7997d4036ae523c79c@syzkaller.appspotmail.com
* Make callers of witness_lock_list_{get,free}() responsible for raisingvisa2019-02-041-11/+11
| | | | | | | the system priority level to IPL_HIGH. This simplifies the code a bit relative to calling from witness_lock() and witness_unlock(). OK mpi@
* When freeing the sem_undo structure in semundo_adjust(), update theanton2019-02-041-1/+2
| | | | | | | | | | caller supplied pointer. Otherwise, the caller is left with a dangling pointer that could lead to a use-after-free panic. ok millert@ visa@ Reported-by: syzbot+ac1d7685deab53b95ace@syzkaller.appspotmail.com Reported-by: syzbot+dbe8f002f8051f26f6fe@syzkaller.appspotmail.com
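The dangling-pointer fix described above follows a common C pattern: a function that may free an object takes a pointer to the caller's pointer so it can clear it. The sketch below is a minimal illustration of that pattern only. `struct undo` and `undo_release_entry()` are hypothetical names, not the actual semundo_adjust() code.

```c
#include <stdlib.h>

/* Hypothetical record with a count of live entries. */
struct undo {
	int entries;
};

/*
 * Drop one entry from the record. When the last entry goes away the
 * record is freed and the caller's pointer is set to NULL, so the
 * caller cannot keep using freed memory by accident.
 */
static void
undo_release_entry(struct undo **up)
{
	struct undo *u = *up;

	if (u == NULL)
		return;
	if (--u->entries == 0) {
		free(u);
		*up = NULL;	/* update the caller supplied pointer */
	}
}
```

A caller that checks its pointer for NULL after the call then sees the free, instead of dereferencing stale memory.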
* make m_pullup use the first mbuf with data to measure alignment.dlg2019-02-011-44/+65
| | | | | | | | this fixes an issue found by a regress test on sparc64 by claudio, and between us took about half a day of work to understand and fix at a2k19. ok claudio@
* matthew noticed that some clocks use tfind() which is not mpsafe.tedu2019-01-311-10/+20
| | | | | add locking in clock_gettime where needed. ok cheloha matthew
* tc_setclock: Don't rewind the system uptime during resume/unhibernate.cheloha2019-01-311-1/+16
| | | | | | | | | | | | | | | When we come back from suspend/hibernate the BIOS/firmware/whatever can hand us *any* TOD, so we need to check that the given TOD doesn't set our boot offset backwards, breaking the monotonicity of e.g. CLOCK_MONOTONIC. This is trivial to do from the BIOS on most PCs before unhibernating. There might be other ways it can happen, accidentally or otherwise. This is a bit messy but it can be made prettier later with a "bintimecmp" macro or something like that. Problem confirmed by jmatthew@. "you are very likely right" deraadt@
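The monotonicity check described above can be sketched in simplified form. This is an illustration under stated assumptions, not the tc_setclock() code: time is modelled as plain nanosecond integers rather than bintimes, and `struct clock_state` and `resume_set_clock()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified clock state in nanoseconds. */
struct clock_state {
	int64_t boottime;	/* UTC time at boot */
	int64_t uptime;		/* time since boot; must never decrease */
};

/*
 * Apply a time of day reported by the firmware after resume.
 * Returns false and leaves the state untouched when the new value
 * would move the uptime backwards.
 */
static bool
resume_set_clock(struct clock_state *cs, int64_t tod)
{
	int64_t new_uptime = tod - cs->boottime;

	if (new_uptime < cs->uptime)
		return false;	/* refuse to rewind the uptime */
	cs->uptime = new_uptime;
	return true;
}
```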
* Replace hand rolled linked list with TAILQ. All made possible by the recentanton2019-01-301-133/+75
| | | | | | introduction of struct lockf_state. ok bluhm@ visa@
* Add a dedicated sysctl(2) node for witness(4).visa2019-01-292-2/+27
| | | | | | | | The new node contains the subsystem's main control variable, kern.witness.watch. It is aliased by the old name, kern.witnesswatch. The alias will be removed in the future. OK anton@ mpi@
* Simplify by using `spc' since we already have it, no behavior change.mpi2019-01-281-3/+2
|
* Stop accounting/updating priorities for Idle threads.mpi2019-01-281-1/+13
| | | | | | | | | | | | Idle threads are never placed on the runqueue so their priority doesn't matter. This fixes an accounting bug where top(1) would report a high CPU usage for Idle threads of secondary CPUs right after booting. That's because schedcpu() would give 100% CPU time to the Idle thread until "real" threads get scheduled on the corresponding CPU. Issue reported by bluhm@, ok visa@, kettenis@
* stop using capital letters in printf format strings; ok visa@anton2019-01-271-4/+4
|
* consistency tweaks to panic format strings; ok visa@anton2019-01-271-11/+11
|
* Parse altitude and ground speed values from the GGA & RMC NMEA messages,landry2019-01-261-8/+105
| and provide them as nmea(4) distance & velocity sensors.
|
| With my 'u-blox GNSS receiver' that gives:
| hw.sensors.nmea0.distance0=335.600 m (Altitude), OK
| hw.sensors.nmea0.velocity0=18.337 m/s (Ground speed), OK
|
| ok deraadt@
* Use memset() instead of bzero().visa2019-01-261-3/+3
|
* Tag the start of witness(4) output with prefix "witness:".visa2019-01-261-15/+18
| | | | | | This eases data extraction in syzkaller. Prompted by and OK anton@
* I am retiring my old email address; replace it with my OpenBSD one.millert2019-01-252-4/+4
|
* eliminate a ?: in witness mtx initializer by pushing the default onetedu2019-01-231-2/+2
| | | | | level up. ok guenther mpi visa
* Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kerncheloha2019-01-234-16/+14
|
* futimens(2), futimes(2), utimensat(2), utimes(2): Validate input at copyincheloha2019-01-231-15/+25
| | | | | | | | | | | | | | | | | | Currently we validate time input for all four of these syscalls in the workhorse function dovutimens(). This is bad because both futimes(2) and utimes(2) have input as timevals that need to be converted to timespecs. This multiplication can overflow to create a "valid" input, e.g. if tv_usec is equal to 2^61 (invalid value) on a platform with 64-bit longs, the resulting tv_nsec is equal to zero (valid value). This is also a bit wasteful. We acquire a vnode and do other work under KERNEL_LOCK only to release the vnode when the time input is invalid. So, duplicate a bit of code to validate the time inputs before we do any conversions or real VFS work. probably still ok tedu@ deraadt@
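The overflow in the example above can be reproduced in userland: 2^61 * 1000 = 125 * 2^64, which is zero modulo 2^64. The sketch below is illustrative only. `struct tv64` and `struct ts64` are hypothetical fixed-width stand-ins for struct timeval and struct timespec, and the wrapping conversion uses unsigned arithmetic so the demonstration itself avoids undefined behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for timeval/timespec with 64-bit fields. */
struct tv64 { int64_t sec; int64_t usec; };
struct ts64 { int64_t sec; int64_t nsec; };

/* A timeval is valid when its microseconds are in [0, 1000000). */
static bool
tv64_is_valid(const struct tv64 *tv)
{
	return tv->usec >= 0 && tv->usec < 1000000;
}

/* A timespec is valid when its nanoseconds are in [0, 1000000000). */
static bool
ts64_is_valid(const struct ts64 *ts)
{
	return ts->nsec >= 0 && ts->nsec < 1000000000;
}

/* Convert without validating first; the multiplication wraps mod 2^64. */
static struct ts64
tv64_to_ts64_wrapping(const struct tv64 *tv)
{
	struct ts64 ts;

	ts.sec = tv->sec;
	ts.nsec = (int64_t)((uint64_t)tv->usec * 1000u);
	return ts;
}

/* Validate the timeval before converting, as the commit proposes. */
static bool
tv64_to_ts64_checked(const struct tv64 *tv, struct ts64 *out)
{
	if (!tv64_is_valid(tv))
		return false;
	out->sec = tv->sec;
	out->nsec = tv->usec * 1000;	/* cannot overflow: usec < 10^6 */
	return true;
}
```

With usec set to 2^61, the wrapping conversion yields nsec == 0, which passes the timespec check, while the checked conversion rejects the input up front.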
* namei can return a null dvp on success. check this before access.tedu2019-01-221-3/+4
| | | | | | ok beck Reported-by: syzbot+cc59412ed8429450a1ae@syzkaller.appspotmail.com
* #ifdef video junk as required.deraadt2019-01-221-2/+4
|
* select(2), pselect(2), poll(2), ppoll(2): Support full timeout range.cheloha2019-01-211-63/+58
| | | | | | | | | | | | | | | | | | | | | | | Remove the arbitrary and undocumented 24hr limits for timeouts from these interfaces. To do so, loop tsleep(9) to chip away at timeouts larger than what tsleep(9) can handle in one call. Use timerisvalid(3)/timespecisvalid() for input validation instead of itimerfix()/timespecfix() to avoid the 100 million second upper bounds those functions introduce. POSIX requires support for timeouts of at least 31 days for select(2) and pselect(2), so these changes make our implementation more compliant. Other improvements here include better variable names for the time stuff and more consolidated timeout logic with less backwards goto jumping; the old names and jumps made dopselect() and doppoll() a bear to read. Naming improvements prompted by tedu@ in a prior patch for nanosleep(2). With input from deraadt@. Validation bug spotted by matthew@ in an earlier version. ok visa@
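The "loop to chip away at the timeout" idea above can be sketched as follows. This is a minimal illustration, not the dopselect()/doppoll() code: `MAX_SLEEP_SECS` is an assumed per-call cap chosen for the example, and `sleep_fn`, `sleep_full_timeout()`, and `fake_sleep()` are hypothetical names standing in for the real sleep primitive.

```c
#include <stdint.h>

/* Assumed cap on a single sleep call, in seconds (illustrative value). */
#define MAX_SLEEP_SECS 86400ULL

/* Hypothetical sleep primitive: returns nonzero when woken early. */
typedef int (*sleep_fn)(uint64_t secs, void *arg);

/*
 * Sleep for total_secs in chunks no larger than MAX_SLEEP_SECS.
 * Returns 1 if a chunk was interrupted by a wakeup, 0 if the whole
 * timeout elapsed.
 */
static int
sleep_full_timeout(uint64_t total_secs, sleep_fn do_sleep, void *arg)
{
	while (total_secs > 0) {
		uint64_t chunk = total_secs < MAX_SLEEP_SECS ?
		    total_secs : MAX_SLEEP_SECS;

		if (do_sleep(chunk, arg) != 0)
			return 1;
		total_secs -= chunk;
	}
	return 0;
}

/* Test double: records calls and requested time, never wakes early. */
struct sleep_record {
	unsigned calls;
	uint64_t slept;
};

static int
fake_sleep(uint64_t secs, void *arg)
{
	struct sleep_record *rec = arg;

	rec->calls++;
	rec->slept += secs;
	return 0;
}
```

With this cap, a timeout of 31 days plus 10 seconds is served by 32 calls to the sleep primitive: 31 full chunks and one 10 second remainder.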
* sometimes we don't call unveil_add, which means memory allocated by nameitedu2019-01-212-5/+8
| | | | | | doesn't get freed. move the free calls into the same function as namei. fixed bug report from Dariusz Sendkowski ok beck
* Add "video" promise.landry2019-01-211-1/+31
| | | | | | | Allows a subset of ioctls on video(4) devices, subset selected from video(1) and firefox webrtc implementation. ok semarie@ deraadt@
* Introduce a dedicated entry point data structure for file locks. This new dataanton2019-01-212-37/+129
| | | | | | | | | | | | structure allows for better tracking of pending lock operations which is essential in order to prevent a use-after-free once the underlying vnode is gone. Inspired by the lockf implementation in FreeBSD. ok visa@ Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com
* Serialize tc_windup() calls and modification of some timehands members.cheloha2019-01-201-4/+20
| | | | | | | | | | | | | | | | | | | | If a user thread from e.g. clock_settime(2) is in the midst of changing the boottime or calling tc_windup() when it is interrupted by hardclock(9), the timehands could be left in a damaged state. So protect tc_windup() calls with a mutex, timecounter_mtx. hardclock(9) merely attempts to enter the mutex instead of spinning because it cannot afford to wait around. In practice hardclock(9) will skip tc_windup() very rarely, and when it does skip there aren't any negative effects because the skip indicates that a user thread is already calling, or about to call, tc_windup() anyway. Based on FreeBSD r303387 and NetBSD sys/kern/kern_tc.c,v1.30 Discussed with mpi@ and visa@. Tons of nice technical detail about lockless reads from visa@. OK visa@
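The "attempt to enter instead of spinning" behavior above can be illustrated with a try-lock. The sketch uses a C11 atomic_flag in place of the kernel mutex, so it shows only the control flow, not the real timecounter_mtx or tc_windup(). All names here (`struct timehands_sketch`, `th_try_enter()`, `tick_windup()`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state guarded by a try-lock. */
struct timehands_sketch {
	atomic_flag busy;	/* set while someone is winding up */
	unsigned windups;	/* number of completed windups */
};

static void
th_init(struct timehands_sketch *th)
{
	atomic_flag_clear(&th->busy);
	th->windups = 0;
}

/* Try to take the lock once; never waits. */
static bool
th_try_enter(struct timehands_sketch *th)
{
	return !atomic_flag_test_and_set_explicit(&th->busy,
	    memory_order_acquire);
}

static void
th_leave(struct timehands_sketch *th)
{
	atomic_flag_clear_explicit(&th->busy, memory_order_release);
}

/*
 * Interrupt-side path: skip the windup when the lock is held,
 * because the holder is already doing, or about to do, the same work.
 */
static bool
tick_windup(struct timehands_sketch *th)
{
	if (!th_try_enter(th))
		return false;
	th->windups++;
	th_leave(th);
	return true;
}
```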
* Move boottime into the timehands.cheloha2019-01-193-28/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To protect the timehands we first need to protect the basis for all UTC time in the kernel: the boottime. Because the boottime can be changed at any time it needs to be versioned along with the other members of the timehands to enable safe lockless reads when using it for anything. So the global boottime timespec goes away and the static boottimebin becomes a member of the timehands. Instead of reading the global boottime you use one of two interfaces: binboottime(9) or microboottime(9). nanoboottime(9) can trivially be added later, though there are no consumers for it at the moment. This introduces one small change in behavior. We used to advance the reported boottime just before launching kernel threads from main(). This makes it look to userland like we "booted" moments before those threads were launched. Because there is no longer a boottime global we can no longer trivially do this from main(), so the boottime we report to userspace via e.g. kern.boottime will now reflect whatever the time was when we bootstrapped the timehands via inittodr(9). This is usually no more than a minute before the kernel threads are launched from main(). The prior behavior can be restored by adding a new interface to the timecounter layer in a future commit. Based on FreeBSD r303387. Discussed with mpi@ and visa@. ok visa@
* no need to KERNEL_LOCK before calling ktrstruct() anymore; ok mpi@ visa@cheloha2019-01-181-21/+6
|
* futex(2): validate relative timeout before sleeping.cheloha2019-01-181-1/+3
| | | | | | | | Linux does validation. Document this new failure case as an EINVAL, like Linux. "stop waiting" deraadt
* adjtime(2), settimeofday(2), clock_settime(2): validate inputcheloha2019-01-181-1/+8
| | | | | | | | | Add documentation for the new EINVAL cases for adjtime(2) and settimeofday(2). adjtime.2 docs ok schwarze@, settimeofday(2)/clock_settime(2) stuff ok tedu@, "stop waiting" deraadt@
* delete vmm(4) in i386pd2019-01-181-2/+2
| We will still be able to run i386 guests on amd64 vmm.
|
| Reasons to delete i386 vmm:
| - Been broken for a while, almost no one complained.
| - Had been falling out of sync from amd64 while it worked.
| - If your machine has vmx, you most probably can run amd64, so why not run that?
|
| ok deraadt@ mlarkin@
* Unveil fixes:beck2019-01-171-12/+39
| 1) Correctly notice covering unveil when using .. - fix crash noticed by visa@
| 2) Notice when v_mount is NULL to not crash when unveil vnodes are on a forcibly unmounted filesystem, noticed by yasuoka@
| 3) Add a flag to ni_data so that failures from unveil flag mismatches in covering unveils return the correct EACCES instead of ENOENT (noticed by brynet@)
|
| ok deraadt@
* backout previous; crashes near mountpoints it seemsderaadt2019-01-141-7/+4
|
* Fix unveil issue noticed by kn@ where unveil does not notice coveringbeck2019-01-141-4/+7
| | | | | unveil matches when .. is used correctly. Also adds regress based upon his test program for the same issue.
* syncderaadt2019-01-112-7/+7
|
* mincore() is a relic from the past, exposing physical machine informationderaadt2019-01-111-3/+2
| | | | | | | about shared resources which no program should see. only a few pieces of software use it, generally poorly thought out. they are being fixed, so mincore() can be deleted. ok guenther tedu jca sthen, others
* settime: Don't cancel ongoing adjtime(2) until after full permission checkscheloha2019-01-101-7/+6
| | | | ok jca@ visa@ guenther@ deraadt@
* Eliminate an else branch from m_extunref().visa2019-01-091-4/+4
| | | | OK millert@ bluhm@
* If the mbuf cluster in m_zero() is read only, propagate the M_ZEROIZEbluhm2019-01-081-2/+10
| | | | | | flag to the other references. Then the final m_free() will clear the memory. OK claudio@
* It is possible to call m_zero with a read-only cluster. In that case justclaudio2019-01-071-6/+3
| | | | | | | return. Hopefully the other reference holder has the M_ZEROIZE flag set as well. Triggered by syzkaller. OK deraadt@ visa@ Reported-by: syzbot+c578107d70008715d41f@syzkaller.appspotmail.com
* the pledge handling for access(2) of /var/run/ypbind.lock is artificiallyderaadt2019-01-061-2/+3
| | | | | | | tough (so that non-YP using developers don't break the tree for YP/LDAP users). This check failed to handle the newish RPATH+UNVEIL_INSPECT namei operation. discovered by florian, ok beck
* fold a bunch of similar sysctl cases into a switch.tedu2019-01-061-53/+43
| | | | ok deraadt mestre
* Clear ps_uvpcwd when we free ps_uvpaths. Fixes a crash seen by kn@ and mekettenis2019-01-061-1/+2
| | | | | | where ps_uvpcwd obviously contains a dangling pointer. ok deraadt@, krw@
* Fix unsafe use of ptsignal() in mi_switch().visa2019-01-064-22/+46
| | | | | | | | | | | | | | | | | | ptsignal() has to be called with the kernel lock held. As ensuring the locking in mi_switch() is not easy, and deferring the signaling using the task API is not possible because of lock order issues in mi_switch(), move the CPU time checking into a periodic timer where the kernel can be locked without issues. With this change, each process has a dedicated resource check timer. The timer gets activated only when a CPU time limit is set. Because the checking is not done as frequently as before, some precision is lost. Use of timers adapted from FreeBSD. OK tedu@ Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com
* Fix a collection of covering unveil bugs that prevent unveils of upperbeck2019-01-033-22/+38
| | | | | | level directories from working when you don't traverse into them starting from /. Most found by brynet@ and a few others. ok brynet@ deraadt@