path: root/sys/kern/init_main.c
* Move initialization of limit0 into a dedicated function. This new
  function is also a proper place for setting up the plimit pool. While
  here, raise the IPL of the plimit pool to IPL_MPFLOOR, needed in
  upcoming MP work. OK claudio@ (visa, 2019-06-02, 1 file, -15/+2)
* Revert to using the SCHED_LOCK() to protect time accounting. It
  currently creates a lock ordering problem because SCHED_LOCK() is
  taken by hardclock(). That means the "priorities" of a thread should
  be moved out of the SCHED_LOCK() first in order to make progress.
  Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
  via anton@ as well as by kettenis@ (mpi, 2019-06-01, 1 file, -3/+1)
* Use a per-process mutex to protect time accounting instead of
  SCHED_LOCK(). Note that hardclock(9) still increments p_{u,s,i}ticks
  without holding a lock. ok visa@, cheloha@
  (mpi, 2019-05-31, 1 file, -1/+3)
* Rename struct plimit field p_refcnt to pl_refcnt to avoid confusion
  with the fields of struct proc. Make pl_refcnt unsigned for upcoming
  atomic updating. OK deraadt@ guenther@
  (visa, 2019-05-31, 1 file, -2/+2)
* Introduce safe memory reclamation, a mechanism for reclaiming shared
  objects that readers can access without locking. This provides a
  basis for read-copy-update operations.

  Readers access SMR-protected shared objects inside an SMR read-side
  critical section where sleeping is not allowed. To reclaim an
  SMR-protected object, the writer has to ensure mutual exclusion of
  other writers, remove the object's shared reference and wait until
  read-side references cannot exist any longer. As an alternative to
  waiting, the writer can schedule a callback that gets invoked when
  reclamation is safe. The mechanism relies on CPU quiescent states to
  determine when an SMR-protected object is ready for reclamation.

  The <sys/smr.h> header additionally provides an implementation of
  singly- and doubly-linked lists that can be used together with SMR.
  These lists allow lockless read access with a concurrent writer.

  Discussed with many. OK mpi@ sashan@
  (visa, 2019-02-26, 1 file, -1/+5)
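  [Editor's note: a minimal sketch of the read side and the deferred
  reclamation described above. Everything named ex_* is invented for
  illustration, and the <sys/smr.h> signatures are assumed from the
  smr_call(9) manual, not taken from this commit.]

      #include <sys/param.h>
      #include <sys/malloc.h>
      #include <sys/smr.h>

      struct ex_entry {
              SMR_LIST_ENTRY(ex_entry) ee_list;
              struct smr_entry ee_smr;
      };

      SMR_LIST_HEAD(ex_head, ex_entry);
      struct ex_head ex_list;         /* SMR_LIST_INIT() at attach time */

      void
      ex_reader(void)
      {
              struct ex_entry *e;

              smr_read_enter();       /* read-side critical section */
              SMR_LIST_FOREACH(e, &ex_list, ee_list) {
                      /* inspect e; sleeping is not allowed here */
              }
              smr_read_leave();
      }

      void
      ex_free(void *arg)
      {
              free(arg, M_TEMP, sizeof(struct ex_entry));
      }

      void
      ex_remove(struct ex_entry *e)
      {
              /* Caller ensures mutual exclusion of other writers. */
              SMR_LIST_REMOVE_LOCKED(e, ee_list);
              smr_init(&e->ee_smr);
              smr_call(&e->ee_smr, ex_free, e); /* freed when safe */
      }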
* Move boottime into the timehands. To protect the timehands we first
  need to protect the basis for all UTC time in the kernel: the
  boottime. Because the boottime can be changed at any time it needs
  to be versioned along with the other members of the timehands to
  enable safe lockless reads when using it for anything. So the global
  boottime timespec goes away and the static boottimebin becomes a
  member of the timehands. Instead of reading the global boottime you
  use one of two interfaces: binboottime(9) or microboottime(9).
  nanoboottime(9) can trivially be added later, though there are no
  consumers for it at the moment.

  This introduces one small change in behavior. We used to advance the
  reported boottime just before launching kernel threads from main().
  This makes it look to userland like we "booted" moments before those
  threads were launched. Because there is no longer a boottime global
  we can no longer trivially do this from main(), so the boottime we
  report to userspace via e.g. kern.boottime will now reflect whatever
  the time was when we bootstrapped the timehands via inittodr(9).
  This is usually no more than a minute before the kernel threads are
  launched from main(). The prior behavior can be restored by adding a
  new interface to the timecounter layer in a future commit.

  Based on FreeBSD r303387. Discussed with mpi@ and visa@. ok visa@
  (cheloha, 2019-01-19, 1 file, -4/+2)
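  [Editor's note: a small sketch of the new read path, assuming the
  microboottime(9) signature; the function below is hypothetical.]

      #include <sys/time.h>
      #include <sys/systm.h>

      void
      ex_report_boottime(void)
      {
              struct timeval bt;

              /* Versioned, lockless read of the boot time from the
               * timehands; this is what kern.boottime now reflects. */
              microboottime(&bt);
              printf("booted at %lld.%06ld\n",
                  (long long)bt.tv_sec, bt.tv_usec);
      }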
* copyright++; (jsg, 2019-01-01, 1 file, -2/+2)
* The if_cloners list is populated at boot time only and then becomes
  immutable, so we can let go of if_cloners_lock.
  OK tb@, claudio@, bluhm@, kn@, henning@
  (sashan, 2018-09-10, 1 file, -1/+7)
* Simplify the startup of the cleaner, reaper and update threads by
  passing the main function directly to kthread_create(9). The start_*
  functions are mere stepping stones nowadays and can be pruned. They
  used to contain more logic in the pre-kthread era.

  While here, set `cleanerproc' and `syncerproc' during the thread
  creation rather than expect the threads to set the proc pointer.
  Also, rename `sched_sync' to `syncer_thread' to reduce confusion
  with the scheduler-related functions.

  OK kettenis@, deraadt@, mpi@ (visa, 2018-08-13, 1 file, -28/+4)
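  [Editor's note: the pattern described above, sketched with the
  kthread_create(9) signature assumed; `example_thread' and friends
  are hypothetical.]

      #include <sys/systm.h>
      #include <sys/kthread.h>

      struct proc *exampleproc;

      void
      example_thread(void *arg)
      {
              for (;;) {
                      /* do the thread's periodic work, then sleep */
              }
      }

      void
      example_start(void)
      {
              /* Spawn the main function directly, no start_*
               * trampoline; the proc pointer is set at creation. */
              if (kthread_create(example_thread, NULL, &exampleproc,
                  "example"))
                      panic("fork example");
      }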
* Remove a few leftovers from the days of emulation, which could
  result in a bad/corrupt binary not returning ENOEXEC but some other
  error. ok guenther kettenis bluhm
  (deraadt, 2018-07-20, 1 file, -3/+2)
* Move from sendsig() to its callers the initsiginfo() calls and,
  instead of passing sendsig() the code+type+val, pass a siginfo_t* to
  copy from. Eliminate the indirection through struct emul for
  sendsig(); we no longer have a SunOS4-compat version of sendsig().
  ok deraadt@ (guenther, 2018-07-10, 1 file, -2/+1)
* Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
  curproc that does the locking or unlocking, so the proc parameter is
  pointless and can be dropped. OK mpi@, deraadt@
  (visa, 2018-04-28, 1 file, -2/+2)
* Implement MAP_STACK option for mmap(). Synchronous faults (pagefault
  and syscall) confirm the stack register points at MAP_STACK memory,
  otherwise SIGSEGV is delivered. sigaltstack() and
  pthread_attr_setstack() are modified to create a MAP_STACK
  sub-region which satisfies alignment requirements. Observe that
  MAP_STACK can only be set/cleared by mmap(), which zeroes the
  contents of the region -- there is no mprotect() equivalent
  operation, so there is no MAP_STACK-adding gadget.

  This opportunistic software-emulation of a stack protection bit
  makes stack-pivot operations during ROPchain fragile (kind of like
  removing a tool from the toolbox).

  original discussion with tedu, uvm work by stefan, testing by
  mortimer. ok kettenis (deraadt, 2018-04-12, 1 file, -2/+2)
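  [Editor's note: a minimal userland sketch of allocating a MAP_STACK
  region, e.g. to hand to pthread_attr_setstack(); uses the mmap(2)
  flags named above, with a hypothetical helper name.]

      #include <sys/mman.h>
      #include <stddef.h>

      void *
      alloc_stack(size_t len)
      {
              void *p;

              /* MAP_STACK marks the region as legitimate stack
               * memory; the kernel zeroes it, and only mmap() can
               * set or clear the bit. */
              p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);
              return (p == MAP_FAILED ? NULL : p);
      }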
* Do not panic from ddb(4) when a lock requirement isn't fulfilled.
  Extend the logic already present for panic() to any DDB-related
  operation such that if ddb(4) is entered because of a fault or other
  trap it is still possible to call 'boot reboot'. While here stop
  printing splassert() messages as well, to not fill the buffer.
  ok visa@, deraadt@ (mpi, 2018-03-20, 1 file, -1/+2)
* Revert the change that postpones CPUs until after mounthook
  activities. This was needed to be able to use loadfirmware() to load
  the microcode before letting the cores go. Now that the microcode is
  loaded earlier we can restore the previous behaviour. ok deraadt@
  (patrick, 2018-02-28, 1 file, -3/+3)
* Postpone secondary CPUs until after mounthook activities. This is
  useful for loading CPU microcode from the disk before the CPUs are
  let go. Tested by visa@ on sgi, loongson and octeon. "don't see
  immediate issues" kettenis@. ok deraadt@
  (patrick, 2018-01-11, 1 file, -3/+3)
* copyright++; (jsg, 2018-01-01, 1 file, -2/+2)
* Load CTF debug symbols before mountroot. This is obviously useful
  in order to investigate a failure to mount an NFS or other root
  device. ok mpi (uwe, 2017-08-14, 1 file, -5/+6)
* Merge DDBCTF into DDB. (mpi, 2017-08-11, 1 file, -2/+2)
* Add futex(2) syscall based on a sane subset of its Linux equivalent.
  The syscall is marked NOLOCK and only FUTEX_WAIT grabs the
  KERNEL_LOCK() because of PCATCH and the signal nightmare.
  Serialization of threads is currently done with a global & exclusive
  rwlock. Note that the current implementation still uses copyin(9),
  which is not guaranteed to be atomic. Committing now such that
  remaining issues can be addressed in-tree. With inputs from
  guenther@, kettenis@ and visa@. ok deraadt@, visa@
  (mpi, 2017-04-28, 1 file, -1/+7)
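  [Editor's note: a sketch of the wait/wake subset from userland,
  assuming the futex(2) prototype futex(volatile uint32_t *, int, int,
  const struct timespec *, volatile uint32_t *); names are invented.]

      #include <sys/futex.h>
      #include <stdint.h>

      volatile uint32_t word = 0;

      void
      waiter(void)
      {
              /* Sleep only while the word is still 0; the kernel
               * rechecks the value, guarding against lost wakeups. */
              while (word == 0)
                      futex(&word, FUTEX_WAIT, 0, NULL, NULL);
      }

      void
      waker(void)
      {
              word = 1;
              futex(&word, FUTEX_WAKE, 1, NULL, NULL);
      }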
* Add a port of witness(4) lock validation tool from FreeBSD.
  Go-ahead from kettenis@, guenther@, deraadt@
  (visa, 2017-04-20, 1 file, -1/+4)
* domaininit() doesn't need splnet(). At this stage the scheduler
  isn't set up, which means the 'softnet' isn't running yet, so input
  packets aren't processed. Prodded by a question from guenther@,
  ok bluhm@ (mpi, 2017-03-06, 1 file, -6/+3)
* Split up fork1():

  - FORK_THREAD handling is a totally separate function,
    thread_fork(), that is only used by sys___tfork() and which loses
    the flags, func, arg, and newprocp parameters and gains a tcb
    parameter to guarantee the new thread's TCB is set before the
    creating thread returns
  - fork1() loses its stack and tidptr parameters

  Common bits factor out:
  - struct proc allocation and initialization moves to thread_new()
  - maxthread handling moves to fork_check_maxthread()
  - setting the new thread running moves to fork_thread_start()

  The MD cpu_fork() function swaps its unused stacksize parameter for
  a tcb parameter.

  luna88k testing by aoyama@, alpha testing by dlg@. ok mpi@
  (guenther, 2017-02-12, 1 file, -3/+2)
* p_comm is the process's command and isn't per thread, so move it
  from struct proc to struct process. ok deraadt@ kettenis@
  (guenther, 2017-01-21, 1 file, -2/+2)
* copyright++; (jsg, 2017-01-01, 1 file, -2/+2)
* Automatically create a default lo(4) interface per rdomain. In order
  to stop abusing lo0 for all rdomains, a new loopback interface will
  be created every time a rdomain is created. The unit number will be
  the same as the rdomain, i.e. lo1 will be attached to rdomain 1. If
  this loopback interface is already in use it won't be possible to
  create the corresponding rdomain.

  In order to know which lo(4) interface is attached to a rdomain, its
  index is stored in the rtable/rdomain map.

  This is long overdue since the introduction of rtable/rdomain. It
  also fixes a recent regression due to resetting the rdomain of an
  incoming packet reported by semarie@, Andreas Bartelt and Nils
  Frohberg. ok claudio@ (mpi, 2016-11-14, 1 file, -3/+4)
* Split PID from TID, giving processes a PID unrelated to the TID of
  their initial thread. ok jsing@ kettenis@
  (guenther, 2016-11-07, 1 file, -2/+3)
* Move the mbstat structure to percpu counters. Each cpu's counters
  still have to be protected by splnet, but this is better than a
  single set of counters protected by a global mutex. ok bluhm@
  (dlg, 2016-10-24, 1 file, -1/+3)
* Add generalised access to per-cpu data structures and counters. Both
  the cpumem and counters APIs simply allocate memory for each cpu in
  the system that can be used for arbitrary per-cpu data (via cpumem),
  or a versioned set of counters per cpu (counters).

  There is an alternate backend for uniprocessor systems that
  basically turns the per-cpu data access into an immediate access to
  a single allocation.

  There is also support for per-cpu data structures that are available
  at boot time by providing an allocation for the boot cpu. After
  autoconf, these allocations have to be resized to provide for all
  cpus that were enumerated by boot.

  ok mpi@ (dlg, 2016-10-21, 1 file, -1/+5)
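  [Editor's note: a sketch of the counters side with hypothetical
  ex_* names; signatures assumed from percpu(9).]

      #include <sys/types.h>
      #include <sys/percpu.h>

      enum { EX_PACKETS, EX_ERRORS, EX_NCOUNTERS };

      struct cpumem *ex_counters;

      void
      ex_init(void)
      {
              ex_counters = counters_alloc(EX_NCOUNTERS);
      }

      void
      ex_input(void)
      {
              struct counters_ref r;
              uint64_t *c;

              /* Enter this cpu's versioned counter set, bump, leave. */
              c = counters_enter(&r, ex_counters);
              c[EX_PACKETS]++;
              counters_leave(&r, ex_counters);
      }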
* Introduce a new 'softclock' thread that will be used to execute
  timeout callbacks needing a process context.

  The function timeout_set_proc(9) has to be used instead of
  timeout_set(9) when a timeout callback needs a process context.

  Note that if such a timeout is waiting, that is sleeping, for a
  non-negligible amount of time it might delay other timeouts needing
  a process context.

  dlg@ agrees with this as a temporary solution. Manpage tweaks from
  jmc@. ok kettenis@, bluhm@, mikeb@
  (mpi, 2016-09-22, 1 file, -1/+5)
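  [Editor's note: a sketch of the interface difference, assuming the
  timeout(9) signatures of the time; ex_* names are invented.]

      #include <sys/timeout.h>

      struct timeout ex_to;

      void
      ex_cb(void *arg)
      {
              /* Runs in the softclock thread, so a process context
               * is available here, unlike a timeout_set() callback. */
      }

      void
      ex_setup(void)
      {
              timeout_set_proc(&ex_to, ex_cb, NULL);
              timeout_add_sec(&ex_to, 1);
      }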
* Add missing call to db_ctf_init(). This was part of the larger diff
  that was ok guenther@ mpi@; somehow I forgot to commit this
  particular piece. (jasper, 2016-09-18, 1 file, -1/+6)
* Introduce Dynamic Profiling, a ddb(4)-based & gprof-compatible
  kernel profiling framework. Code patching is used to enable probes
  when entering functions. The probes will call a mcount()-like
  function to match the behavior of a GPROF kernel.

  Currently only available on amd64 and guarded under DDBPROF. Support
  for other archs will follow soon.

  A new sysctl knob, ddb.console, needs to be set to 1 in securelevel
  0 to be able to use this feature.

  Inputs and ok guenther@ (mpi, 2016-09-04, 1 file, -3/+4)
* Write the system time back to the RTC every 30 minutes. This fixes
  the problem that long-running machines which were not shut down
  properly would reboot with a badly offset system time. hints and ok
  kettenis@ (naddy, 2016-09-03, 1 file, -1/+3)
* Do not reinitialize __guard_local if it is 0. This cannot be done
  anymore, since it is now RO. It is the bootloader's job to
  initialize it correctly. If the bootloader fails to perform that,
  you silently lose. The road to building an always-available rng is
  served by us depending on it :)
  (deraadt, 2016-09-03, 1 file, -10/+1)
* Move links from http to https://www.openbsd.org/. ok beck
  (tb, 2016-09-02, 1 file, -2/+2)
* Backout the previous fix for the sendsyslog(2) with LOG_CONS
  solution. Permanently holding /dev/console open in the kernel works
  only until init(8) calls revoke(2). After that the console device
  vnode cannot be used anymore. It still resulted in a hanging init(8)
  if it tried to syslog(3) something. With the backout, dmesg -s also
  works again. (bluhm, 2016-05-17, 1 file, -18/+8)
* If sendsyslog(2) is called with LOG_CONS before syslogd(8) has been
  started and before init(8) has opened the console, the kernel could
  crash as the console device has not been initialized. Open
  /dev/console in the kernel before starting init(8) and keep it open.
  This way sendsyslog(2) can be called early in the system.
  OK beck@ deraadt@ (bluhm, 2016-05-10, 1 file, -8/+18)
* SROP mitigation. sendsig() stores a (per-process ^ &sigcontext)
  cookie inside the sigcontext. sigreturn(2) checks syscall entry was
  from the exact PC addr in the (per-process ASLR) sigtramp, verifies
  the cookie, and clears it to prevent sigcontext reuse.
  not yet tested on landisk, sparc, *88k, socppc. ok kettenis
  (deraadt, 2016-05-10, 1 file, -2/+3)
* Stop using a soft-interrupt context to process incoming network
  packets. Use a new task that runs holding the KERNEL_LOCK to execute
  mp-unsafe code. Our current goal is to progressively move input
  functions to the unlocked task.

  This gives a small performance boost confirmed by Hrvoje Popovski's
  IPv4 forwarding measurement:

           before:                  after:
      send      receive        send      receive
      400kpps   400kpps        400kpps   400kpps
      500kpps   500kpps        500kpps   500kpps
      600kpps   600kpps        600kpps   600kpps
      650kpps   650kpps        650kpps   640kpps
      700kpps   700kpps        700kpps   700kpps
      720kpps   640kpps        720kpps   710kpps
      800kpps   640kpps        800kpps   650kpps
      1.4Mpps   570kpps        1.4Mpps   590kpps
      14Mpps    570kpps        14Mpps    590kpps

  ok kettenis@, bluhm@, dlg@ (mpi, 2016-05-03, 1 file, -2/+1)
* Remove the unused flags argument from VOP_UNLOCK(). Torture tested
  on amd64, i386 and macppc. ok beck mpi stefan; "the change looks
  right" deraadt (natano, 2016-03-19, 1 file, -2/+2)
* copyright++; (jsg, 2016-01-03, 1 file, -2/+2)
* Replace mountroothook_establish(9) by config_mountroot(9), a
  narrower API similar to config_defer(9). ok mikeb@, deraadt@
  (mpi, 2015-12-11, 1 file, -2/+2)
* Keep all the setperf timeout(9) handling in one place. ok tedu@
  (naddy, 2015-11-08, 1 file, -10/+1)
* Initialize the routing table before domains. The routing table is
  not an optional component of the network stack and initializing it
  inside the "routing domain" requires some ugly introspection in the
  domain interface.

  This puts the rtable* layer at the same level as the if* level.
  These two subsystems are organized around the two global data
  structures used in the network stack:

  - the global &ifnet list, to be used in process context only, and
  - the routing table, which can be read in interrupt context.

  This change makes the rtable_* layer domain-aware and extends
  "struct domain" such that INET, INET6 and MPLS can specify the
  length of the binary key used in lookups. This allows us to keep, or
  move towards, AF-free route and rtable layers.

  While here stop the madness and pass the size of the maximum key
  length in *bytes* to rn_inithead0().

  ok claudio@, mikeb@ (mpi, 2015-10-07, 1 file, -2/+5)
* Use a global table for domains instead of building a list at run
  time. As a side effect there's no need to run if_attachdomain()
  after the list of domains has been built. ok claudio@, reyk@
  (mpi, 2015-08-30, 1 file, -2/+1)
* Disable pool_gc on m88k if MULTIPROCESSOR; we don't have enough
  volunteers for human sacrifices to get this fixed in a reasonably
  near future, and the tree must build.
  (miod, 2015-07-09, 1 file, -1/+3)
* Introduce srp, which according to the manpage I wrote is short for
  "shared reference pointers". srp allows concurrent access to a data
  structure by multiple cpus while avoiding interlocking cpu opcodes.
  It manages its own reference counts and the garbage collection of
  those data structures to avoid use-after-frees.

  Internally srp is a twisted version of hazard pointers, which are a
  relative of RCU. jmatthew wrote the bulk of a hazard pointer
  implementation and changed bpf to use it to allow mpsafe access to
  bpfilters. However, at s2k15 we were trying to apply it to other
  data structures but the memory overhead of every hazard pointer
  would have blown out significantly in several use cases. A bulk of
  our time at s2k15 was spent reworking hazard pointers into srp.

  This diff adds the srp api and adds the necessary metadata to struct
  cpuinfo on our MP architectures. srp on uniprocessor platforms has
  alternate code that is optimised because it knows there'll be no
  concurrent access to data by multiple cpus.

  srp is made available to the system via param.h, so it should be
  available everywhere in the kernel.

  The docs likely need improvement because I'm too close to the
  implementation. ok mpi@ (dlg, 2015-07-02, 1 file, -1/+4)
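  [Editor's note: a sketch of the read side with a hypothetical
  ex_thing; the srp_ref-based signatures are assumed from the later
  srp(9) manual and may differ from the API as first committed.]

      #include <sys/srp.h>

      struct ex_thing;

      struct srp ex_ptr = SRP_INITIALIZER();

      void
      ex_read(void)
      {
              struct srp_ref sr;
              struct ex_thing *t;

              /* Take a reference without locks or interlocked ops. */
              t = srp_enter(&sr, &ex_ptr);
              if (t != NULL) {
                      /* use t; don't sleep while the ref is held */
              }
              srp_leave(&sr);
      }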
* Reenable the pool gc task. The problems it tickled by working
  outside the biglock on archs with mutex and clock interaction have
  been fixed, as evidenced by the softnet taskq. ok deraadt@
  (dlg, 2015-06-24, 1 file, -3/+1)
* Reenable the page zeroing thread on MP m88k kernels.
  (miod, 2015-05-18, 1 file, -3/+2)
* emul_native is only used for kernel threads which can't dump core,
  so delete coredump_trad(), uvm_coredump(), cpu_coredump(), struct
  md_coredump, and various #includes that are superfluous.

  This leaves compat_linux processes without a coredump callback. If
  that ability is desired, someone should update it to use
  coredump_elf32() and verify the results...

  ok kettenis@ (guenther, 2015-05-05, 1 file, -4/+3)