path: root/sys/kern/kern_sched.c
Commit log, newest first. Each entry shows the commit message followed by
(author, date, files changed, lines -removed/+added).
* Simplify sleep_setup API to two operations in preparation for splitting
  the SCHED_LOCK().  (mpi, 2021-02-08, 1 file, -2/+2)

  Putting a thread on a sleep queue is reduced to the following:

      sleep_setup();
      /* check condition or release lock */
      sleep_finish();

  Previous version ok cheloha@, jmatthew@
  ok claudio@
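For illustration, a sketch of how a caller might use the reduced API. Only
the sleep_setup()/sleep_finish() pairing comes from the entry above; the
argument lists, the softc and the condition are assumptions.

    /* Hypothetical sleeper; parameter lists are assumed, not from the tree. */
    void
    wait_for_work(struct my_softc *sc)
    {
            sleep_setup(sc, PWAIT, "mywork");       /* enqueue on the sleep queue */
            /* check the condition, or release the lock protecting it */
            sleep_finish(0, sc->sc_work == NULL);   /* sleep only if still nothing to do */
    }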
* Use sysctl_int_bounded in sysctl_hwsmt  (gnezdo, 2021-01-09, 1 file, -6/+2)

  Prefer error reporting to silent clipping.

  OK millert@
* get rid of a vestigial bit of the sbartq.  (dlg, 2020-06-11, 1 file, -5/+1)

  i should have removed the sbartq pointer in r1.47 when i removed the
  sbartq.
* Remove sigacts structure sharing.  (claudio, 2020-02-21, 1 file, -2/+2)

  The only process that used sharing was proc0 which is used for kthreads
  and idle threads. proc0 and all those other kernel threads don't handle
  signals so there is no benefit in sharing. Simplifies the code a fair
  bit since the refcnt is gone.

  OK kettenis@
* Remove dead store, from Amit Kulkarni.  (mpi, 2020-02-05, 1 file, -2/+1)
* Split `p_priority' into `p_runpri' and `p_slppri'.  (mpi, 2020-01-30, 1 file, -5/+5)

  Using different fields to remember in which runqueue or sleepqueue
  threads currently are will make it easier to split the SCHED_LOCK().

  With this change, the (potentially boosted) sleeping priority no longer
  overwrites the thread priority. This lets us get rid of the logic
  required to synchronize `p_priority' with `p_usrpri'.

  Tested by many, ok visa@
* Import dt(4), a driver and framework for Dynamic Profiling.  (mpi, 2020-01-21, 1 file, -1/+4)

  The design is fairly simple: events, in the form of descriptors on a
  ring, are produced in any kernel context and consumed by a userland
  process reading /dev/dt.

  Code and hooks are all guarded under '#if NDT > 0' so this commit
  shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

  ok kettenis@, visa@, jasper@, deraadt@
* Restore the old way of dispatching dead procs through idle proc.  (visa, 2019-11-04, 1 file, -4/+19)

  The new way needs more thought.
* Move dead procs to the reaper queue immediately after context switch.  (visa, 2019-11-02, 1 file, -19/+4)

  This eliminates a forced context switch to the idle proc. In addition,
  sched_exit() no longer needs to sum proc runtime because mi_switch()
  will do it.

  OK mpi@ a while ago
* Kill resched_proc() and instead call need_resched() when a thread is
  added to the runqueue of a CPU.  (mpi, 2019-11-01, 1 file, -1/+4)

  This fixes out-of-sync cases where the priority of a thread wasn't
  reflecting the runqueue it was sitting in, leading to unnecessary
  context switches.

  ok visa@
* Reduce the number of places where `p_priority' and `p_stat' are set.  (mpi, 2019-10-15, 1 file, -10/+16)

  This refactoring will help future scheduler locking, in particular to
  shrink the SCHED_LOCK().

  No intended behavior change.

  ok visa@
* Revert to using the SCHED_LOCK() to protect time accounting.  (mpi, 2019-06-01, 1 file, -5/+1)

  It currently creates a lock ordering problem because SCHED_LOCK() is
  taken by hardclock(). That means the "priorities" of a thread should be
  moved out of the SCHED_LOCK() first in order to make progress.

  Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
  via anton@ as well as by kettenis@
* Use a per-process mutex to protect time accounting instead of SCHED_LOCK().  (mpi, 2019-05-31, 1 file, -1/+5)

  Note that hardclock(9) still increments p_{u,s,i}ticks without holding
  a lock.

  ok visa@, cheloha@
* Make sure that each ci has its spc_deferred queue initialized.  (visa, 2019-03-26, 1 file, -3/+2)

  Otherwise, the system can crash in smr_call_impl() if SMT is enabled
  later.

  Crash reported by jcs@
* Introduce safe memory reclamation, a mechanism for reclaiming shared
  objects that readers can access without locking.  (visa, 2019-02-26, 1 file, -1/+6)

  This provides a basis for read-copy-update operations.

  Readers access SMR-protected shared objects inside an SMR read-side
  critical section where sleeping is not allowed. To reclaim an
  SMR-protected object, the writer has to ensure mutual exclusion of
  other writers, remove the object's shared reference and wait until
  read-side references cannot exist any longer. As an alternative to
  waiting, the writer can schedule a callback that gets invoked when
  reclamation is safe.

  The mechanism relies on CPU quiescent states to determine when an
  SMR-protected object is ready for reclamation.

  The <sys/smr.h> header additionally provides an implementation of
  singly- and doubly-linked lists that can be used together with SMR.
  These lists allow lockless read access with a concurrent writer.

  Discussed with many. OK mpi@ sashan@
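A rough sketch of the reader/writer split described above. The object
layout, list handling and obj_free() are placeholders; smr_read_enter(),
smr_read_leave(), smr_init() and smr_call() (the "schedule a callback"
path) are recalled from the <sys/smr.h> interface the entry introduces
and should be checked against the tree.

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/smr.h>

    struct obj {
            struct obj              *o_next;
            struct smr_entry         o_smr;
            int                      o_value;
    };

    struct obj *head;                       /* shared, SMR-protected */

    void
    obj_free(void *arg)
    {
            free(arg, M_TEMP, sizeof(struct obj));
    }

    /* Reader: no lock needed, but sleeping is not allowed in the section. */
    int
    obj_peek(void)
    {
            struct obj *o;
            int v = -1;

            smr_read_enter();
            o = head;                       /* lockless read of the shared reference */
            if (o != NULL)
                    v = o->o_value;
            smr_read_leave();
            return v;
    }

    /* Writer: callers serialize against other writers (e.g. with a mutex). */
    void
    obj_remove_head(void)
    {
            struct obj *o = head;

            if (o == NULL)
                    return;
            head = o->o_next;               /* drop the shared reference */
            smr_init(&o->o_smr);
            smr_call(&o->o_smr, obj_free, o);  /* or: smr_barrier(); obj_free(o); */
    }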
* Add new KERN_CPUSTATS sysctl(2) so we can identify offline CPUs.  (cheloha, 2018-11-17, 1 file, -1/+7)

  Because of hw.smt we need a way to determine whether a given CPU is
  "online" or "offline" from userspace. KERN_CPTIME2 is an array, and so
  cannot be cleanly extended for this purpose, so add a new sysctl(2)
  KERN_CPUSTATS with an extensible struct. At the moment it's just
  KERN_CPTIME2 with a flags member, but it can grow as needed.

  KERN_CPUSTATS appears to have been defined by BSDi long ago, but there
  are few (if any) packages in the wild still using the symbol so
  breakage in ports should be near zero. No other system inherited the
  symbol from BSDi, either.

  Then, use the new sysctl(2) in systat(1) and top(1):

  - systat(1) draws placeholder marks ('-') instead of percentages for
    offline CPUs in the cpu view.

  - systat(1) omits offline CPU ticks when drawing the "big bar" in the
    vmstat view. The upshot is that the bar isn't half idle when half
    your logical CPUs are disabled.

  - top(1) does not draw lines for offline CPUs; if CPUs toggle on or
    offline in interactive mode we redraw the display to expand/reduce
    space for the new/missing CPUs. This is consistent with what some
    top(1) implementations do on Linux.

  - top(1) omits offline CPUs from the totals when CPU totals are
    combined into a single line (the '-1' flag).

  Originally prompted by deraadt@. Discussed endlessly with deraadt@,
  ketennis@, and sthen@. Tested by jmc@ and jca@. Earlier versions also
  discussed with jca@. Earlier versions tested by jmc@, tb@, and many
  others.

  docs ok jmc@, kernel bits ok ketennis@, everything ok sthen@,
  "Is your stuff in yet?" deraadt@
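A hedged userland sketch of querying the new sysctl. KERN_CPUSTATS is the
name introduced above; the struct cpustats member names (cs_time, cs_flags)
and the CPUSTATS_ONLINE flag are recalled from <sys/sysctl.h> and should be
verified there.

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/sched.h>          /* CPUSTATES */

    /* Returns 1 if `cpu' is online, 0 if offline, -1 on error. */
    int
    cpu_is_online(int cpu)
    {
            struct cpustats cs;
            int mib[3] = { CTL_KERN, KERN_CPUSTATS, cpu };
            size_t len = sizeof(cs);

            if (sysctl(mib, 3, &cs, &len, NULL, 0) == -1)
                    return -1;
            return (cs.cs_flags & CPUSTATS_ONLINE) != 0;
    }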
* Revert KERN_CPTIME2 ENODEV changes in kernel and userspace.  (cheloha, 2018-10-05, 1 file, -7/+1)

  ok kettenis deraadt
* KERN_CPTIME2: set ENODEV if the CPU is offline.  (cheloha, 2018-09-26, 1 file, -1/+7)

  This lets userspace distinguish between idle CPUs and those that are
  not schedulable because hw.smt=0.

  A subsequent commit probably needs to add documentation for this to
  sysctl.2 (and perhaps elsewhere) after the dust settles.

  Also included here are changes to systat(1) and top(1) that account for
  the ENODEV case and adjust behavior accordingly:

  - systat(1)'s cpu view prints placeholder marks ('-') instead of
    percentages for each state if the given CPU is offline.

  - systat(1)'s vmstat view checks for offline CPUs when computing the
    machine state total and excludes them, so the CPU usage graph only
    represents the states for online CPUs.

  - top(1) does not draw CPU rows for offline CPUs when the view is
    redrawn. If CPUs "go offline", percentages for each state are
    replaced by placeholder marks ('-'); the view will need to be redrawn
    to remove these rows. If CPUs "go online" the view will need to be
    redrawn to show these new CPUs. In "combined CPU" mode, the count and
    the state totals only represent online CPUs.

  Ports using KERN_CPTIME2 will need to be updated. The changes described
  above to make systat(1) and top(1) aware of the ENODEV case *and*
  gracefully handle a changing HW_NCPUONLINE while the application is
  running are not necessarily appropriate for each and every port. The
  changes described above are so extensive in part to demonstrate one way
  a program *might* be made robust to changing CPU availability. In
  particular, changing hw.smt after boot is an extremely rare event, and
  this needs to be weighed when updating ports.

  The logic needed to account for the KERN_CPTIME2 ENODEV case is very
  roughly:

      if (sysctl(...) == -1) {
              if (errno != ENODEV) {
                      /* Actual error occurred. */
              } else {
                      /* CPU is offline. */
              }
      } else {
              /* CPU is online and CPU states were set by sysctl(2). */
      }

  Prompted by deraadt@. Basic idea for ENODEV from kettenis@. Discussed
  at length with kettenis@. Additional testing by tb@. No complaints from
  hackers@ after a week.

  ok kettenis@, "I think you should commit [now]" deraadt@
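A fleshed-out version of the rough logic above, for reference only: the
entry dated 2018-10-05 further up reverts this ENODEV behaviour, so the
sketch is historical. The uint64_t element type for KERN_CPTIME2 is an
assumption.

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/sched.h>          /* CPUSTATES */
    #include <errno.h>
    #include <stdint.h>

    /* Returns 1 if times[] was filled in, 0 if the CPU is offline, -1 on error. */
    int
    read_cptime2(int cpu, uint64_t times[CPUSTATES])
    {
            int mib[3] = { CTL_KERN, KERN_CPTIME2, cpu };
            size_t len = CPUSTATES * sizeof(times[0]);

            if (sysctl(mib, 3, times, &len, NULL, 0) == -1) {
                    if (errno == ENODEV)
                            return 0;       /* CPU is offline */
                    return -1;              /* actual error */
            }
            return 1;                       /* CPU online, states filled in */
    }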
* Add hw.ncpuonline to count the number of online CPUs.  (cheloha, 2018-07-12, 1 file, -1/+21)

  The introduction of hw.smt means that logical CPUs can be disabled
  after boot and prior to suspend/resume. If hw.smt=0 (the default),
  there needs to be a way to count the number of hardware threads
  available on the system at any given time.

  So, import HW_NCPUONLINE/hw.ncpuonline from NetBSD and document it.
  hw.ncpu becomes equal to the number of CPUs given to sched_init_cpu()
  during boot, while hw.ncpuonline is equal to the number of CPUs
  available to the scheduler in the cpuset "sched_all_cpus". Set
  _SC_NPROCESSORS_ONLN equal to this new sysctl and keep
  _SC_NPROCESSORS_CONF equal to hw.ncpu.

  This is preferable to adding a new sysctl to count the number of
  configured CPUs and keeping hw.ncpu equal to the number of online CPUs
  because such a change would break software in the ecosystem that relies
  on HW_NCPU/hw.ncpu to measure CPU usage and the like. Such software in
  base includes top(1), systat(1), and snmpd(8), and perhaps others.

  We don't need additional locking to count the cardinality of a cpuset
  in this case because the only interfaces that can modify said
  cardinality are sysctl(2) and ioctl(2), both of which are under the
  KERNEL_LOCK.

  Software using HW_NCPU/hw.ncpu to determine optimal parallelism will
  need to be updated to use HW_NCPUONLINE/hw.ncpuonline. Until then, such
  software may perform suboptimally. However, most changes will be
  similar to the change included here for libcxx's
  std::thread::hardware_concurrency(): using HW_NCPUONLINE in lieu of
  HW_NCPU should be sufficient for determining optimal parallelism for
  most software if the change to _SC_NPROCESSORS_ONLN is insufficient.

  Prompted by deraadt. Discussed at length with kettenis, deraadt, and
  sthen. Lots of patch tweaks from kettenis.

  ok kettenis, "proceed" deraadt
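A small userland example of the distinction the entry describes, comparing
hw.ncpu, hw.ncpuonline and the updated sysconf(3) value:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int mib[2] = { CTL_HW, HW_NCPU };
            int ncpu, ncpuonline;
            size_t len;

            len = sizeof(ncpu);
            if (sysctl(mib, 2, &ncpu, &len, NULL, 0) == -1)
                    return 1;
            mib[1] = HW_NCPUONLINE;
            len = sizeof(ncpuonline);
            if (sysctl(mib, 2, &ncpuonline, &len, NULL, 0) == -1)
                    return 1;
            /* hw.ncpu: configured; hw.ncpuonline: usable by the scheduler */
            printf("configured %d, online %d, _SC_NPROCESSORS_ONLN %ld\n",
                ncpu, ncpuonline, sysconf(_SC_NPROCESSORS_ONLN));
            return 0;
    }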
* Release the kernel lock fully on thread exit.  (visa, 2018-07-07, 1 file, -2/+5)

  This prevents a locking error that would happen otherwise when a traced
  and stopped multithreaded process is forced to exit. The error shows up
  as a kernel panic when WITNESS is enabled. Without WITNESS, the error
  causes a system hang.

  sched_exit() has expected that a single KERNEL_UNLOCK() would release
  the lock completely. That assumption is wrong when an exit happens
  through the signal tracing logic:

      sched_exit
      exit1
      single_thread_check
      single_thread_set
      issignal        <-- KERNEL_LOCK()
      userret         <-- KERNEL_LOCK()
      syscall

  The error is a regression of r1.216 of kern_sig.c.

  Panic reported and fix tested by Laurence Tratt
  OK mpi@
* Don't steal processes from other CPUs if we're not scheduling processes
  on a CPU.  (kettenis, 2018-06-30, 1 file, -1/+5)

  ok deraadt@
* SMT (Simultaneous Multi-Threading) implementations typically share TLBs
  and L1 caches between threads.  (kettenis, 2018-06-19, 1 file, -2/+52)

  This can make cache timing attacks a lot easier and we strongly suspect
  that this will make several spectre-class bugs exploitable. Especially
  on Intel's SMT implementation, which is better known as Hyper-threading.
  We really should not run different security domains on different
  processor threads of the same core. Unfortunately changing our scheduler
  to take this into account is far from trivial.

  Since many modern machines no longer provide the ability to disable
  Hyper-threading in the BIOS setup, provide a way to disable the use of
  additional processor threads in our scheduler. And since we suspect
  there are serious risks, we disable them by default. This can be
  controlled through a new hw.smt sysctl. For now this only works on Intel
  CPUs when running OpenBSD/amd64. But we're planning to extend this
  feature to CPUs from other vendors and other hardware architectures.

  Note that SMT doesn't necessarily have a positive effect on performance;
  it highly depends on the workload. In all likelihood it will actually
  slow down most workloads if you have a CPU with more than two cores.

  ok deraadt@
* make sched_barrier use cond_wait/cond_signal.  (dlg, 2017-12-14, 1 file, -20/+16)

  previously the code was using a percpu flag to manage the
  sleeps/wakeups, which means multiple threads waiting for a barrier on a
  cpu could race. moving to a cond struct on the stack fixes this.

  while here, get rid of the sbar taskq and just use systqmp instead. the
  barrier tasks are short, so there's no real downside.

  ok mpi@
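A sketch of the stack-cond pattern the entry describes. cond_init(),
cond_wait() and cond_signal() are the calls named above; the header for the
cond API, the exact prototypes and the task body are assumptions.

    #include <sys/task.h>

    void
    barrier_task(void *arg)
    {
            struct cond *c = arg;

            /* ... whatever must run in the target taskq/CPU context ... */
            cond_signal(c);                 /* wake only our waiter */
    }

    void
    wait_for_barrier(void)
    {
            struct cond c;
            struct task t;

            cond_init(&c);                  /* cond lives on our stack ... */
            task_set(&t, barrier_task, &c);
            task_add(systqmp, &t);          /* ... so concurrent waiters can't race */
            cond_wait(&c, "sbar");
    }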
* Raise the IPL of the sbar taskq to avoid lock order issues with the
  kernel lock.  (visa, 2017-11-28, 1 file, -2/+2)

  Fixes a deadlock seen by Hrvoje Popovski and dhill@.

  OK mpi@, dhill@
* Split up fork1():  (guenther, 2017-02-12, 1 file, -2/+2)

  - FORK_THREAD handling is a totally separate function, thread_fork(),
    that is only used by sys___tfork() and which loses the flags, func,
    arg, and newprocp parameters and gains tcb parameter to guarantee the
    new thread's TCB is set before the creating thread returns
  - fork1() loses its stack and tidptr parameters

  Common bits factor out:
  - struct proc allocation and initialization moves to thread_new()
  - maxthread handling moves to fork_check_maxthread()
  - setting the new thread running moves to fork_thread_start()

  The MD cpu_fork() function swaps its unused stacksize parameter for a
  tcb parameter.

  luna88k testing by aoyama@, alpha testing by dlg@
  ok mpi@
* p_comm is the process's command and isn't per thread, so move it from
  struct proc to struct process.  (guenther, 2017-01-21, 1 file, -2/+3)

  ok deraadt@ kettenis@
* Allow pegged processes on secondary CPUs to continue to be scheduled
  when halting a CPU.  (kettenis, 2016-06-03, 1 file, -2/+5)

  Necessary for intr_barrier(9) to work when interrupts are targeted at
  secondary CPUs.

  ok mpi@, mikeb@ (a while back)
* Replace curcpu_is_idle() by cpu_is_idle() and use it instead of rolling
  our own.  (mpi, 2016-03-17, 1 file, -2/+2)

  From Michal Mazurek, ok mmcc@
* One "sbar" taskq is enough.kettenis2015-12-231-8/+7
| | | | ok visa@
* Make the cost of moving a process to the primary cpu a bit higher.  (kettenis, 2015-12-17, 1 file, -1/+10)

  This is the CPU that handles most hardware interrupts but we don't
  account for that in any way in the scheduler. So processes (and kernel
  threads) that are unlucky enough to end up on this CPU will get less
  CPU cycles than those running on other CPUs. This is especially true
  for the softnet taskq. There network interrupts will prevent the
  softnet taskq from running. This means that the more packets we
  receive, the less packets we can actually process and/or forward. This
  is why "unlocking" network drivers actually decreases the forwarding
  performance. This diff restores most of the lost performance by making
  it less likely that the softnet taskq ends up on the same CPU that
  handles network interrupts.

  Tested by Hrvoje Popovski

  ok mpi@, deraadt@
* Make sched_barrier() use its own task queue to avoid deadlocks.  (mpi, 2015-10-16, 1 file, -2/+13)

  Prevent a deadlock from occurring when intr_barrier() is called from a
  non-primary CPU in the watchdog task, also enqueued on ``systq''.

  ok kettenis@
* Short circuit if we're running on the CPU that we want to sync with.  (kettenis, 2015-09-20, 1 file, -2/+5)

  Fixes suspend on machines with em(4) now that it uses intr_barrier(9).

  ok krw@
* Introduce sched_barrier(9), an interface that acts as a scheduler
  barrier in the sense that it guarantees that the specified CPU went
  through the scheduler.  (kettenis, 2015-09-13, 1 file, -1/+46)

  This also guarantees that interrupt handlers running on that CPU will
  have finished when sched_barrier() returns.

  ok miod@, guenther@
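A hedged sketch of the kind of use this enables; the prototype of
sched_barrier() taking a struct cpu_info pointer is inferred from the
description above, not checked against the tree.

    /*
     * Tear down a per-CPU resource: once sched_barrier(ci) returns, the
     * CPU has gone through the scheduler, so any interrupt handler that
     * was running there has finished and the resource can be freed.
     */
    void
    example_detach(struct cpu_info *ci)
    {
            /* unhook the handler / stop queueing new work on ci first */
            sched_barrier(ci);
            /* now safe to free the per-CPU state */
    }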
* Remove some includes that include-what-you-use claims don't have any
  direct symbols used.  (jsg, 2015-03-14, 1 file, -4/+1)

  Tested for indirect use by compiling amd64/i386/sparc64 kernels.

  ok tedu@ deraadt@
* Keep under #ifdef MULTIPROCESSOR the code that deals with
  SPCF_SHOULDHALT and SPCF_HALTED; these flags only make sense on
  secondary CPUs, which are unlikely to be present on a SP kernel.  (mpi, 2014-09-24, 1 file, -1/+5)

  ok kettenis@
* If we're stopping a secondary cpu, don't let sched_choosecpu()
  short-circuit and return the current CPU, otherwise
  sched_stop_secondary_cpus() will spin forever trying to empty its run
  queues.  (kettenis, 2014-07-26, 1 file, -1/+3)

  Fixes hangs during suspend that many people reported over the last
  couple of days.

  ok bcook@, guenther@
* Fix sched_stop_secondary_cpus() to properly drain CPUs  (matthew, 2014-07-13, 1 file, -2/+2)

  TAILQ_FOREACH() isn't safe to use in sched_chooseproc() to iterate over
  the run queues because within the loop body we remove the threads from
  their run queues and reinsert them elsewhere. As a result, we end up
  only draining the first thread of each run queue rather than all of
  them.

  ok kettenis
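The queue(3) pattern the fix relies on, shown with a self-contained list
type rather than the real run-queue structures:

    #include <sys/queue.h>

    struct item {
            TAILQ_ENTRY(item) i_entry;
    };
    TAILQ_HEAD(itemq, item);

    /*
     * TAILQ_FOREACH() fetches the next element only after the loop body
     * has run, so removing (and reinserting elsewhere) the current
     * element derails the walk after one iteration.
     * TAILQ_FOREACH_SAFE() latches the next element up front.
     */
    void
    drain(struct itemq *from, struct itemq *to)
    {
            struct item *it, *tmp;

            TAILQ_FOREACH_SAFE(it, from, i_entry, tmp) {
                    TAILQ_REMOVE(from, it, i_entry);
                    TAILQ_INSERT_TAIL(to, it, i_entry);
            }
    }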
* Add PS_SYSTEM, the process-level mirror of the thread-level P_SYSTEM,
  and FORK_SYSTEM as a flag to set them.  (guenther, 2014-05-04, 1 file, -7/+2)

  This eliminates needing to peek into other processes' threads in
  various places.

  Inspired by NetBSD

  ok miod@ matthew@
* Eliminate the exit sig handling, which was only invokable via the
  Linux-compat clone() syscall when *not* using CLONE_THREAD.  (guenther, 2014-02-12, 1 file, -2/+2)

  pirofti@ confirms Opera runs in compat without this, so out it goes;
  one less hair to choke on in kern_exit.c

  ok tedu@ pirofti@
* Prevent idle thread from being stolen on startup.  (haesbaert, 2013-06-06, 1 file, -2/+13)

  There is a race condition which might trigger a case where two cpus try
  to run the same idle thread.

  The problem arises when one cpu steals the idle proc of another cpu and
  this other cpu ends up running the idle thread via spc->spc_idleproc,
  resulting in two cpus trying to cpu_switchto(idleX).

  On startup, idle procs are scattered around different runqueues; the
  decision for scheduling is:

  1 look at my runqueue.
  2 if empty, look at other dudes' runqueue.
  3 if empty, select idle proc via spc->spc_idleproc.

  The problem is that cpu0's idle0 might be running on cpu1 due to step 1
  or 2 and cpu0 hits step 3. So cpu0 will select idle0, while cpu1 is in
  fact running it already.

  The solution is to never place idle on a runqueue, therefore being only
  selectable through spc->spc_idleproc.

  This race can be more easily triggered on a HT cpu on virtualized
  environments, where the guest more often than not doesn't have the cpu
  for itself, so timing gets shuffled.

  ok tedu@ guenther@
  go ahead after t2k13 deraadt@
* Convert some internal APIs to use timespecs instead of timevals  (guenther, 2013-06-03, 1 file, -5/+5)

  ok matthew@ deraadt@
* sprinkle ifdef MP to disable cpu migration code when not needed.  (tedu, 2013-04-19, 1 file, -7/+17)

  ok deraadt
* Make sure that we don't schedule processes on CPUs that we have taken
  out of the scheduler.  (kettenis, 2012-07-10, 1 file, -1/+5)

  ok haesbaert@, deraadt@
* Make rusage totals, itimers, and profile settings per-process instead
  of per-rthread.  (guenther, 2012-03-23, 1 file, -2/+2)

  Handling of per-thread tick and runtime counters inspired by how
  FreeBSD does it.

  ok kettenis@
* Account for sched_noidle and document the scheduler variables.  (haesbaert, 2012-03-10, 1 file, -11/+13)

  ok tedu@
* Remove all MD diagnostics in cpu_switchto(), and move them to MI code
  if they apply.  (miod, 2011-10-12, 1 file, -1/+4)

  ok oga@ deraadt@
* Clean up after P_BIGLOCK removal.  (art, 2011-07-06, 1 file, -3/+3)

  KERNEL_PROC_LOCK   -> KERNEL_LOCK
  KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

  oga@ ok
* Delete a fallback definition for CPU_INFO_UNIT that's both unnecessary
  and incorrect.  (guenther, 2010-05-28, 1 file, -8/+1)

  Kills an XXX comment.

  ok syuu, thib, art, kettenis, millert, deraadt
* Actively remove processes from the runqueues of a CPU when we stop it.  (kettenis, 2010-05-25, 1 file, -5/+22)

  Also make sure not to take the scheduler lock once we have stopped a
  CPU such that we can safely take it away without having to worry about
  deadlock because it happened to own the scheduler lock.

  Fixes issues with suspend on SMP machines.

  ok mlarkin@, marco@, art@, deraadt@
* Make sure we initialize sched_lock before we try to use it.  (kettenis, 2010-05-14, 1 file, -4/+1)

  ok miod@, thib@, oga@, jsing@