Diffstat (limited to 'Documentation/scheduler')
 Documentation/scheduler/completion.rst        |   4
 Documentation/scheduler/index.rst             |  13
 Documentation/scheduler/membarrier.rst        |  39
 Documentation/scheduler/sched-arch.rst        |   6
 Documentation/scheduler/sched-bwc.rst         | 100
 Documentation/scheduler/sched-capacity.rst    | 442
 Documentation/scheduler/sched-deadline.rst    |  34
 Documentation/scheduler/sched-debug.rst       |  54
 Documentation/scheduler/sched-design-CFS.rst  |  33
 Documentation/scheduler/sched-domains.rst     |  45
 Documentation/scheduler/sched-eevdf.rst       |  43
 Documentation/scheduler/sched-energy.rst      |  52
 Documentation/scheduler/sched-ext.rst         | 362
 Documentation/scheduler/sched-nice-design.rst |   2
 Documentation/scheduler/sched-rt-group.rst    |  43
 Documentation/scheduler/sched-stats.rst       | 139
 Documentation/scheduler/sched-util-clamp.rst  | 741
 Documentation/scheduler/schedutil.rst         | 172
18 files changed, 2132 insertions, 192 deletions
diff --git a/Documentation/scheduler/completion.rst b/Documentation/scheduler/completion.rst index 9f039b4f4b09..adf0c0a56d02 100644 --- a/Documentation/scheduler/completion.rst +++ b/Documentation/scheduler/completion.rst @@ -51,7 +51,7 @@ which has only two fields:: struct completion { unsigned int done; - wait_queue_head_t wait; + struct swait_queue_head wait; }; This provides the ->wait waitqueue to place tasks on for waiting (if any), and @@ -157,7 +157,7 @@ A typical usage scenario is:: /* run non-dependent code */ /* do setup */ - wait_for_completion(&setup_done); complete(setup_done); + wait_for_completion(&setup_done); complete(&setup_done); This is not implying any particular order between wait_for_completion() and the call to complete() - if the call to complete() happened before the call diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst index 69074e5de9c4..5dd53e47bc0c 100644 --- a/Documentation/scheduler/index.rst +++ b/Documentation/scheduler/index.rst @@ -1,21 +1,28 @@ -=============== -Linux Scheduler -=============== +========= +Scheduler +========= .. toctree:: :maxdepth: 1 completion + membarrier sched-arch sched-bwc sched-deadline sched-design-CFS + sched-eevdf sched-domains + sched-capacity sched-energy + schedutil + sched-util-clamp sched-nice-design sched-rt-group sched-stats + sched-ext + sched-debug text_files diff --git a/Documentation/scheduler/membarrier.rst b/Documentation/scheduler/membarrier.rst new file mode 100644 index 000000000000..2387804b1c63 --- /dev/null +++ b/Documentation/scheduler/membarrier.rst @@ -0,0 +1,39 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +membarrier() System Call +======================== + +MEMBARRIER_CMD_{PRIVATE,GLOBAL}_EXPEDITED - Architecture requirements +===================================================================== + +Memory barriers before updating rq->curr +---------------------------------------- + +The commands MEMBARRIER_CMD_PRIVATE_EXPEDITED and MEMBARRIER_CMD_GLOBAL_EXPEDITED +require each architecture to have a full memory barrier after coming from +user-space, before updating rq->curr. This barrier is implied by the sequence +rq_lock(); smp_mb__after_spinlock() in __schedule(). The barrier matches a full +barrier in the proximity of the membarrier system call exit, cf. +membarrier_{private,global}_expedited(). + +Memory barriers after updating rq->curr +--------------------------------------- + +The commands MEMBARRIER_CMD_PRIVATE_EXPEDITED and MEMBARRIER_CMD_GLOBAL_EXPEDITED +require each architecture to have a full memory barrier after updating rq->curr, +before returning to user-space. The schemes providing this barrier on the various +architectures are as follows. + + - alpha, arc, arm, hexagon, mips rely on the full barrier implied by + spin_unlock() in finish_lock_switch(). + + - arm64 relies on the full barrier implied by switch_to(). + + - powerpc, riscv, s390, sparc, x86 rely on the full barrier implied by + switch_mm(), if mm is not NULL; they rely on the full barrier implied + by mmdrop(), otherwise. On powerpc and riscv, switch_mm() relies on + membarrier_arch_switch_mm(). + +The barrier matches a full barrier in the proximity of the membarrier system call +entry, cf. membarrier_{private,global}_expedited(). 
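For reference, the user-space side of these commands is a single system call. Below is a minimal sketch (not part of the patch above) of issuing a private expedited barrier; the command constants come from include/uapi/linux/membarrier.h, and glibc provides no wrapper, hence syscall(2):

.. code-block:: c

    #include <linux/membarrier.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int membarrier(int cmd, unsigned int flags, int cpu_id)
    {
            return syscall(__NR_membarrier, cmd, flags, cpu_id);
    }

    int main(void)
    {
            /* A process must register before using the private expedited command. */
            if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0))
                    perror("register");

            /*
             * On return, every other thread of this process running in
             * user-space has executed a full memory barrier -- the rq->curr
             * ordering described above is what makes this guarantee hold
             * across context switches.
             */
            if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0))
                    perror("membarrier");
            return 0;
    }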
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst index 0eaec669790a..ed07efea7d02 100644 --- a/Documentation/scheduler/sched-arch.rst +++ b/Documentation/scheduler/sched-arch.rst @@ -10,7 +10,7 @@ Context switch By default, the switch_to arch function is called with the runqueue locked. This is usually not a problem unless switch_to may need to take the runqueue lock. This is usually due to a wake up operation in -the context switch. See arch/ia64/include/asm/switch_to.h for an example. +the context switch. To request the scheduler call switch_to with the runqueue unlocked, you must `#define __ARCH_WANT_UNLOCKED_CTXSW` in a header file @@ -68,9 +68,5 @@ Possible arch/ problems Possible arch problems I found (and either tried to fix or didn't): -ia64 - is safe_halt call racy vs interrupts? (does it sleep?) (See #4a) - -sh64 - Is sleeping racy vs interrupts? (See #4a) - sparc - IRQs on at this point(?), change local_irq_save to _disable. - TODO: needs secondary CPUs to disable preempt (See #1) diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst index 9801d6b284b1..e881a945c188 100644 --- a/Documentation/scheduler/sched-bwc.rst +++ b/Documentation/scheduler/sched-bwc.rst @@ -2,8 +2,9 @@ CFS Bandwidth Control ===================== -[ This document only discusses CPU bandwidth control for SCHED_NORMAL. - The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ] +.. note:: + This document only discusses CPU bandwidth control for SCHED_NORMAL. + The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. @@ -21,33 +22,89 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it is transferred to cpu-local "silos" on a demand basis. The amount transferred within each of these updates is tunable and described as the "slice". +Burst feature +------------- +This feature borrows time now against our future underrun, at the cost of +increased interference against the other system users. All nicely bounded. + +Traditional (UP-EDF) bandwidth control is something like: + + (U = \Sum u_i) <= 1 + +This guarantees both that every deadline is met and that the system is +stable. After all, if U were > 1, then for every second of walltime, +we'd have to run more than a second of program time, and obviously miss +our deadline; the next deadline would be further out still, so there is +never time to catch up - unbounded failure. + +The burst feature observes that a workload doesn't always execute the full +quota; this enables one to describe u_i as a statistical distribution. + +For example, take u_i = {x,e}_i, where x is the p(95) value and x+e the +p(100) value (the traditional WCET). This effectively allows u to be smaller, +increasing the efficiency (we can pack more tasks in the system), but at +the cost of missing deadlines when all the odds line up. However, it +does maintain stability, since every overrun must be paired with an +underrun as long as our x is above the average. + +That is, suppose we have 2 tasks that both specify a p(95) value; then we +have a p(95)*p(95) = 90.25% chance both tasks are within their quota and +everything is good. At the same time we have a p(5)p(5) = 0.25% chance +both tasks will exceed their quota at the same time (guaranteed deadline +fail).
Somewhere in between there's a threshold where one exceeds and +the other doesn't underrun enough to compensate; this depends on the +specific CDFs. + +At the same time, we can say that the worst case deadline miss will be +\Sum e_i; that is, there is a bounded tardiness (under the assumption +that x+e is indeed WCET). + +The interference introduced by using burst is characterized by the +probability of missing the deadline and by the average WCET. Test results +showed that when there are many cgroups or the CPU is under-utilized, the +interference is limited. More details are shown in: +https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/ + Management ---------- -Quota and period are managed within the cpu subsystem via cgroupfs. +Quota, period and burst are managed within the cpu subsystem via cgroupfs. + +.. note:: + The cgroupfs files described in this section are only applicable + to cgroup v1. For cgroup v2, see + :ref:`Documentation/admin-guide/cgroup-v2.rst <cgroup-v2-cpu>`. -cpu.cfs_quota_us: the total available run-time within a period (in microseconds) -cpu.cfs_period_us: the length of a period (in microseconds) -cpu.stat: exports throttling statistics [explained further below] +- cpu.cfs_quota_us: run-time replenished within a period (in microseconds) +- cpu.cfs_period_us: the length of a period (in microseconds) +- cpu.stat: exports throttling statistics [explained further below] +- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds) The default values are:: cpu.cfs_period_us=100ms - cpu.cfs_quota=-1 + cpu.cfs_quota_us=-1 + cpu.cfs_burst_us=0 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place; such a group is described as an unconstrained bandwidth group. This represents the traditional work-conserving behavior for CFS. -Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when -bandwidth limits are used in a hierarchical fashion, these are explained in -more detail below. +Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will +enact the specified bandwidth limit. The minimum allowed for the quota or +period is 1ms. There is also an upper bound on the period length of 1s. +Additional restrictions exist when bandwidth limits are used in a hierarchical +fashion; these are explained in more detail below. Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more. +A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate +any unused bandwidth. It leaves the traditional bandwidth control behavior for +CFS unchanged. Writing any (valid) positive value(s) no larger than +cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth +accumulation. + Any updates to a group's bandwidth specification will result in it becoming unthrottled if it is in a constrained state. @@ -67,7 +124,7 @@ for more fine-grained consumption. Statistics ---------- -A group's bandwidth statistics are exported via 3 fields in cpu.stat. +A group's bandwidth statistics are exported via 5 fields in cpu.stat. cpu.stat: @@ -75,6 +132,9 @@ cpu.stat: - nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities of the group have been throttled. +- nr_bursts: Number of periods in which a burst occurred. +- burst_time: Cumulative wall-time (in nanoseconds) that any CPU has used + above quota in the respective periods. This interface is read-only. @@ -126,7 +186,7 @@ average usage, albeit over a longer time window than a single period. This also limits the burst ability to no more than 1ms per cpu. This provides a better, more predictable user experience for highly threaded applications with small quota limits on high core count machines. It also eliminates the -propensity to throttle these applications while simultanously using less than +propensity to throttle these applications while simultaneously using less than quota amounts of cpu. Another way to say this is that by allowing the unused portion of a slice to remain valid across periods we have decreased the possibility of wastefully expiring quota on cpu-local silos that don't need a @@ -172,3 +232,15 @@ Examples By using a small period here we are ensuring a consistent latency response at the expense of burst capacity. + +4. Limit a group to 40% of 1 CPU, and allow it to accumulate an additional + 20% of 1 CPU, in case it has underrun its quota in earlier periods. + + With a 50ms period, a 20ms quota is equivalent to 40% of 1 CPU, + and a 10ms burst to a further 20% of 1 CPU:: + + # echo 20000 > cpu.cfs_quota_us /* quota = 20ms */ + # echo 50000 > cpu.cfs_period_us /* period = 50ms */ + # echo 10000 > cpu.cfs_burst_us /* burst = 10ms */ + + A larger buffer setting (no larger than the quota) allows greater burst capacity. diff --git a/Documentation/scheduler/sched-capacity.rst b/Documentation/scheduler/sched-capacity.rst new file mode 100644 index 000000000000..de414b33dd2a --- /dev/null +++ b/Documentation/scheduler/sched-capacity.rst @@ -0,0 +1,442 @@ +========================= +Capacity Aware Scheduling +========================= + +1. CPU Capacity +=============== + +1.1 Introduction +---------------- + +Conventional, homogeneous SMP platforms are composed of purely identical +CPUs. Heterogeneous platforms on the other hand are composed of CPUs with +different performance characteristics - on such platforms, not all CPUs can be +considered equal. + +CPU capacity is a measure of the performance a CPU can reach, normalized against +the most performant CPU in the system. Heterogeneous systems are also called +asymmetric CPU capacity systems, as they contain CPUs of different capacities. + +Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems +from two factors: + +- not all CPUs may have the same microarchitecture (µarch). +- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be + physically able to attain the higher Operating Performance Points (OPP). + +Arm big.LITTLE systems are an example of both. The big CPUs are more +performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, +smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones +can. + +CPU performance is usually expressed in Millions of Instructions Per Second +(MIPS), which can also be expressed as a given amount of instructions attainable +per Hz, leading to:: + + capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) + +1.2 Scheduler terms +------------------- + +Two different capacity values are used within the scheduler. A CPU's +``original capacity`` is its maximum attainable capacity, i.e. its maximum +attainable performance level.
This original capacity is returned by +the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original +capacity`` to which some loss of available performance (e.g. time spent +handling IRQs) is subtracted. + +Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, +while ``original capacity`` is class-agnostic. The rest of this document will use +the term ``capacity`` interchangeably with ``original capacity`` for the sake of +brevity. + +1.3 Platform examples +--------------------- + +1.3.1 Identical OPPs +~~~~~~~~~~~~~~~~~~~~ + +Consider a hypothetical dual-core asymmetric CPU capacity system where + +- work_per_hz(CPU0) = W +- work_per_hz(CPU1) = W/2 +- all CPUs are running at the same fixed frequency + +By the above definition of capacity: + +- capacity(CPU0) = C +- capacity(CPU1) = C/2 + +To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would +be a LITTLE. + +With a workload that periodically does a fixed amount of work, you will get an +execution trace like so:: + + CPU0 work ^ + | ____ ____ ____ + | | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + + CPU1 work ^ + | _________ _________ ____ + | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + +CPU0 has the highest capacity in the system (C), and completes a fixed amount of +work W in T units of time. On the other hand, CPU1 has half the capacity of +CPU0, and thus only completes W/2 in T. + +1.3.2 Different max OPPs +~~~~~~~~~~~~~~~~~~~~~~~~ + +Usually, CPUs of different capacity values also have different maximum +OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: + +- max_freq(CPU0) = F +- max_freq(CPU1) = 2/3 * F + +This yields: + +- capacity(CPU0) = C +- capacity(CPU1) = C/3 + +Executing the same workload as described in 1.3.1, with each CPU running at its +maximum frequency, results in:: + + CPU0 work ^ + | ____ ____ ____ + | | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + + CPU1 work ^ + | ______________ ______________ ____ + | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + +1.4 Representation caveat +------------------------- + +It should be noted that having a *single* value to represent differences in CPU +performance is somewhat of a contentious point. The relative performance +difference between two different µarchs could be X% on integer operations, Y% on +floating point operations, Z% on branches, and so on. Still, results using this +simple approach have been satisfactory for now. + +2. Task utilization +=================== + +2.1 Introduction +---------------- + +Capacity aware scheduling requires an expression of a task's requirements with +regards to CPU capacity. Each scheduler class can express this differently, and +while task utilization is specific to CFS, it is convenient to describe it here +in order to introduce more generic concepts. + +Task utilization is a percentage meant to represent the throughput requirements +of a task. A simple approximation of it is the task's duty cycle, i.e.:: + + task_util(p) = duty_cycle(p) + +On an SMP system with fixed frequencies, 100% utilization suggests the task is a +busy loop. Conversely, 10% utilization hints it is a small periodic task that +spends more time sleeping than executing. Variable CPU frequencies and +asymmetric CPU capacities complicate this somewhat; the following sections will +expand on these.
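To make the duty-cycle approximation above concrete, here is a small stand-alone sketch (illustrative only, not a kernel API; the duty_cycle() helper is hypothetical) that computes utilization as the fraction of a period spent executing:

.. code-block:: c

    #include <stdio.h>

    /*
     * Illustration of task_util(p) ~= duty_cycle(p): the percentage of
     * wall-clock time a periodic task spends executing. Values mirror
     * the examples above: a busy loop -> 100%, a small periodic task -> 10%.
     */
    static double duty_cycle(double runtime_ms, double period_ms)
    {
            return 100.0 * runtime_ms / period_ms;
    }

    int main(void)
    {
            printf("busy loop:     %.0f%%\n", duty_cycle(100.0, 100.0)); /* 100% */
            printf("periodic task: %.0f%%\n", duty_cycle(10.0, 100.0));  /*  10% */
            return 0;
    }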
+ +2.2 Frequency invariance +------------------------ + +One issue that needs to be taken into account is that a workload's duty cycle is +directly impacted by the current OPP the CPU is running at. Consider running a +periodic workload at a given frequency F:: + + CPU work ^ + | ____ ____ ____ + | | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + +This yields duty_cycle(p) == 25%. + +Now, consider running the *same* workload at frequency F/2:: + + CPU work ^ + | _________ _________ ____ + | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + +This yields duty_cycle(p) == 50%, despite the task having the exact same +behaviour (i.e. executing the same amount of work) in both executions. + +The task utilization signal can be made frequency invariant using the following +formula:: + + task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) + +Applying this formula to the two examples above yields a frequency invariant +task utilization of 25%. + +2.3 CPU invariance +------------------ + +CPU capacity has a similar effect on task utilization in that running an +identical workload on CPUs of different capacity values will yield different +duty cycles. + +Consider the system described in 1.3.2., i.e.:: + +- capacity(CPU0) = C +- capacity(CPU1) = C/3 + +Executing a given periodic workload on each CPU at their maximum frequency would +result in:: + + CPU0 work ^ + | ____ ____ ____ + | | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + + CPU1 work ^ + | ______________ ______________ ____ + | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + +IOW, + +- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency +- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency + +The task utilization signal can be made CPU invariant using the following +formula:: + + task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) + +with ``max_capacity`` being the highest CPU capacity value in the +system. Applying this formula to the above example yields a CPU invariant +task utilization of 25%. + +2.4 Invariant task utilization +------------------------------ + +Both frequency and CPU invariance need to be applied to task utilization in +order to obtain a truly invariant signal. The pseudo-formula for a task +utilization that is both CPU and frequency invariant is thus, for a given +task p:: + + curr_frequency(cpu) capacity(cpu) + task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- + max_frequency(cpu) max_capacity + +In other words, invariant task utilization describes the behaviour of a task as +if it were running on the highest-capacity CPU in the system, running at its +maximum frequency. + +Any mention of task utilization in the following sections will imply its +invariant form. + +2.5 Utilization estimation +-------------------------- + +Without a crystal ball, task behaviour (and thus task utilization) cannot +accurately be predicted the moment a task first becomes runnable. The CFS class +maintains a handful of CPU and task signals based on the Per-Entity Load +Tracking (PELT) mechanism, one of those yielding an *average* utilization (as +opposed to instantaneous). + +This means that while the capacity aware scheduling criteria will be written +considering a "true" task utilization (using a crystal ball), the implementation +will only ever be able to use an estimator thereof. + +
3. Capacity aware scheduling requirements +========================================= + +3.1 CPU capacity +---------------- + +Linux cannot currently figure out CPU capacity on its own; this information thus +needs to be handed to it. Architectures must define arch_scale_cpu_capacity() +for that purpose. + +The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver +CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see +Documentation/devicetree/bindings/cpu/cpu-capacity.txt. + +3.2 Frequency invariance +------------------------ + +As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task +utilization. Architectures must define arch_scale_freq_capacity(cpu) for that +purpose. + +Implementing this function requires figuring out which frequency each CPU +has been running at. One way to implement this is to leverage hardware counters +whose increment rates scale with a CPU's current frequency (APERF/MPERF on x86, +AMU on arm64). Another is to directly hook into cpufreq frequency transitions, +when the kernel is aware of the switched-to frequency (also employed by +arm/arm64). + +4. Scheduler topology +===================== + +During the construction of the sched domains, the scheduler will figure out +whether the system exhibits asymmetric CPU capacities. Should that be the +case: + +- The sched_asym_cpucapacity static key will be enabled. +- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain + level that spans all unique CPU capacity values. +- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans + CPUs with any range of asymmetry. + +The sched_asym_cpucapacity static key is intended to guard sections of code that +cater to asymmetric CPU capacity systems. Do note however that said key is +*system-wide*. Imagine the following setup using cpusets:: + + capacity C/2 C + ________ ________ + / \ / \ + CPUs 0 1 2 3 4 5 6 7 + \__/ \______________/ + cpusets cs0 cs1 + +Which could be created via: + +.. code-block:: sh + + mkdir /sys/fs/cgroup/cpuset/cs0 + echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus + echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems + + mkdir /sys/fs/cgroup/cpuset/cs1 + echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus + echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems + + echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance + +Since there *is* CPU capacity asymmetry in the system, the +sched_asym_cpucapacity static key will be enabled. However, the sched_domain +hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't +set in that hierarchy, it describes an SMP island and should be treated as such. + +Therefore, the 'canonical' pattern for protecting codepaths that cater to +asymmetric CPU capacities is to: + +- Check the sched_asym_cpucapacity static key +- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in + the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific + CPU or group thereof) + +5. Capacity aware scheduling implementation +=========================================== + +5.1 CFS +------- + +5.1.1 Capacity fitness +~~~~~~~~~~~~~~~~~~~~~~ + +The main capacity scheduling criterion of CFS is:: + + task_util(p) < capacity(task_cpu(p)) + +This is commonly called the capacity fitness criterion, i.e. CFS must ensure a +task "fits" on its CPU. If it is violated, the task will need to achieve more +work than its CPU can provide: it will be CPU-bound.
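As an illustration, the in-kernel check has roughly the following shape. This is a simplified sketch modelled on the fits_capacity() helper in kernel/sched/fair.c, which keeps roughly 20% headroom rather than comparing raw values:

.. code-block:: c

    #include <stdbool.h>

    /*
     * Sketch of the capacity fitness test: a utilization "fits" a CPU if
     * it stays below ~80% of the CPU's capacity (the 1280/1024 ratio is a
     * 1.25 headroom factor, mirroring fits_capacity() in
     * kernel/sched/fair.c). Standalone illustration only.
     */
    static inline bool task_fits(unsigned long util, unsigned long capacity)
    {
            return util * 1280 < capacity * 1024;
    }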
+ +Furthermore, uclamp lets userspace specify a minimum and a maximum utilization +value for a task, either via sched_setattr() or via the cgroup interface (see +Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used to +clamp task_util() in the previous criterion. + +5.1.2 Wakeup CPU selection +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +CFS task wakeup CPU selection follows the capacity fitness criterion described +above. On top of that, uclamp is used to clamp the task utilization values, +which lets userspace have more leverage over the CPU selection of CFS +tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: + + clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) + +By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run +on any CPU by giving it a low uclamp.max value. Conversely, it can force a small +periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by +giving it a high uclamp.min value. + +.. note:: + + Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling + (EAS), which is described in Documentation/scheduler/sched-energy.rst. + +5.1.3 Load balancing +~~~~~~~~~~~~~~~~~~~~ + +A pathological case in the wakeup CPU selection occurs when a task rarely +sleeps, if at all - it thus rarely wakes up, if at all. Consider:: + + w == wakeup event + + capacity(CPU0) = C + capacity(CPU1) = C / 3 + + workload on CPU0 + CPU work ^ + | _________ _________ ____ + | | | | | | + +----+----+----+----+----+----+----+----+----+----+-> time + w w w + + workload on CPU1 + CPU work ^ + | ____________________________________________ + | | + +----+----+----+----+----+----+----+----+----+----+-> + w + +This workload should run on CPU0, but if the task either: + +- was improperly scheduled from the start (inaccurate initial + utilization estimation) +- was properly scheduled from the start, but suddenly needs more + processing power + +then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; +the CPU capacity scheduling criterion is violated, and there may not be any more +wakeup events to fix this up via wakeup CPU selection. + +Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism +put in place to handle this shares the same name. Misfit task migration +leverages the CFS load balancer, more specifically the active load balance part +(which caters to migrating currently running tasks). When load balance happens, +a misfit active load balance will be triggered if a misfit task can be migrated +to a CPU with more capacity than its current one. + +5.2 RT +------ + +5.2.1 Wakeup CPU selection +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RT task wakeup CPU selection searches for a CPU that satisfies:: + + task_uclamp_min(p) <= capacity(task_cpu(p)) + +while still following the usual priority constraints. If none of the candidate +CPUs can satisfy this capacity criterion, then strict priority based scheduling +is followed and CPU capacities are ignored. + +5.3 DL +------ + +5.3.1 Wakeup CPU selection +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +DL task wakeup CPU selection searches for a CPU that satisfies:: + + task_bandwidth(p) < capacity(task_cpu(p)) + +while still respecting the usual bandwidth and deadline constraints. If +none of the candidate CPUs can satisfy this capacity criterion, then the +task will remain on its current CPU.
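The uclamp values referenced throughout this section are set from userspace via sched_setattr(). The following is a hedged sketch; the structure layout and flag values mirror include/uapi/linux/sched.h and linux/sched/types.h at the time of writing and should be verified against your headers:

.. code-block:: c

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Mirrors include/uapi/linux/sched/types.h (check your headers). */
    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime, sched_deadline, sched_period;
            uint32_t sched_util_min, sched_util_max;
    };

    #define SCHED_FLAG_KEEP_ALL       0x18 /* keep policy and params */
    #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
    #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.sched_flags = SCHED_FLAG_KEEP_ALL |
                               SCHED_FLAG_UTIL_CLAMP_MIN |
                               SCHED_FLAG_UTIL_CLAMP_MAX;
            attr.sched_util_min = 0;   /* no minimum performance request */
            attr.sched_util_max = 512; /* cap at half of SCHED_CAPACITY_SCALE (1024) */

            /* glibc has no wrapper for sched_setattr(); 0 == calling thread. */
            if (syscall(__NR_sched_setattr, 0, &attr, 0))
                    perror("sched_setattr");
            return 0;
    }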
diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst index 14a2f7bf63fe..a727827b8dd5 100644 --- a/Documentation/scheduler/sched-deadline.rst +++ b/Documentation/scheduler/sched-deadline.rst @@ -203,12 +203,15 @@ Deadline Task Scheduling - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the runqueue, including the tasks in Inactive state. + - Maximum usable bandwidth (max_bw): This is the maximum bandwidth usable by + deadline tasks and is currently set to the RT capacity. + The algorithm reclaims the bandwidth of the tasks in Inactive state. It does so by decrementing the runtime of the executing task Ti at a pace equal to - dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt + dq = -(max{ Ui, (Umax - Uinact - Uextra) } / Umax) dt where: @@ -588,12 +591,13 @@ Deadline Task Scheduling The system wide settings are configured under the /proc virtual file system. - For now the -rt knobs are used for -deadline admission control and the - -deadline runtime is accounted against the -rt runtime. We realize that this - isn't entirely desirable; however, it is better to have a small interface for - now, and be able to change it easily later. The ideal situation (see 5.) is to - run -rt tasks from a -deadline server; in which case the -rt bandwidth is a - direct subset of dl_bw. + For now the -rt knobs are used for -deadline admission control and with + CONFIG_RT_GROUP_SCHED the -deadline runtime is accounted against the (root) + -rt runtime. With !CONFIG_RT_GROUP_SCHED the knob only serves for the -dl + admission control. We realize that this isn't entirely desirable; however, it + is better to have a small interface for now, and be able to change it easily + later. The ideal situation (see 5.) is to run -rt tasks from a -deadline + server; in which case the -rt bandwidth is a direct subset of dl_bw. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created while the sum of their bandwidths stays below: @@ -707,7 +711,7 @@ Deadline Task Scheduling and how to prevent non-root users "cheating" the system? As already discussed, we are planning also to merge this work with the EDF - throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in + throttling patches [https://lore.kernel.org/r/cover.1266931410.git.fabio@helm.retis] but we still are in the preliminary phases of the merge and we really seek feedback that would help us decide on the direction it should take. @@ -746,21 +750,19 @@ Appendix A. Test suite of the command line options. Please refer to rt-app documentation for more details (`<rt-app-sources>/doc/*.json`). - The second testing application is a modification of schedtool, called - schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a - certain pid/application. schedtool-dl is available at: - https://github.com/scheduler-tools/schedtool-dl.git. + The second test can be done using chrt, which has support + for SCHED_DEADLINE. The usage is straightforward:: - # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app + # chrt -d -T 10000000 -D 100000000 0 ./my_cpuhog_app With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation - of 10ms every 100ms (note that parameters are expressed in microseconds). You can also use schedtool to create a reservation for an already running + of 10ms every 100ms (note that parameters are expressed in nanoseconds).
+ You can also use chrt to create a reservation for an already running application, given that you know its pid:: - # schedtool -E -t 10000000:100000000 my_app_pid + # chrt -d -T 10000000 -D 100000000 -p 0 my_app_pid Appendix B. Minimal main() ========================== diff --git a/Documentation/scheduler/sched-debug.rst b/Documentation/scheduler/sched-debug.rst new file mode 100644 index 000000000000..b5a92a39eccd --- /dev/null +++ b/Documentation/scheduler/sched-debug.rst @@ -0,0 +1,54 @@ +================= +Scheduler debugfs +================= + +Booting a kernel with debugfs enabled will give access to +scheduler-specific debug files under /sys/kernel/debug/sched. Some of +those files are described below. + +numa_balancing +============== + +The `numa_balancing` directory holds files that control the NUMA +balancing feature. If the system overhead from the feature is too +high, the rate at which the kernel samples for NUMA hinting faults may be +controlled by the `scan_period_min_ms, scan_delay_ms, +scan_period_max_ms, scan_size_mb` files. + + +scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb +------------------------------------------------------------------- + +Automatic NUMA balancing scans a task's address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a task's memory is migrated to a local node if the +workload pattern changes, which minimises the performance impact of remote +memory accesses. These files control the thresholds for scan delays and +the number of pages scanned. + +``scan_period_min_ms`` is the minimum time in milliseconds to scan a +task's virtual memory. It effectively controls the maximum scanning +rate for each task. + +``scan_delay_ms`` is the starting "scan delay" used for a task when it +initially forks. + +``scan_period_max_ms`` is the maximum time in milliseconds to scan a +task's virtual memory. It effectively controls the minimum scanning +rate for each task. + +``scan_size_mb`` is how many megabytes worth of pages are scanned for +a given scan. diff --git a/Documentation/scheduler/sched-design-CFS.rst b/Documentation/scheduler/sched-design-CFS.rst index a96c72651877..b574a2644c77 100644 --- a/Documentation/scheduler/sched-design-CFS.rst +++ b/Documentation/scheduler/sched-design-CFS.rst @@ -1,3 +1,5 @@ +.. _sched_design_CFS: + ============= CFS Scheduler ============= @@ -6,10 +8,12 @@ CFS Scheduler 1. OVERVIEW ============ -CFS stands for "Completely Fair Scheduler," and is the new "desktop" process -scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the -replacement for the previous vanilla scheduler's SCHED_OTHER interactivity -code.
+CFS stands for "Completely Fair Scheduler," and is the "desktop" process +scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. When +originally merged, it was the replacement for the previous vanilla +scheduler's SCHED_OTHER interactivity code. Nowadays, CFS is making room +for EEVDF, for which documentation can be found in +Documentation/scheduler/sched-eevdf.rst. 80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware. @@ -34,9 +38,9 @@ In CFS the virtual runtime is expressed and tracked via the per-task p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately timestamp and measure the "expected CPU time" a task should have gotten. -[ small detail: on "ideal" hardware, at any time all tasks would have the same - p->se.vruntime value --- i.e., tasks would execute simultaneously and no task - would ever get "out of balance" from the "ideal" share of CPU time. ] + Small detail: on "ideal" hardware, at any time all tasks would have the same + p->se.vruntime value --- i.e., tasks would execute simultaneously and no task + would ever get "out of balance" from the "ideal" share of CPU time. CFS's task picking logic is based on this p->se.vruntime value and it is thus very simple: it always tries to run the task with the smallest p->se.vruntime @@ -92,14 +96,17 @@ picked and the current task is preempted. CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the way the previous scheduler had, and has no heuristics whatsoever. There is -only one central tunable (you have to switch on CONFIG_SCHED_DEBUG): +only one central tunable: - /proc/sys/kernel/sched_min_granularity_ns + /sys/kernel/debug/sched/base_slice_ns which can be used to tune the scheduler from "desktop" (i.e., low latencies) to "server" (i.e., good batching) workloads. It defaults to a setting suitable for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too. +In case CONFIG_HZ results in base_slice_ns < TICK_NSEC, the value of +base_slice_ns will have little to no impact on the workloads. + Due to its design, the CFS scheduler is not prone to any of the "attacks" that exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all work fine and do not impact @@ -180,7 +187,7 @@ This is the (partial) list of the hooks: compat_yield sysctl is turned on; in that case, it places the scheduling entity at the right-most end of the red-black tree. - - check_preempt_curr(...) + - wakeup_preempt(...) This function checks if a task that entered the runnable state should preempt the currently running task. @@ -189,10 +196,10 @@ This is the (partial) list of the hooks: This function chooses the most appropriate task eligible to run next. - - set_curr_task(...) + - set_next_task(...) - This function is called when a task changes its scheduling class or changes - its task group. + This function is called when a task changes its scheduling class, changes + its task group or is scheduled. - task_tick(...) diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst index 5c4b7f4f0062..15e3a4cb304a 100644 --- a/Documentation/scheduler/sched-domains.rst +++ b/Documentation/scheduler/sched-domains.rst @@ -31,21 +31,21 @@ is treated as one entity. 
The load of a group is defined as the sum of the load of each of its member CPUs, and only when the load of a group becomes out of balance are tasks moved between groups. -In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU -through scheduler_tick(). It raises a softirq after the next regularly scheduled +In kernel/sched/core.c, sched_balance_trigger() is run periodically on each CPU +through sched_tick(). It raises a softirq after the next regularly scheduled rebalancing event for the current runqueue has arrived. The actual load -balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run +balancing workhorse, sched_balance_softirq()->sched_balance_domains(), is then run in softirq context (SCHED_SOFTIRQ). -The latter function takes two arguments: the current CPU and whether it was idle -at the time the scheduler_tick() happened and iterates over all sched domains -our CPU is on, starting from its base domain and going up the ->parent chain. -While doing that, it checks to see if the current domain has exhausted its -rebalance interval. If so, it runs load_balance() on that domain. It then checks +The latter function takes two arguments: the runqueue of the current CPU and whether +the CPU was idle at the time the sched_tick() happened. It iterates over all +sched domains our CPU is on, starting from its base domain and going up the ->parent +chain. While doing that, it checks to see if the current domain has exhausted its +rebalance interval. If so, it runs sched_balance_rq() on that domain. It then checks the parent sched_domain (if it exists), and the parent of the parent and so forth. -Initially, load_balance() finds the busiest group in the current sched domain. +Initially, sched_balance_rq() finds the busiest group in the current sched domain. If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in that group. If it manages to find such a runqueue, it locks both our initial CPU's runqueue and the newly found busiest one and starts moving tasks from it @@ -65,21 +65,16 @@ of the SMP domain will span the entire machine, with each group having the cpumask of a node. Or you could do multi-level NUMA; Opteron, for example, might have just one domain covering its one NUMA level. -The implementor should read comments in include/linux/sched.h: -struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of -the specifics and what to tune. +The implementor should read comments in include/linux/sched/sd_flags.h: +SD_* to get an idea of the specifics and what to tune for the SD flags +of a sched_domain. -Architectures may retain the regular override the default SD_*_INIT flags -while using the generic domain builder in kernel/sched/core.c if they wish to -retain the traditional SMT->SMP->NUMA topology (or some subset of that). This -can be done by #define'ing ARCH_HASH_SCHED_TUNE. +Architectures may override the generic domain builder and the default SD flags +for a given topology level by creating a sched_domain_topology_level array and +calling set_sched_topology() with this array as the parameter. -Alternatively, the architecture may completely override the generic domain -builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your -arch_init_sched_domains function. This function will attach domains to all -CPUs using cpu_attach_domain. - -The sched-domains debugging infrastructure can be enabled by enabling -CONFIG_SCHED_DEBUG.
This enables an error checking parse of the sched domains -which should catch most possible errors (described above). It also prints out -the domain structure in a visual format. +The sched-domains debugging infrastructure can be enabled by adding +'sched_verbose' to your cmdline. If you forgot to tweak your cmdline, you can +also flip the /sys/kernel/debug/sched/verbose knob. This enables an error +checking parse of the sched domains which should catch most possible errors +(described above). It also prints out the domain structure in a visual format. diff --git a/Documentation/scheduler/sched-eevdf.rst b/Documentation/scheduler/sched-eevdf.rst new file mode 100644 index 000000000000..83efe7c0a30d --- /dev/null +++ b/Documentation/scheduler/sched-eevdf.rst @@ -0,0 +1,43 @@ +=============== +EEVDF Scheduler +=============== + +The "Earliest Eligible Virtual Deadline First" (EEVDF) scheduler was first +introduced in a scientific publication in 1995 [1]. The Linux kernel began +transitioning to EEVDF in version 6.6 (as a new option in 2024), moving +away from the earlier Completely Fair Scheduler (CFS) in favor of a version +of EEVDF proposed by Peter Zijlstra in 2023 [2-4]. More information +regarding CFS can be found in +Documentation/scheduler/sched-design-CFS.rst. + +Similarly to CFS, EEVDF aims to distribute CPU time equally among all +runnable tasks with the same priority. To do so, it assigns a virtual run +time to each task, creating a "lag" value that can be used to determine +whether a task has received its fair share of CPU time. In this way, a task +with a positive lag is owed CPU time, while a negative lag means the task +has exceeded its portion. EEVDF picks tasks with a lag greater than or equal +to zero and calculates a virtual deadline (VD) for each, selecting the task +with the earliest VD to execute next. It's important to note that this +allows latency-sensitive tasks with shorter time slices to be prioritized, +which helps with their responsiveness. + +There are ongoing discussions on how to manage lag, especially for sleeping +tasks, but at the time of writing EEVDF uses a "decaying" mechanism based +on virtual run time (VRT). This prevents tasks from exploiting the system +by sleeping briefly to reset their negative lag: when a task sleeps, it +remains on the run queue but marked for "deferred dequeue," allowing its +lag to decay over VRT. Hence, long-sleeping tasks eventually have their lag +reset. Finally, tasks can preempt others if their VD is earlier, and tasks +can request specific time slices using the sched_setattr() system call, +which further facilitates the job of latency-sensitive applications. + +REFERENCES +========== + +[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564 + +[2] https://lore.kernel.org/lkml/a79014e6-ea83-b316-1e12-2ae056bda6fa@linux.vnet.ibm.com/ + +[3] https://lwn.net/Articles/969062/ + +[4] https://lwn.net/Articles/925371/ diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst index 9580c57a52bc..70e2921ef725 100644 --- a/Documentation/scheduler/sched-energy.rst +++ b/Documentation/scheduler/sched-energy.rst @@ -82,7 +82,7 @@ through the arch_scale_cpu_capacity() callback. The rest of platform knowledge used by EAS is directly read from the Energy Model (EM) framework. The EM of a platform is composed of a power cost table per 'performance domain' in the system (see Documentation/power/energy-model.rst -for futher details about performance domains).
+for further details about performance domains). The scheduler manages references to the EM objects in the topology code when the scheduling domains are built, or re-built. For each root domain (rd), the @@ -281,7 +281,7 @@ mechanism called 'over-utilization'. From a general standpoint, the use-cases where EAS can help the most are those involving a light/medium CPU utilization. Whenever long CPU-bound tasks are being run, they will require all of the available CPU capacity, and there isn't -much that can be done by the scheduler to save energy without severly harming +much that can be done by the scheduler to save energy without severely harming throughput. In order to avoid hurting performance with EAS, CPUs are flagged as 'over-utilized' as soon as they are used at more than 80% of their compute capacity. As long as no CPUs are over-utilized in a root domain, load balancing @@ -328,19 +328,11 @@ section lists these dependencies and provides hints as to how they can be met. As mentioned in the introduction, EAS is only supported on platforms with asymmetric CPU topologies for now. This requirement is checked at run-time by -looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling +looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling domains are built. -The flag is set/cleared automatically by the scheduler topology code whenever -there are CPUs with different capacities in a root domain. The capacities of -CPUs are provided by arch-specific code through the arch_scale_cpu_capacity() -callback. As an example, arm and arm64 share an implementation of this callback -which uses a combination of CPUFreq data and device-tree bindings to compute the -capacity of CPUs (see drivers/base/arch_topology.c for more details). - -So, in order to use EAS on your platform your architecture must implement the -arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower -capacity than others. +See Documentation/scheduler/sched-capacity.rst for requirements to be met for this +flag to be set in the sched_domain hierarchy. Please note that EAS is not fundamentally incompatible with SMP, but no significant savings on SMP platforms have been observed yet. This restriction @@ -358,36 +350,18 @@ independent EM framework in Documentation/power/energy-model.rst. Please also note that the scheduling domains need to be re-built after the EM has been registered in order to start EAS. +EAS uses the EM to forecast energy usage, and is thus mainly interested in +the energy difference between the possible options for task placement. For +EAS it doesn't matter whether the EM power values are expressed in +milli-Watts or in an 'abstract scale'. + 6.3 - Energy Model complexity ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The task wake-up path is very latency-sensitive. When the EM of a platform is -too complex (too many CPUs, too many performance domains, too many performance -states, ...), the cost of using it in the wake-up path can become prohibitive. -The energy-aware wake-up algorithm has a complexity of: - - C = Nd * (Nc + Ns) - -with: Nd the number of performance domains; Nc the number of CPUs; and Ns the -total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8). - -A complexity check is performed at the root domain level, when scheduling -domains are built. EAS will not start on a root domain if its C happens to be -higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the -time of writing).
- -If you really want to use EAS but the complexity of your platform's Energy -Model is too high to be used with a single root domain, you're left with only -two possible options: - - 1. split your system into separate, smaller, root domains using exclusive - cpusets and enable EAS locally on each of them. This option has the - benefit to work out of the box but the drawback of preventing load - balance between root domains, which can result in an unbalanced system - overall; - 2. submit patches to reduce the complexity of the EAS wake-up algorithm, - hence enabling it to cope with larger EMs in reasonable time. +EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but +restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during +the energy estimation. 6.4 - Schedutil governor diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst new file mode 100644 index 000000000000..a1869c38046e --- /dev/null +++ b/Documentation/scheduler/sched-ext.rst @@ -0,0 +1,362 @@ +.. _sched-ext: + +========================== +Extensible Scheduler Class +========================== + +sched_ext is a scheduler class whose behavior can be defined by a set of BPF +programs - the BPF scheduler. + +* sched_ext exports a full scheduling interface so that any scheduling + algorithm can be implemented on top. + +* The BPF scheduler can group CPUs however it sees fit and schedule them + together, as tasks aren't tied to specific CPUs at the time of wakeup. + +* The BPF scheduler can be turned on and off dynamically anytime. + +* The system integrity is maintained no matter what the BPF scheduler does. + The default scheduling behavior is restored anytime an error is detected, + a runnable task stalls, or on invoking the SysRq key sequence + `SysRq-S`. + +* When the BPF scheduler triggers an error, debug information is dumped to + aid debugging. The debug dump is passed to and printed out by the + scheduler binary. The debug dump can also be accessed through the + `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D` + triggers a debug dump. This doesn't terminate the BPF scheduler and can + only be read through the tracepoint. + +Switching to and from sched_ext +=============================== + +``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and +``tools/sched_ext`` contains the example schedulers. The following config +options should be enabled to use sched_ext: + +.. code-block:: none + + CONFIG_BPF=y + CONFIG_SCHED_CLASS_EXT=y + CONFIG_BPF_SYSCALL=y + CONFIG_BPF_JIT=y + CONFIG_DEBUG_INFO_BTF=y + CONFIG_BPF_JIT_ALWAYS_ON=y + CONFIG_BPF_JIT_DEFAULT_ON=y + CONFIG_PAHOLE_HAS_SPLIT_BTF=y + CONFIG_PAHOLE_HAS_BTF_TAG=y + +sched_ext is used only when the BPF scheduler is loaded and running. + +If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be +treated as ``SCHED_NORMAL`` and scheduled by the fair-class scheduler until the +BPF scheduler is loaded. + +When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set +in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and +``SCHED_EXT`` tasks are scheduled by sched_ext. + +However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is +set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled +by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and +``SCHED_IDLE`` policies are scheduled by the fair-class scheduler. 
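In the partial mode, a task opts in by switching its policy to ``SCHED_EXT``. A minimal sketch follows; the policy number mirrors ``SCHED_EXT`` in current ``include/uapi/linux/sched.h`` and should be verified against your headers:

.. code-block:: c

    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    #ifndef SCHED_EXT
    #define SCHED_EXT 7 /* mirrors include/uapi/linux/sched.h; verify locally */
    #endif

    int main(void)
    {
            struct sched_param param;

            memset(&param, 0, sizeof(param));
            /*
             * Opt the calling process into sched_ext. Until a BPF scheduler
             * is loaded, the kernel treats this task as SCHED_NORMAL.
             */
            if (sched_setscheduler(0, SCHED_EXT, &param))
                    perror("sched_setscheduler");
            return 0;
    }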
+ +Terminating the sched_ext scheduler program, triggering `SysRq-S`, or the +detection of any internal error, including stalled runnable tasks, aborts the +BPF scheduler and reverts all tasks back to the fair-class scheduler. + +.. code-block:: none + + # make -j16 -C tools/sched_ext + # tools/sched_ext/build/bin/scx_simple + local=0 global=3 + local=5 global=24 + local=9 global=44 + local=13 global=56 + local=17 global=72 + ^CEXIT: BPF scheduler unregistered + +The current status of the BPF scheduler can be determined as follows: + +.. code-block:: none + + # cat /sys/kernel/sched_ext/state + enabled + # cat /sys/kernel/sched_ext/root/ops + simple + +You can check if any BPF scheduler has ever been loaded since boot by examining +this monotonically incrementing counter (a value of zero indicates that no BPF +scheduler has been loaded): + +.. code-block:: none + + # cat /sys/kernel/sched_ext/enable_seq + 1 + +``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more +detailed information: + +.. code-block:: none + + # tools/sched_ext/scx_show_state.py + ops : simple + enabled : 1 + switching_all : 1 + switched_all : 1 + enable_state : enabled (2) + bypass_depth : 0 + nr_rejected : 0 + enable_seq : 1 + +Whether a given task is on sched_ext can be determined as follows: + +.. code-block:: none + + # grep ext /proc/self/sched + ext.enabled : 1 + +The Basics +========== + +Userspace can implement an arbitrary BPF scheduler by loading a set of BPF +programs that implement ``struct sched_ext_ops``. The only mandatory field +is ``ops.name`` which must be a valid BPF object name. All operations are +optional. The following modified excerpt is from +``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. + +.. code-block:: c + + /* + * Decide which CPU a task should be migrated to before being + * enqueued (either at wakeup, fork time, or exec time). If an + * idle core is found by the default ops.select_cpu() implementation, + * then insert the task directly into SCX_DSQ_LOCAL and skip the + * ops.enqueue() callback. + * + * Note that this implementation has exactly the same behavior as the + * default ops.select_cpu implementation. The behavior of the scheduler + * would be exactly the same if the implementation just didn't define the + * simple_select_cpu() struct_ops prog. + */ + s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) + { + s32 cpu; + /* Need to initialize or the BPF verifier will reject the program */ + bool direct = false; + + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); + + if (direct) + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + + return cpu; + } + + /* + * Do a direct insertion of a task to the global DSQ. This ops.enqueue() + * callback will only be invoked if we failed to find a core to insert + * into in ops.select_cpu() above. + * + * Note that this implementation has exactly the same behavior as the + * default ops.enqueue implementation, which just dispatches the task + * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly the + * same if the implementation just didn't define the simple_enqueue + * struct_ops prog. + */ + void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) + { + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + } + + s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) + { + /* + * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and + * SCHED_BATCH tasks should use sched_ext.
+ */ + return 0; + } + + void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) + { + exit_type = ei->type; + } + + SEC(".struct_ops") + struct sched_ext_ops simple_ops = { + .select_cpu = (void *)simple_select_cpu, + .enqueue = (void *)simple_enqueue, + .init = (void *)simple_init, + .exit = (void *)simple_exit, + .name = "simple", + }; + +Dispatch Queues +--------------- + +To match the impedance between the scheduler core and the BPF scheduler, +sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a +priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), +and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage +an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and +``scx_bpf_destroy_dsq()``. + +A CPU always executes a task from its local DSQ. A task is "inserted" into a +DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ. + +When a CPU is looking for the next task to run, if the local DSQ is not +empty, the first task is picked. Otherwise, the CPU tries to move a task +from the global DSQ. If that doesn't yield a runnable task either, +``ops.dispatch()`` is invoked. + +Scheduling Cycle +---------------- + +The following briefly shows how a waking task is scheduled and executed. + +1. When a task is waking up, ``ops.select_cpu()`` is the first operation + invoked. This serves two purposes. First, CPU selection optimization + hint. Second, waking up the selected CPU if idle. + + The CPU selected by ``ops.select_cpu()`` is an optimization hint and not + binding. The actual decision is made at the last step of scheduling. + However, there is a small performance gain if the CPU + ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. + + A side-effect of selecting a CPU is waking it up from idle. While a BPF + scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, + using ``ops.select_cpu()`` judiciously can be simpler and more efficient. + + A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` + by calling ``scx_bpf_dsq_insert()``. If the task is inserted into + ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the + local DSQ of whichever CPU is returned from ``ops.select_cpu()``. + Additionally, inserting directly from ``ops.select_cpu()`` will cause the + ``ops.enqueue()`` callback to be skipped. + + Note that the scheduler core will ignore an invalid CPU selection, for + example, if it's outside the allowed cpumask of the task. + +2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the + task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` + can make one of the following decisions: + + * Immediately insert the task into either the global or a local DSQ by + calling ``scx_bpf_dsq_insert()`` with one of the following options: + ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``. + + * Immediately insert the task into a custom DSQ by calling + ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63. + + * Queue the task on the BPF side. + +3. When a CPU is ready to schedule, it first looks at its local DSQ. If + empty, it then looks at the global DSQ. If there still isn't a task to + run, ``ops.dispatch()`` is invoked which can use the following two + functions to populate the local DSQ. + + * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. 
Any target DSQ can be
+     used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
+     currently can't be called with BPF locks held, this is being worked on
+     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertions
+     rather than performing them immediately. There can be up to
+     ``ops.dispatch_max_batch`` pending tasks.
+
+   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
+     DSQ to the dispatching DSQ. This function cannot be called with any BPF
+     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
+     before trying to move from the specified DSQ.
+
+4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
+   the CPU runs the first one. If empty, the following steps are taken:
+
+   * Try to move from the global DSQ. If successful, run the task.
+
+   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
+
+   * If the previous task is an SCX task and still runnable, keep executing
+     it (see ``SCX_OPS_ENQ_LAST``).
+
+   * Go idle.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``ops.enqueue()`` as illustrated in the above simple example. If only the
+built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
+a task is never queued on the BPF scheduler and both the local and global
+DSQs are executed automatically.
+
+``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
+``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
+``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
+dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
+the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
+for more information.
+
+Task Lifecycle
+--------------
+
+The following pseudo-code summarizes the entire lifecycle of a task managed
+by a sched_ext scheduler:
+
+.. code-block:: c
+
+    ops.init_task();            /* A new task is created */
+    ops.enable();               /* Enable BPF scheduling for the task */
+
+    while (task in SCHED_EXT) {
+        if (task can migrate)
+            ops.select_cpu();   /* Called on wakeup (optimization) */
+
+        ops.runnable();         /* Task becomes ready to run */
+
+        while (task is runnable) {
+            if (task is not in a DSQ) {
+                ops.enqueue();  /* Task can be added to a DSQ */
+
+                /* A CPU becomes available */
+
+                ops.dispatch(); /* Task is moved to a local DSQ */
+            }
+            ops.running();      /* Task starts running on its assigned CPU */
+            ops.tick();         /* Called every 1/HZ seconds */
+            ops.stopping();     /* Task stops running (time slice expires or wait) */
+        }
+
+        ops.quiescent();        /* Task releases its assigned CPU (wait) */
+    }
+
+    ops.disable();              /* Disable BPF scheduling for the task */
+    ops.exit_task();            /* Task is destroyed */
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+  and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+  The functions prefixed with ``scx_bpf_`` can be called from the BPF
+  scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
+    custom DSQ.
+
+  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
+    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF scheduler programs have no stability
+guarantees.
This includes the ops table callbacks and constants defined in +``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in +``kernel/sched/ext.c``. + +While we will attempt to provide a relatively stable API surface when +possible, they are subject to change without warning between kernel +versions. diff --git a/Documentation/scheduler/sched-nice-design.rst b/Documentation/scheduler/sched-nice-design.rst index 0571f1b47e64..889bf2b737dc 100644 --- a/Documentation/scheduler/sched-nice-design.rst +++ b/Documentation/scheduler/sched-nice-design.rst @@ -60,7 +60,7 @@ within the constraints of HZ and jiffies and their nasty design level coupling to timeslices and granularity it was not really viable. The second (less frequent but still periodically occurring) complaint -about Linux's nice level support was its assymetry around the origo +about Linux's nice level support was its asymmetry around the origin (which you can see demonstrated in the picture above), or more accurately: the fact that nice level behavior depended on the _absolute_ nice level as well, while the nice API itself is fundamentally diff --git a/Documentation/scheduler/sched-rt-group.rst b/Documentation/scheduler/sched-rt-group.rst index 655a096ec8fb..ab464335d320 100644 --- a/Documentation/scheduler/sched-rt-group.rst +++ b/Documentation/scheduler/sched-rt-group.rst @@ -39,10 +39,10 @@ Most notable: 1.1 The problem --------------- -Realtime scheduling is all about determinism, a group has to be able to rely on +Real-time scheduling is all about determinism, a group has to be able to rely on the amount of bandwidth (eg. CPU time) being constant. In order to schedule -multiple groups of realtime tasks, each group must be assigned a fixed portion -of the CPU time available. Without a minimum guarantee a realtime group can +multiple groups of real-time tasks, each group must be assigned a fixed portion +of the CPU time available. Without a minimum guarantee a real-time group can obviously fall short. A fuzzy upper limit is of no use since it cannot be relied upon. Which leaves us with just the single fixed portion. @@ -50,14 +50,14 @@ relied upon. Which leaves us with just the single fixed portion. ---------------- CPU time is divided by means of specifying how much time can be spent running -in a given period. We allocate this "run time" for each realtime group which -the other realtime groups will not be permitted to use. +in a given period. We allocate this "run time" for each real-time group which +the other real-time groups will not be permitted to use. -Any time not allocated to a realtime group will be used to run normal priority +Any time not allocated to a real-time group will be used to run normal priority tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by SCHED_OTHER. -Let's consider an example: a frame fixed realtime renderer must deliver 25 +Let's consider an example: a frame fixed real-time renderer must deliver 25 frames a second, which yields a period of 0.04s per frame. Now say it will also have to play some music and respond to input, leaving it with around 80% CPU time dedicated for the graphics. We can then give this group a run time of 0.8 @@ -70,7 +70,7 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = of 0.00015s. The remaining CPU time will be used for user input and other tasks. 
Because
-realtime tasks have explicitly allocated the CPU time they need to perform
+real-time tasks have explicitly allocated the CPU time they need to perform
 their tasks, buffer underruns in the graphics or audio can be eliminated.
 
 NOTE: the above example is not fully implemented yet. We still
@@ -87,19 +87,24 @@ lack an EDF scheduler to make non-uniform periods usable.
 
 The system wide settings are configured under the /proc virtual file system:
 
 /proc/sys/kernel/sched_rt_period_us:
-  The scheduling period that is equivalent to 100% CPU bandwidth
+  The scheduling period that is equivalent to 100% CPU bandwidth.
 
 /proc/sys/kernel/sched_rt_runtime_us:
-  A global limit on how much time realtime scheduling may use. Even without
-  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
-  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
-  available to all realtime groups.
+  A global limit on how much time real-time scheduling may use. This is
+  always less than or equal to period_us, as it denotes the time allocated
+  from period_us for the real-time tasks. Without CONFIG_RT_GROUP_SCHED
+  enabled, this only serves as admission control for deadline tasks. With
+  CONFIG_RT_GROUP_SCHED=y it also signifies the total bandwidth available to
+  all real-time groups.
 
  * Time is specified in us because the interface is s32. This gives an
   operating range from 1us to about 35 minutes.
 * sched_rt_period_us takes values from 1 to INT_MAX.
- * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+ * sched_rt_runtime_us takes values from -1 to sched_rt_period_us.
 * A run time of -1 specifies runtime == period, ie. no limit.
+ * sched_rt_runtime_us/sched_rt_period_us > 0.05 in order to preserve
+   bandwidth for the fair dl_server. For an accurate value, check the
+   average of runtime/period in /sys/kernel/debug/sched/fair_server/cpuX/
 
 
 2.2 Default behaviour
@@ -108,7 +113,7 @@ The system wide settings are configured under the /proc virtual file system:
 
 The default values for sched_rt_period_us (1000000 or 1s) and
 sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by
 SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
-realtime tasks will not lock up the machine but leave a little time to recover
+real-time task will not lock up the machine but leave a little time to recover
 it. By setting runtime to -1 you'd get the old behaviour back.
 
 By default all bandwidth is assigned to the root group and new groups get the
@@ -116,10 +121,10 @@ period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If
 you want to assign bandwidth to another group, reduce the root group's
 bandwidth and assign some or all of the difference to another group.
 
-Realtime group scheduling means you have to assign a portion of total CPU
-bandwidth to the group before it will accept realtime tasks. Therefore you will
-not be able to run realtime tasks as any user other than root until you have
-done that, even if the user has the rights to run processes with realtime
+Real-time group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept real-time tasks. Therefore you will
+not be able to run real-time tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with real-time
 priority!
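+
+As a sketch of the workflow described above, assuming CONFIG_RT_GROUP_SCHED=y
+and a cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu (the group name
+is made up for illustration)::
+
+  # Reduce the root group's runtime from the 0.95s default ...
+  echo 900000 > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
+  # ... then assign part of the difference to the new group.
+  mkdir /sys/fs/cgroup/cpu/renderer
+  echo 50000 > /sys/fs/cgroup/cpu/renderer/cpu.rt_runtime_us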
diff --git a/Documentation/scheduler/sched-stats.rst b/Documentation/scheduler/sched-stats.rst
index dd9b99a025f7..d82e7d2b54f0 100644
--- a/Documentation/scheduler/sched-stats.rst
+++ b/Documentation/scheduler/sched-stats.rst
@@ -2,9 +2,22 @@
 Scheduler Statistics
 ====================
 
+Version 17 of schedstats removed the 'lb_imbalance' field, as it no longer
+has any significance, and instead added the more relevant fields
+'lb_imbalance_load', 'lb_imbalance_util', 'lb_imbalance_task' and
+'lb_imbalance_misfit'. The domain field prints the name of the
+corresponding sched domain from this version onwards.
+
+Version 16 of schedstats changed the order of definitions within
+'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
+columns in show_schedstat(). In particular, CPU_IDLE and __CPU_NOT_IDLE
+swapped places. The size of the array is unchanged.
+
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
-identical to version 14.
+identical to version 14. Details are available at
+
+  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-stats.txt?id=1e1dbb259c79b
 
 Version 14 of schedstats includes support for sched_domains, which hit the
 mainline kernel in 2.6.20 although it is identical to the stats from version
@@ -21,7 +34,14 @@ cpus on the machine, while domain0 is the most tightly focused domain,
 sometimes balancing only between pairs of cpus. At this time, there
 are no architectures which need more than three domain levels. The first
 field in the domain stats is a bit map indicating which cpus are affected
-by that domain.
+by that domain. Details are available at
+
+  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=b762f3ffb797c
+
+The schedstat documentation is maintained from version 10 onwards and was
+not updated for versions 11 and 12. The details for version 10 are
+available at
+
+  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=1da177e4c3f4
 
 These fields are counters, and only increment. Programs which make use of
 these will need to start with a baseline observation and then calculate
@@ -56,9 +76,9 @@ Next two are try_to_wake_up() statistics:
 
 Next three are statistics describing scheduling latency:
 
-    7) sum of all time spent running by tasks on this processor (in jiffies)
+    7) sum of all time spent running by tasks on this processor (in nanoseconds)
     8) sum of all time spent waiting to run by tasks on this processor (in
-       jiffies)
+       nanoseconds)
     9) # of timeslices run on this cpu
 
 
@@ -66,88 +86,97 @@ Domain statistics
 -----------------
 One of these is produced per domain for each cpu described. (Note that if
 CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
+will not appear in the output.) <name> is an extension to the domain field
+that prints the name of the corresponding sched domain. It appears in
+schedstat version 17 and above.
 
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N> <name> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 
 The first field is a bit mask indicating what cpus this domain operates over.
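+
+For example, the schedstat version and a domain line can be inspected
+directly; the output below is illustrative only, and the <name> column
+(here "MC") appears in version 17 and above::
+
+  $ head -1 /proc/schedstat
+  version 17
+  $ grep -m1 '^domain' /proc/schedstat
+  domain0 MC 00000000,000000ff 12 10 ...
+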
-The next 24 are a variety of load_balance() statistics in grouped into types -of idleness (idle, busy, and newly idle): +The next 33 are a variety of sched_balance_rq() statistics in grouped into types +of idleness (busy, idle and newly idle): - 1) # of times in this domain load_balance() was called when the + 1) # of times in this domain sched_balance_rq() was called when the + cpu was busy + 2) # of times in this domain sched_balance_rq() checked but found the + load did not require balancing when busy + 3) # of times in this domain sched_balance_rq() tried to move one or + more tasks and failed, when the cpu was busy + 4) Total imbalance in load when the cpu was busy + 5) Total imbalance in utilization when the cpu was busy + 6) Total imbalance in number of tasks when the cpu was busy + 7) Total imbalance due to misfit tasks when the cpu was busy + 8) # of times in this domain pull_task() was called when busy + 9) # of times in this domain pull_task() was called even though the + target task was cache-hot when busy + 10) # of times in this domain sched_balance_rq() was called but did not + find a busier queue while the cpu was busy + 11) # of times in this domain a busier queue was found while the cpu + was busy but no busier group was found + + 12) # of times in this domain sched_balance_rq() was called when the cpu was idle - 2) # of times in this domain load_balance() checked but found + 13) # of times in this domain sched_balance_rq() checked but found the load did not require balancing when the cpu was idle - 3) # of times in this domain load_balance() tried to move one or + 14) # of times in this domain sched_balance_rq() tried to move one or more tasks and failed, when the cpu was idle - 4) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was idle - 5) # of times in this domain pull_task() was called when the cpu + 15) Total imbalance in load when the cpu was idle + 16) Total imbalance in utilization when the cpu was idle + 17) Total imbalance in number of tasks when the cpu was idle + 18) Total imbalance due to misfit tasks when the cpu was idle + 19) # of times in this domain pull_task() was called when the cpu was idle - 6) # of times in this domain pull_task() was called even though + 20) # of times in this domain pull_task() was called even though the target task was cache-hot when idle - 7) # of times in this domain load_balance() was called but did + 21) # of times in this domain sched_balance_rq() was called but did not find a busier queue while the cpu was idle - 8) # of times in this domain a busier queue was found while the + 22) # of times in this domain a busier queue was found while the cpu was idle but no busier group was found - 9) # of times in this domain load_balance() was called when the - cpu was busy - 10) # of times in this domain load_balance() checked but found the - load did not require balancing when busy - 11) # of times in this domain load_balance() tried to move one or - more tasks and failed, when the cpu was busy - 12) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was busy - 13) # of times in this domain pull_task() was called when busy - 14) # of times in this domain pull_task() was called even though the - target task was cache-hot when busy - 15) # of times in this domain load_balance() was called but did not - find a busier queue while the cpu was busy - 16) # of times in this domain a busier queue was found while the cpu - was busy but 
no busier group was found - 17) # of times in this domain load_balance() was called when the + 23) # of times in this domain sched_balance_rq() was called when the cpu was just becoming idle - 18) # of times in this domain load_balance() checked but found the + 24) # of times in this domain sched_balance_rq() checked but found the load did not require balancing when the cpu was just becoming idle - 19) # of times in this domain load_balance() tried to move one or more + 25) # of times in this domain sched_balance_rq() tried to move one or more tasks and failed, when the cpu was just becoming idle - 20) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was just becoming idle - 21) # of times in this domain pull_task() was called when newly idle - 22) # of times in this domain pull_task() was called even though the + 26) Total imbalance in load when the cpu was just becoming idle + 27) Total imbalance in utilization when the cpu was just becoming idle + 28) Total imbalance in number of tasks when the cpu was just becoming idle + 29) Total imbalance due to misfit tasks when the cpu was just becoming idle + 30) # of times in this domain pull_task() was called when newly idle + 31) # of times in this domain pull_task() was called even though the target task was cache-hot when just becoming idle - 23) # of times in this domain load_balance() was called but did not + 32) # of times in this domain sched_balance_rq() was called but did not find a busier queue while the cpu was just becoming idle - 24) # of times in this domain a busier queue was found while the cpu + 33) # of times in this domain a busier queue was found while the cpu was just becoming idle but no busier group was found Next three are active_load_balance() statistics: - 25) # of times active_load_balance() was called - 26) # of times active_load_balance() tried to move a task and failed - 27) # of times active_load_balance() successfully moved a task + 34) # of times active_load_balance() was called + 35) # of times active_load_balance() tried to move a task and failed + 36) # of times active_load_balance() successfully moved a task Next three are sched_balance_exec() statistics: - 28) sbe_cnt is not used - 29) sbe_balanced is not used - 30) sbe_pushed is not used + 37) sbe_cnt is not used + 38) sbe_balanced is not used + 39) sbe_pushed is not used Next three are sched_balance_fork() statistics: - 31) sbf_cnt is not used - 32) sbf_balanced is not used - 33) sbf_pushed is not used + 40) sbf_cnt is not used + 41) sbf_balanced is not used + 42) sbf_pushed is not used Next three are try_to_wake_up() statistics: - 34) # of times in this domain try_to_wake_up() awoke a task that + 43) # of times in this domain try_to_wake_up() awoke a task that last ran on a different cpu in this domain - 35) # of times in this domain try_to_wake_up() moved a task to the + 44) # of times in this domain try_to_wake_up() moved a task to the waking cpu because it was cache-cold on its own cpu anyway - 36) # of times in this domain try_to_wake_up() started passive balancing + 45) # of times in this domain try_to_wake_up() started passive balancing /proc/<pid>/schedstat --------------------- @@ -155,8 +184,8 @@ schedstats also adds a new /proc/<pid>/schedstat file to include some of the same information on a per-process level. 
There are three fields in this file correlating for that process to:
 
-     1) time spent on the cpu
-     2) time spent waiting on a runqueue
+     1) time spent on the cpu (in nanoseconds)
+     2) time spent waiting on a runqueue (in nanoseconds)
      3) # of timeslices run on this cpu
 
 A program could be easily written to make use of these extra fields to
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
new file mode 100644
index 000000000000..74d5b7c6431d
--- /dev/null
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -0,0 +1,741 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Utilization Clamping
+====================
+
+1. Introduction
+===============
+
+Utilization clamping, also known as util clamp or uclamp, is a scheduler
+feature that allows user space to help in managing the performance
+requirements of tasks. It was introduced in the v5.3 release. The CGroup
+support was merged in v5.4.
+
+Uclamp is a hinting mechanism that allows the scheduler to understand the
+performance requirements and restrictions of the tasks, thus it helps the
+scheduler to make better decisions. When the schedutil cpufreq governor is
+used, util clamp will influence the CPU frequency selection as well.
+
+Since the scheduler and schedutil are both driven by PELT (util_avg)
+signals, util clamp acts on that to achieve its goal by clamping the signal
+to a certain point; hence the name. That is, by clamping utilization we are
+making the system run at a certain performance point.
+
+The right way to view util clamp is as a mechanism to make requests or
+hints about performance constraints. It consists of two tunables:
+
+        * UCLAMP_MIN, which sets the lower bound.
+        * UCLAMP_MAX, which sets the upper bound.
+
+These two bounds will ensure a task will operate within this performance
+range of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX
+implies capping a task.
+
+One can tell the system (scheduler) that some tasks require a minimum
+performance point to operate at to deliver the desired user experience. Or
+one can tell the system that some tasks should be restricted from consuming
+too many resources and should not go above a specific performance point.
+Viewing the uclamp values as performance points rather than utilization is
+a better abstraction from the user space point of view.
+
+As an example, a game can use util clamp to form a feedback loop with its
+perceived Frames Per Second (FPS). It can dynamically increase the minimum
+performance point required by its display pipeline to ensure no frame is
+dropped. It can also dynamically 'prime' up these tasks if it knows in the
+coming few hundred milliseconds a computationally intensive scene is about
+to happen.
+
+On mobile hardware where the capability of the devices varies a lot, this
+dynamic feedback loop offers a great flexibility to ensure best user
+experience given the capabilities of any system.
+
+Of course a static configuration is possible too. The exact usage will
+depend on the system, application and the desired outcome.
+
+Another example is in Android where tasks are classified as background,
+foreground, top-app, etc. Util clamp can be used to constrain how much
+resources background tasks are consuming by capping the performance point
+they can run at. This constraint helps reserve resources for important
+tasks, like the ones belonging to the currently active app (top-app group).
+Besides, this helps in limiting how much power they consume. This can be
+more obvious in heterogeneous systems (e.g. Arm big.LITTLE); the constraint
+will help bias the background tasks to stay on the little cores, which will
+ensure that:
+
+        1. The big cores are free to run top-app tasks immediately. top-app
+           tasks are the tasks the user is currently interacting with,
+           hence the most important tasks in the system.
+        2. They don't run on a power hungry core and drain battery even if
+           they are CPU intensive tasks.
+
+.. note::
+   **little cores**:
+        CPUs with capacity < 1024
+
+   **big cores**:
+        CPUs with capacity = 1024
+
+By making these uclamp performance requests, or rather hints, user space
+can ensure system resources are used optimally to deliver the best possible
+user experience.
+
+Another use case is to help with **overcoming the ramp-up latency inherent
+in how the scheduler utilization signal is calculated**.
+
+For instance, a busy task that needs to run at the maximum performance
+point will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) before the
+scheduler realizes that. This is known to affect workloads like gaming on
+mobile devices where frames will drop due to slow response time to select
+the higher frequency required for the tasks to finish their work in time.
+Setting UCLAMP_MIN=1024 will ensure such tasks will always see the highest
+performance level when they start running.
+
+The overall visible effect goes beyond better perceived user
+experience/performance and stretches to help achieve a better overall
+performance/watt if used effectively.
+
+User space can form a feedback loop with the thermal subsystem too to
+ensure the device doesn't heat up to the point where it will throttle.
+
+Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.
+
+In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
+performance point rather than being tied to MAX frequency all the time,
+which can be useful on general purpose systems that run on battery powered
+devices.
+
+Note that by design RT tasks don't have a per-task PELT signal and must
+always run at a constant frequency to combat nondeterministic DVFS ramp-up
+delays.
+
+Note that using schedutil always implies a single delay to modify the
+frequency when an RT task wakes up. This cost is unchanged by using uclamp.
+Uclamp only helps picking what frequency to request instead of schedutil
+always requesting MAX for all RT tasks.
+
+See :ref:`section 3.4 <uclamp-default-values>` for default values and
+:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change the RT
+tasks default value.
+
+2. Design
+=========
+
+Util clamp is a property of every task in the system. It sets the
+boundaries of its utilization signal; acting as a bias mechanism that
+influences certain decisions within the scheduler.
+
+The actual utilization signal of a task is never clamped in reality. If you
+inspect PELT signals at any point in time you should continue to see them
+intact. Clamping happens only when needed, e.g.: when a task wakes up and
+the scheduler needs to select a suitable CPU for it to run on.
+
+Since the goal of util clamp is to allow requesting a minimum and maximum
+performance point for a task to run on, it must be able to influence the
+frequency selection as well as task placement to be most effective. Both
+have implications on the utilization value at the CPU runqueue (rq for
+short) level, which brings us to the main design challenge.
+
+When a task wakes up on an rq, the utilization signal of the rq will be
+affected by the uclamp settings of all the tasks enqueued on it. For
+example if a task requests to run at UTIL_MIN = 512, then the util signal
+of the rq needs to respect this request as well as all other requests from
+all of the enqueued tasks.
+
+To be able to aggregate the util clamp value of all the tasks attached to
+the rq, uclamp must do some housekeeping at every enqueue/dequeue, which is
+the scheduler hot path. Hence care must be taken since any slowdown will
+have a significant impact on a lot of use cases and could hinder its
+usability in practice.
+
+The way this is handled is by dividing the utilization range into buckets
+(struct uclamp_bucket) which allows us to reduce the search space from
+every task on the rq to only a subset of tasks on the top-most bucket.
+
+When a task is enqueued, the counter in the matching bucket is incremented,
+and on dequeue it is decremented. This makes keeping track of the effective
+uclamp value at rq level a lot easier.
+
+As tasks are enqueued and dequeued, we keep track of the current effective
+uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details
+on how this works.
+
+Later, any path that wants to identify the effective uclamp value of the rq
+simply needs to read this effective uclamp value of the rq at the exact
+moment in time it needs to take a decision.
+
+For the task placement case, only Energy Aware and Capacity Aware
+Scheduling (EAS/CAS) make use of uclamp for now, which implies that it is
+applied on heterogeneous systems only.
+When a task wakes up, the scheduler will look at the current effective
+uclamp value of every rq and compare it with the potential new value if the
+task were to be enqueued there, favoring the rq that will end up with the
+most energy efficient combination.
+
+Similarly in schedutil, when it needs to make a frequency update it will
+look at the current effective uclamp value of the rq, which is influenced
+by the set of tasks currently enqueued there, and select the appropriate
+frequency that will satisfy the constraints from the requests.
+
+Other paths like setting the overutilization state (which effectively
+disables EAS) make use of uclamp as well. Such cases are considered
+necessary housekeeping to allow the two main use cases above and will not
+be covered in detail here as they could change with implementation details.
+
+.. _uclamp-buckets:
+
+2.1. Buckets
+------------
+
+::
+
+                           [struct rq]
+
+  (bottom)                                                    (top)
+
+    0                                                          1024
+    |                                                           |
+    +-----------+-----------+-----------+----   ----+-----------+
+    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
+    +-----------+-----------+-----------+----   ----+-----------+
+       :           :                                   :
+       +- p0       +- p3                               +- p4
+       :           :
+       +- p1       +- p5
+       :
+       +- p2
+
+.. note::
+   The diagram above is an illustration rather than a true depiction of the
+   internal data structure.
+
+To reduce the search space when trying to decide the effective uclamp value
+of an rq as tasks are enqueued/dequeued, the whole utilization range is
+divided into N buckets where N is configured at compile time by setting
+CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
+
+The rq has a bucket for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].
+
+The range of each bucket is 1024/N. For example, for the default value of
+5 there will be 5 buckets, each of which will cover the following range:
+
+::
+
+        DELTA = round_closest(1024/5) = round_closest(204.8) = 205
+
+        Bucket 0: [0:204]
+        Bucket 1: [205:409]
+        Bucket 2: [410:614]
+        Bucket 3: [615:819]
+        Bucket 4: [820:1024]
+
+When a task p with the following tunable parameters
+
+::
+
+        p->uclamp[UCLAMP_MIN] = 300
+        p->uclamp[UCLAMP_MAX] = 1024
+
+is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and
+bucket 4 will be incremented for UCLAMP_MAX to reflect the fact the rq has
+a task in this range.
+
+The rq then keeps track of its current effective uclamp value for each
+uclamp_id.
+
+When a task p is enqueued, the rq value changes to:
+
+::
+
+        // update bucket logic goes here
+        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
+        // repeat for UCLAMP_MAX
+
+Similarly, when p is dequeued the rq value changes to:
+
+::
+
+        // update bucket logic goes here
+        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
+        // repeat for UCLAMP_MAX
+
+When all buckets are empty, the rq uclamp values are reset to system
+defaults. See :ref:`section 3.4 <uclamp-default-values>` for details on
+default values.
+
+
+2.2. Max aggregation
+--------------------
+
+Util clamp is tuned to honour the request for the task that requires the
+highest performance point.
+
+When multiple tasks are attached to the same rq, then util clamp must make
+sure the task that needs the highest performance point gets it even if
+there's another task that doesn't need it or is disallowed from reaching
+this point.
+
+For example, if there are multiple tasks attached to an rq with the
+following values:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = 300
+        p0->uclamp[UCLAMP_MAX] = 900
+
+        p1->uclamp[UCLAMP_MIN] = 500
+        p1->uclamp[UCLAMP_MAX] = 500
+
+then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
+and UCLAMP_MAX become:
+
+::
+
+        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
+        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
+
+As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
+aggregation is the cause of one of the limitations when using util clamp,
+in particular for the UCLAMP_MAX hint when user space would like to save
+power.
+
+2.3. Hierarchical aggregation
+-----------------------------
+
+As stated earlier, util clamp is a property of every task in the system.
+But the actual applied (effective) value can be influenced by more than
+just the request made by the task or another actor on its behalf
+(middleware library).
+
+The effective util clamp value of any task is restricted as follows:
+
+ 1. By the uclamp settings defined by the cgroup CPU controller it is
+    attached to, if any.
+ 2. The restricted value in (1) is then further restricted by the system
+    wide uclamp settings.
+
+:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will
+expand further on that.
+
+For now, suffice it to say that if a task makes a request, its actual
+effective value will have to adhere to some restrictions imposed by cgroup
+and system wide settings.
+
+The system will still accept the request even if it effectively will be
+beyond the constraints, but as soon as the task moves to a different cgroup
+or a sysadmin modifies the system settings, the request will be satisfied
+only if it is within the new constraints.
+
+In other words, this aggregation will not cause an error when a task
+changes its uclamp values, but rather the system may not be able to satisfy
+requests based on those factors.
+
+2.4. Range
+----------
+
+A uclamp performance request has the range of 0 to 1024 inclusive.
+
+For the cgroup interface a percentage is used (that is, 0 to 100
+inclusive). Just like other cgroup interfaces, you can use 'max' instead
+of 100.
+
+.. _uclamp-interfaces:
+
+3. Interfaces
+=============
+
+3.1. Per task interface
+-----------------------
+
+The sched_setattr() syscall was extended to accept two new fields:
+
+* sched_util_min: requests the minimum performance point the system should
+  run at when this task is running, i.e. the lower performance bound.
+* sched_util_max: requests the maximum performance point the system should
+  run at when this task is running, i.e. the upper performance bound.
+
+For example, the following scenario has 40% to 80% utilization constraints:
+
+::
+
+        attr->sched_util_min = 40% * 1024;
+        attr->sched_util_max = 80% * 1024;
+
+When task @p is running, **the scheduler should try its best to ensure it
+starts at 40% performance level**. If the task runs for a long enough time
+so that its actual utilization goes above 80%, the utilization, or
+performance level, will be capped.
+
+The special value -1 is used to reset the uclamp settings to the system
+default.
+
+Note that resetting the uclamp value to system default using -1 is not the
+same as manually setting the uclamp value to the system default. This
+distinction is important because as we shall see in the system interfaces,
+the default value for RT could be changed. SCHED_NORMAL/OTHER might gain
+similar knobs too in the future.
+
+3.2. cgroup interface
+---------------------
+
+There are two uclamp related values in the CPU cgroup controller:
+
+* cpu.uclamp.min
+* cpu.uclamp.max
+
+When a task is attached to a CPU controller, its uclamp values will be
+impacted as follows:
+
+* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of
+  cgroup v2 documentation <cgroupv2-protections-distributor>`.
+
+  If a task uclamp_min value is lower than cpu.uclamp.min, then the task
+  will inherit the cgroup cpu.uclamp.min value.
+
+  In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
+  parent).
+
+* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
+  documentation <cgroupv2-limits-distributor>`.
+
+  If a task uclamp_max value is higher than cpu.uclamp.max, then the task
+  will inherit the cgroup cpu.uclamp.max value.
+
+  In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
+  parent).
+
+For example, given the following parameters:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = // system default;
+        p0->uclamp[UCLAMP_MAX] = // system default;
+
+        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024;
+
+        cgroup0->cpu.uclamp.min = 20% * 1024;
+        cgroup0->cpu.uclamp.max = 60% * 1024;
+
+        cgroup1->cpu.uclamp.min = 60% * 1024;
+        cgroup1->cpu.uclamp.max = 100% * 1024;
+
+when p0 and p1 are attached to cgroup0, the values become:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
+        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;
+
+        p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+when p0 and p1 are attached to cgroup1, these instead become:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;
+
+        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+Note that the cgroup interface allows the cpu.uclamp.max value to be lower
+than cpu.uclamp.min. Other interfaces don't allow that.
+
+3.3. System interface
+---------------------
+
+3.3.1 sched_util_clamp_min
+--------------------------
+
+System wide limit of the allowed UCLAMP_MIN range. By default it is set to
+1024, which means that the permitted effective UCLAMP_MIN range for tasks
+is [0:1024]. By changing it to 512 for example the range reduces to
+[0:512]. This is useful to restrict how much boosting tasks are allowed to
+acquire.
+
+Requests from tasks to go above this knob value will still succeed, but
+they won't be satisfied until the knob value becomes larger than
+p->uclamp[UCLAMP_MIN].
+
+The value must be smaller than or equal to sched_util_clamp_max.
+
+3.3.2 sched_util_clamp_max
+--------------------------
+
+System wide limit of the allowed UCLAMP_MAX range. By default it is set to
+1024, which means that the permitted effective UCLAMP_MAX range for tasks
+is [0:1024].
+
+By changing it to 512 for example the effective allowed range reduces to
+[0:512]. This means that no task can run above 512, which implies that all
+rqs are restricted too. IOW, the whole system is capped to half its
+performance capacity.
+
+This is useful to restrict the overall maximum performance point of the
+system. For example, it can be handy to limit performance when running low
+on battery or when the system wants to limit access to more energy hungry
+performance levels when it's in idle state or the screen is off.
+
+Requests from tasks to go above this knob value will still succeed, but
+they won't be satisfied until the knob value becomes larger than
+p->uclamp[UCLAMP_MAX].
+
+The value must be greater than or equal to sched_util_clamp_min.
+
+.. _uclamp-default-values:
+
+3.4. Default values
+-------------------
+
+By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
+
+::
+
+        p_fair->uclamp[UCLAMP_MIN] = 0
+        p_fair->uclamp[UCLAMP_MAX] = 1024
+
+That is, by default they are allowed to run at the maximum performance
+point of the system but are not boosted. These defaults cannot be changed
+at boot or runtime. No argument was made yet as to why we should provide
+this, but it can be added in the future.
+
+For SCHED_FIFO/SCHED_RR tasks:
+
+::
+
+        p_rt->uclamp[UCLAMP_MIN] = 1024
+        p_rt->uclamp[UCLAMP_MAX] = 1024
+
+That is, by default they're boosted to run at the maximum performance point
+of the system, which retains the historical behavior of RT tasks.
+
+The RT tasks default uclamp_min value can be modified at boot or runtime
+via sysctl. See the section below.
+
+.. _sched-util-clamp-min-rt-default:
+
+3.4.1 sched_util_clamp_min_rt_default
+-------------------------------------
+
+Running RT tasks at the maximum performance point is expensive on battery
+powered devices and not necessary. To allow system developers to offer
+good performance guarantees for these tasks without pushing it all the way
+to the maximum performance point, this sysctl knob allows tuning the best
+boost value to address the system requirement without burning power
+running at the maximum performance point all the time.
+
+Application developers are encouraged to use the per task util clamp
+interface to ensure they are performance and power aware. Ideally this
+knob should be set to 0 by system designers, leaving the task of managing
+performance requirements to the apps.
+
+4. How to use util clamp
+========================
+
+Util clamp promotes the concept of user space assisted power and
+performance management. At the scheduler level there is not enough
+information to make the best decision. However, with util clamp user space
+can hint to the scheduler to make better decisions about task placement
+and frequency selection.
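+
+As a minimal sketch of the per task interface from section 3.1 (error
+handling trimmed; sched_setattr() has no glibc wrapper, so the raw syscall
+is used)::
+
+        #include <linux/sched.h>        /* SCHED_FLAG_UTIL_CLAMP_* */
+        #include <linux/sched/types.h>  /* struct sched_attr */
+        #include <string.h>
+        #include <sys/syscall.h>
+        #include <unistd.h>
+
+        static int set_uclamp(pid_t pid, unsigned int min, unsigned int max)
+        {
+                struct sched_attr attr;
+
+                memset(&attr, 0, sizeof(attr));
+                attr.size = sizeof(attr);
+                /* Only touch the clamps, keep policy and priority as-is. */
+                attr.sched_flags = SCHED_FLAG_KEEP_ALL |
+                                   SCHED_FLAG_UTIL_CLAMP_MIN |
+                                   SCHED_FLAG_UTIL_CLAMP_MAX;
+                attr.sched_util_min = min;      /* e.g. 40% * 1024 = 410 */
+                attr.sched_util_max = max;      /* e.g. 80% * 1024 = 819 */
+
+                return syscall(SYS_sched_setattr, pid, &attr, 0);
+        }
+
+Whether such a request is then satisfied still depends on the cgroup and
+system wide restrictions described in the previous sections.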
+
+Best results are achieved by not making any assumptions about the system
+the application is running on and to use it in conjunction with a feedback
+loop to dynamically monitor and adjust. Ultimately this will allow for a
+better user experience at a better perf/watt.
+
+For some systems and use cases, a static setup will help to achieve good
+results. Portability will be a problem in this case. How much work one can
+do at 100, 200 or 1024 is different for each system. Unless there's a
+specific target system, a static setup should be avoided.
+
+There are enough possibilities to create a whole framework based on util
+clamp or a self contained app that makes use of it directly.
+
+4.1. Boost important and DVFS-latency-sensitive tasks
+-----------------------------------------------------
+
+A GUI task might not be busy enough to warrant driving the frequency high
+when it wakes up. However, it must finish its work within a specific time
+window to deliver the desired user experience. The right frequency it
+requires at wakeup will be system dependent. On some underpowered systems
+it will be high, on other overpowered ones it will be low or 0.
+
+This task can increase its UCLAMP_MIN value every time it misses the
+deadline to ensure on next wake up it runs at a higher performance point.
+It should try to approach the lowest UCLAMP_MIN value that allows it to
+meet its deadline on any particular system to achieve the best possible
+perf/watt for that system.
+
+On heterogeneous systems, it might be important for this task to run on
+a faster CPU.
+
+**Generally it is advised to perceive the input as a performance level or
+point which will imply both task placement and frequency selection**.
+
+4.2. Cap background tasks
+-------------------------
+
+As explained for the Android case in the introduction, any app can lower
+UCLAMP_MAX for some background tasks that don't care about performance but
+could end up being busy and consuming unnecessary system resources.
+
+4.3. Powersave mode
+-------------------
+
+The sched_util_clamp_max system wide interface can be used to limit all
+tasks from operating at the higher performance points which are usually
+energy inefficient.
+
+This is not unique to uclamp as one can achieve the same by reducing the
+max frequency of the cpufreq governor. It can be considered a more
+convenient alternative interface.
+
+4.4. Per-app performance restriction
+------------------------------------
+
+Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX
+for an app every time it is executed to guarantee a minimum performance
+point and/or limit it from draining system power at the cost of reduced
+performance for these apps.
+
+If you want to prevent your laptop from heating up while on the go from
+compiling the kernel, and are happy to sacrifice performance to save power
+but would still like to keep your browser performance intact, uclamp makes
+it possible.
+
+5. Limitations
+==============
+
+.. _uclamp-capping-fail:
+
+5.1. Capping frequency with uclamp_max fails under certain conditions
+---------------------------------------------------------------------
+
+If task p0 is capped to run at 512:
+
+::
+
+        p0->uclamp[UCLAMP_MAX] = 512
+
+and it shares the rq with p1 which is free to run at any performance
+point:
+
+::
+
+        p1->uclamp[UCLAMP_MAX] = 1024
+
+then due to max aggregation the rq will be allowed to reach max
+performance point:
+
+::
+
+        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
+
+Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection
+for the rq will depend on the actual utilization value of the tasks.
+
+If p1 is a small task but p0 is a CPU intensive task, then due to the fact
+that both are running on the same rq, p1 will cause the frequency capping
+to be lifted from the rq although p1, which is allowed to run at any
+performance point, doesn't actually need to run at that frequency.
+
+5.2. UCLAMP_MAX can break PELT (util_avg) signal
+------------------------------------------------
+
+PELT assumes that frequency will always increase as the signals grow to
+ensure there's always some idle time on the CPU. But with UCLAMP_MAX, this
+frequency increase will be prevented, which can lead to no idle time in
+some circumstances. When there's no idle time, a task will be stuck in a
+busy loop, which would result in util_avg being 1024.
+
+Combined with the issue described below, this can lead to unwanted
+frequency spikes when severely capped tasks share the rq with a small non
+capped task.
+
+As an example, if task p0, which has:
+
+::
+
+        p0->util_avg = 300
+        p0->uclamp[UCLAMP_MAX] = 0
+
+wakes up on an idle CPU, then it will run at the min frequency (Fmin) this
+CPU is capable of. The max CPU frequency (Fmax) matters here as well,
+since it designates the shortest computational time to finish the task's
+work on this CPU.
+
+::
+
+        rq->uclamp[UCLAMP_MAX] = 0
+
+If the ratio of Fmax/Fmin is 3, then the maximum value will be:
+
+::
+
+        300 * (Fmax/Fmin) = 900
+
+which indicates the CPU will still see idle time since 900 is < 1024. The
+_actual_ util_avg will not be 900 though, but somewhere between 300 and
+900. As long as there's idle time, p->util_avg updates will be off by some
+margin, but not proportional to Fmax/Fmin.
+
+::
+
+        p0->util_avg = 300 + small_error
+
+Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
+
+::
+
+        300 * (Fmax/Fmin) = 1200
+
+which is higher than 1024 and indicates that the CPU has no idle time.
+When this happens, the _actual_ util_avg will become:
+
+::
+
+        p0->util_avg = 1024
+
+If task p1 wakes up on this CPU, which has:
+
+::
+
+        p1->util_avg = 200
+        p1->uclamp[UCLAMP_MAX] = 1024
+
+then the effective UCLAMP_MAX for the CPU will be 1024 according to the
+max aggregation rule. But since the capped p0 task was running and
+throttled severely, the rq->util_avg will be:
+
+::
+
+        p0->util_avg = 1024
+        p1->util_avg = 200
+
+        rq->util_avg = 1024
+        rq->uclamp[UCLAMP_MAX] = 1024
+
+Hence leading to a frequency spike, since if p0 weren't throttled we
+should get:
+
+::
+
+        p0->util_avg = 300
+        p1->util_avg = 200
+
+        rq->util_avg = 500
+
+and run somewhere near the mid performance point of that CPU, not the Fmax
+we get.
+
+5.3. Schedutil response time issues
+-----------------------------------
+
+schedutil has three limitations:
+
+        1. Hardware takes non-zero time to respond to any frequency change
+           request. On some platforms this can be in the order of a few ms.
+        2. Non fast-switch systems require a worker deadline thread to wake
+           up and perform the frequency change, which adds measurable
+           overhead.
+        3. schedutil rate_limit_us drops any requests during this
+           rate_limit_us window.
+
+If a relatively small task is doing a critical job and requires a certain
+performance point when it wakes up and starts running, then all these
+limitations will prevent it from getting what it wants in the time scale
+it expects.
+
+These limitations are not only impactful when using uclamp, but become
+more prevalent since we no longer gradually ramp up or down. We could
+easily be jumping between frequencies depending on the order tasks wake
+up, and their respective uclamp values.
+
+We regard that as a limitation of the capabilities of the underlying
+system itself.
+
+There is room to improve the behavior of schedutil rate_limit_us, but not
+much to be done for 1 or 2. They are considered hard limitations of the
+system.
diff --git a/Documentation/scheduler/schedutil.rst b/Documentation/scheduler/schedutil.rst
new file mode 100644
index 000000000000..803fba8fc714
--- /dev/null
+++ b/Documentation/scheduler/schedutil.rst
@@ -0,0 +1,172 @@
+=========
+Schedutil
+=========
+
+.. note::
+
+   All this assumes a linear relation between frequency and work capacity,
+   we know this is flawed, but it is the best workable approximation.
+
+
+PELT (Per Entity Load Tracking)
+===============================
+
+With PELT we track some metrics across the various scheduler entities, from
+individual tasks to task-group slices to CPU runqueues. As the basis for
+this we use an Exponentially Weighted Moving Average (EWMA); each period
+(1024us) is decayed such that y^32 = 0.5. That is, the most recent 32ms
+contribute half, while the rest of history contribute the other half.
+
+Specifically:
+
+  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
+
+  ewma(u) = ewma_sum(u) / ewma_sum(1)
+
+Since this is essentially a progression of an infinite geometric series,
+the results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This
+property is key, since it gives the ability to recompose the averages when
+tasks move around.
+
+Note that blocked tasks still contribute to the aggregates (task-group
+slices and CPU runqueues), which reflects their expected contribution when
+they resume running.
+
+Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
+reflects the time an entity spends on the CPU, while 'runnable' reflects
+the time an entity spends on the runqueue. When there is only a single
+task these two metrics are the same, but once there is contention for the
+CPU 'running' will decrease to reflect the fraction of time each task
+spends on the CPU while 'runnable' will increase to reflect the amount of
+contention.
+
+For more detail see: kernel/sched/pelt.c
+
+
+Frequency / CPU Invariance
+==========================
+
+Because consuming the CPU for 50% at 1GHz is not the same as consuming the
+CPU for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as
+running 50% on a big CPU, we allow architectures to scale the time delta
+with two ratios, one Dynamic Voltage and Frequency Scaling (DVFS) ratio
+and one microarch ratio.
+
+For simple DVFS architectures (where software is in full control) we
+trivially compute the ratio as::
+
+            f_cur
+  r_dvfs := -----
+            f_max
+
+For more dynamic systems where the hardware is in control of DVFS we use
+hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this
+ratio. For Intel specifically, we use::
+
+           APERF
+  f_cur := ----- * P0
+           MPERF
+
+             4C-turbo;  if available and turbo enabled
+  f_max := { 1C-turbo;  if turbo enabled
+             P0;        otherwise
+
+                    f_cur
+  r_dvfs := min( 1, ----- )
+                    f_max
+
+We pick 4C turbo over 1C turbo to make it slightly more sustainable.
+
+r_cpu is determined as the ratio of highest performance level of the
+current CPU vs the highest performance level of any other CPU in the
+system.
+
+  r_tot = r_dvfs * r_cpu
+
+The result is that the above 'running' and 'runnable' metrics become
+invariant of DVFS and CPU type. IOW, we can transfer and compare them
+between CPUs.
+
+For more detail see:
+
+ - kernel/sched/pelt.h:update_rq_clock_pelt()
+ - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
+ - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
+
+
+UTIL_EST
+========
+
+Because periodic tasks have their averages decayed while they sleep, even
+though when running their expected utilization will be the same, they
+suffer a (DVFS) ramp-up after they are running again.
+
+To alleviate this (a default enabled option) UTIL_EST drives an Infinite
+Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it
+is highest. UTIL_EST filters to instantly increase and only decay on
+decrease.
+
+A further runqueue wide sum (of runnable tasks) is maintained of:
+
+  util_est := \Sum_t max( t_running, t_util_est_ewma )
+
+For more detail see: kernel/sched/fair.c:util_est_dequeue()
+
+
+UCLAMP
+======
+
+It is possible to set effective u_min and u_max clamps on each CFS or RT
+task; the runqueue keeps a max aggregate of these clamps for all running
+tasks.
+
+For more detail see: include/uapi/linux/sched/types.h
+
+
+Schedutil / DVFS
+================
+
+Every time the scheduler load tracking is updated (task wakeup, task
+migration, time progression) we call out to schedutil to update the
+hardware DVFS state.
+
+The basis is the CPU runqueue's 'running' metric, which per the above is
+the frequency invariant utilization estimate of the CPU. From this we
+compute a desired frequency like::
+
+             max( running, util_est );  if UTIL_EST
+  u_cfs := { running;                   otherwise
+
+               clamp( u_cfs + u_rt , u_min, u_max );  if UCLAMP_TASK
+  u_clamp := { u_cfs + u_rt;                          otherwise
+
+  u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]
+
+  f_des := min( f_max, 1.25 u * f_max )
+
+XXX IO-wait: when the update is due to a task wakeup from IO-completion we
+boost 'u' above.
+
+This frequency is then used to select a P-state/OPP or directly munged
+into a CPPC style request to the hardware.
+
+XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard
+f_min required to satisfy the workload.
+
+Because these callbacks are directly from the scheduler, the DVFS hardware
+interaction should be 'fast' and non-blocking. Schedutil supports
+rate-limiting DVFS requests for when hardware interaction is slow and
+expensive; this reduces effectiveness.
+
+For more information see: kernel/sched/cpufreq_schedutil.c
+
+
+NOTES
+=====
+
+ - On low-load scenarios, where DVFS is most relevant, the 'running'
+   numbers will closely reflect utilization.
+
+ - In saturated scenarios task movement will cause some transient dips,
+   suppose we have a CPU saturated with 4 tasks, then when we migrate a
+   task to an idle CPU, the old CPU will have a 'running' value of 0.75
+   while the new CPU will gain 0.25. This is inevitable and time
+   progression will correct this. XXX do we still guarantee f_max due to
+   no idle-time?
+
+ - Much of the above is about avoiding DVFS dips, and independent DVFS
+   domains having to re-learn / ramp-up when load shifts.
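+
+ - As a worked example of the f_des computation above (all numbers
+   invented for illustration): a CPU with f_max = 2 GHz whose runqueue
+   shows running = 0.4 and util_est = 0.5, with no RT/DL/IRQ contribution
+   and no clamps, gets::
+
+     u     = max(0.4, 0.5) = 0.5
+     f_des = min(2 GHz, 1.25 * 0.5 * 2 GHz) = 1.25 GHz
+
+   The 1.25 factor keeps some idle headroom so utilization can keep
+   growing at the selected frequency.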