aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/admin-guide/cgroup-v2.rst
diff options
context:
space:
mode:
authorDmitry Torokhov <dmitry.torokhov@gmail.com>2019-11-25 13:26:56 -0800
committerDmitry Torokhov <dmitry.torokhov@gmail.com>2019-11-25 13:26:56 -0800
commit976e3645923bdd2fe7893aae33fd7a21098bfb28 (patch)
treed1cb24e4c9743beef15a4796070aca7e2c08228a /Documentation/admin-guide/cgroup-v2.rst
parentRevert "Input: synaptics - enable RMI mode for X1 Extreme 2nd Generation" (diff)
parentInput: synaptics-rmi4 - fix various V4L2 compliance problems in F54 (diff)
downloadlinux-dev-976e3645923bdd2fe7893aae33fd7a21098bfb28.tar.xz
linux-dev-976e3645923bdd2fe7893aae33fd7a21098bfb28.zip
Merge branch 'next' into for-linus
Prepare input updates for 5.5 merge window.
Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst151
1 files changed, 145 insertions, 6 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5361ebec3361 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -615,8 +615,8 @@ on an IO device and is an example of this type.
Protections
-----------
-A cgroup is protected to be allocated upto the configured amount of
-the resource if the usages of all its ancestors are under their
+A cgroup is protected upto the configured amount of the resource
+as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
only upto the amount available to the parent is protected among
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for details.
+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max". i.e. no utilization capping
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+
Memory
------
@@ -1062,7 +1096,10 @@ PAGE_SIZE multiple when read back.
is within its effective min boundary, the cgroup's memory
won't be reclaimed under any conditions. If there is no
unprotected reclaimable memory available, OOM killer
- is invoked.
+ is invoked. Above the effective min boundary (or
+ effective low boundary if it is higher), pages are reclaimed
+ proportionally to the overage, reducing reclaim pressure for
+ smaller overages.
Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
@@ -1084,7 +1121,10 @@ PAGE_SIZE multiple when read back.
Best-effort memory protection. If the memory usage of a
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless memory can be reclaimed
- from unprotected cgroups.
+ from unprotected cgroups. Above the effective low boundary (or
+ effective min boundary if it is higher), pages are reclaimed
+ proportionally to the overage, reducing reclaim pressure for
+ smaller overages.
Effective low boundary is limited by memory.low values of
all ancestor cgroups. If there is memory.low overcommitment
@@ -1435,6 +1475,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
+ io.cost.qos
+ A read-write nested-keyed file with exists only on the root
+ cgroup.
+
+ This file configures the Quality of Service of the IO cost
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
+ currently implements "io.weight" proportional control. Lines
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
+ line for a given device is populated on the first write for
+ the device on "io.cost.qos" or "io.cost.model". The following
+ nested keys are defined.
+
+ ====== =====================================
+ enable Weight-based control enable
+ ctrl "auto" or "user"
+ rpct Read latency percentile [0, 100]
+ rlat Read latency threshold
+ wpct Write latency percentile [0, 100]
+ wlat Write latency threshold
+ min Minimum scaling percentage [1, 10000]
+ max Maximum scaling percentage [1, 10000]
+ ====== =====================================
+
+ The controller is disabled by default and can be enabled by
+ setting "enable" to 1. "rpct" and "wpct" parameters default
+ to zero and the controller uses internal device saturation
+ state to adjust the overall IO rate between "min" and "max".
+
+ When a better control quality is needed, latency QoS
+ parameters can be configured. For example::
+
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
+
+ shows that on sdb, the controller is enabled, will consider
+ the device saturated if the 95th percentile of read completion
+ latencies is above 75ms or write 150ms, and adjust the overall
+ IO issue rate between 50% and 150% accordingly.
+
+ The lower the saturation point, the better the latency QoS at
+ the cost of aggregate bandwidth. The narrower the allowed
+ adjustment range between "min" and "max", the more conformant
+ to the cost model the IO behavior. Note that the IO issue
+ base rate may be far off from 100% and setting "min" and "max"
+ blindly can lead to a significant loss of device capacity or
+ control quality. "min" and "max" are useful for regulating
+ devices which show wide temporary behavior changes - e.g. a
+ ssd which accepts writes at the line speed for a while and
+ then completely stalls for multiple seconds.
+
+ When "ctrl" is "auto", the parameters are controlled by the
+ kernel and may change automatically. Setting "ctrl" to "user"
+ or setting any of the percentile and latency parameters puts
+ it into "user" mode and disables the automatic changes. The
+ automatic mode can be restored by setting "ctrl" to "auto".
+
+ io.cost.model
+ A read-write nested-keyed file with exists only on the root
+ cgroup.
+
+ This file configures the cost model of the IO cost model based
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
+ implements "io.weight" proportional control. Lines are keyed
+ by $MAJ:$MIN device numbers and not ordered. The line for a
+ given device is populated on the first write for the device on
+ "io.cost.qos" or "io.cost.model". The following nested keys
+ are defined.
+
+ ===== ================================
+ ctrl "auto" or "user"
+ model The cost model in use - "linear"
+ ===== ================================
+
+ When "ctrl" is "auto", the kernel may change all parameters
+ dynamically. When "ctrl" is set to "user" or any other
+ parameters are written to, "ctrl" become "user" and the
+ automatic changes are disabled.
+
+ When "model" is "linear", the following model parameters are
+ defined.
+
+ ============= ========================================
+ [r|w]bps The maximum sequential IO throughput
+ [r|w]seqiops The maximum 4k sequential IOs per second
+ [r|w]randiops The maximum 4k random IOs per second
+ ============= ========================================
+
+ From the above, the builtin linear model determines the base
+ costs of a sequential and random IO and the cost coefficient
+ for the IO size. While simple, this model can cover most
+ common device classes acceptably.
+
+ The IO cost model isn't expected to be accurate in absolute
+ sense and is scaled to the device behavior dynamically.
+
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
+ generate device-specific coefficients.
+
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
@@ -2351,8 +2488,10 @@ system performance due to overreclaim, to the point where the feature
becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated
-reserve. A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve. A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.
The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.