Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst 157
1 files changed, 152 insertions, 5 deletions
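
The "io.cost.qos" line format documented in this change (a $MAJ:$MIN device key followed by key=value pairs, with latencies in microseconds) can be read from userspace with a few lines of Python. The following is a minimal sketch and not part of the patch: `parse_qos_line` and `saturation_thresholds_ms` are hypothetical helper names, and the sample line is modeled on the example in the documentation text rather than read from a live system.

```python
def parse_qos_line(line):
    """Split one io.cost.qos line into (device, {key: value}).

    Values are kept as strings; callers convert the ones they need.
    """
    device, *pairs = line.split()
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        params[key] = value
    return device, params


def saturation_thresholds_ms(params):
    """Return (read_ms, write_ms) from rlat/wlat, which are in microseconds."""
    return int(params["rlat"]) / 1000, int(params["wlat"]) / 1000


# Sample line modeled on the documentation example (assumed, not measured).
example = ("8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
           "wpct=95.00 wlat=150000 min=50.00 max=150.00")
device, params = parse_qos_line(example)
read_ms, write_ms = saturation_thresholds_ms(params)
print(device, params["ctrl"], read_ms, write_ms)
```

On a real system the input would come from the root cgroup's io.cost.qos file, which may contain one such line per configured device.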
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cf88c1f98270..0fa8c0e615c2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
-v1 is available under Documentation/cgroup-v1/.
+v1 is available under Documentation/admin-guide/cgroup-v1/.
.. CONTENTS
@@ -705,6 +705,12 @@ Conventions
informational files on the root cgroup which end up showing global
information available elsewhere shouldn't exist.
+- The default time unit is microseconds. If a different unit is ever
+ used, an explicit unit suffix must be present.
+
+- A parts-per quantity should use a percentage decimal with at least
+ a two-digit fractional part, e.g. 13.40.
+
- If a controller implements weight based resource distribution, its
interface file should be named "weight" and have the range [1,
10000] with 100 as the default. The values are chosen to allow
@@ -945,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+In all the above models, cycle distribution is defined only on a temporal
+basis and does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows hinting the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -1008,7 +1021,34 @@ All time durations are in microseconds.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for CPU. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.
+
+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values, similar to sched_setattr(2). This minimum utilization
+ value is used to clamp the task-specific minimum utilization clamp.
+
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max", i.e. no utilization capping.
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values, similar to sched_setattr(2). This maximum utilization
+ value is used to clamp the task-specific maximum utilization clamp.
+
Memory
@@ -1140,6 +1180,11 @@ PAGE_SIZE multiple when read back.
otherwise, a value change in this file generates a file
modified event.
+ Note that all fields in this file are hierarchical and the
+ file modified event can be generated due to an event down the
+ hierarchy. For the local events at the cgroup level see
+ memory.events.local.
+
low
The number of times the cgroup is reclaimed due to
high memory pressure even though its usage is under
@@ -1179,6 +1224,11 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ memory.events.local
+ Similar to memory.events but the fields in the file are local
+ to the cgroup, i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
+
memory.stat
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1339,7 +1389,7 @@ PAGE_SIZE multiple when read back.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for memory. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.
Usage Guidelines
@@ -1419,6 +1469,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
+ io.cost.qos
+ A read-write nested-keyed file which exists only on the root
+ cgroup.
+
+ This file configures the Quality of Service of the IO cost
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
+ currently implements "io.weight" proportional control. Lines
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
+ line for a given device is populated on the first write for
+ the device on "io.cost.qos" or "io.cost.model". The following
+ nested keys are defined.
+
+ ======  =====================================
+ enable  Weight-based control enable
+ ctrl    "auto" or "user"
+ rpct    Read latency percentile [0, 100]
+ rlat    Read latency threshold
+ wpct    Write latency percentile [0, 100]
+ wlat    Write latency threshold
+ min     Minimum scaling percentage [1, 10000]
+ max     Maximum scaling percentage [1, 10000]
+ ======  =====================================
+
+ The controller is disabled by default and can be enabled by
+ setting "enable" to 1. "rpct" and "wpct" parameters default
+ to zero and the controller uses internal device saturation
+ state to adjust the overall IO rate between "min" and "max".
+
+ When a better control quality is needed, latency QoS
+ parameters can be configured. For example::
+
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00
+
+ shows that on sdb, the controller is enabled, will consider
+ the device saturated if the 95th percentile of read completion
+ latencies is above 75ms or that of write completion latencies is
+ above 150ms, and adjusts the overall IO issue rate between 50% and 150% accordingly.
+
+ The lower the saturation point, the better the latency QoS at
+ the cost of aggregate bandwidth. The narrower the allowed
+ adjustment range between "min" and "max", the more conformant
+ to the cost model the IO behavior. Note that the IO issue
+ base rate may be far off from 100% and setting "min" and "max"
+ blindly can lead to a significant loss of device capacity or
+ control quality. "min" and "max" are useful for regulating
+ devices which show wide temporary behavior changes - e.g. an
+ ssd which accepts writes at line speed for a while and
+ then completely stalls for multiple seconds.
+
+ When "ctrl" is "auto", the parameters are controlled by the
+ kernel and may change automatically. Setting "ctrl" to "user"
+ or setting any of the percentile and latency parameters puts
+ it into "user" mode and disables the automatic changes. The
+ automatic mode can be restored by setting "ctrl" to "auto".
+
+ io.cost.model
+ A read-write nested-keyed file which exists only on the root
+ cgroup.
+
+ This file configures the cost model of the IO cost model based
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
+ implements "io.weight" proportional control. Lines are keyed
+ by $MAJ:$MIN device numbers and not ordered. The line for a
+ given device is populated on the first write for the device on
+ "io.cost.qos" or "io.cost.model". The following nested keys
+ are defined.
+
+ =====  ================================
+ ctrl   "auto" or "user"
+ model  The cost model in use - "linear"
+ =====  ================================
+
+ When "ctrl" is "auto", the kernel may change all parameters
+ dynamically. When "ctrl" is set to "user" or any other
+ parameter is written to, "ctrl" becomes "user" and the
+ automatic changes are disabled.
+
+ When "model" is "linear", the following model parameters are
+ defined.
+
+ =============  ========================================
+ [r|w]bps       The maximum sequential IO throughput
+ [r|w]seqiops   The maximum 4k sequential IOs per second
+ [r|w]randiops  The maximum 4k random IOs per second
+ =============  ========================================
+
+ From the above, the builtin linear model determines the base
+ costs of a sequential and random IO and the cost coefficient
+ for the IO size. While simple, this model can cover most
+ common device classes acceptably.
+
+ The IO cost model isn't expected to be accurate in an absolute
+ sense and is scaled to the device behavior dynamically.
+
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
+ generate device-specific coefficients.
+
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
@@ -1482,7 +1629,7 @@ IO Interface Files
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for IO. See
- Documentation/accounting/psi.txt for details.
+ Documentation/accounting/psi.rst for details.
Writeback
@@ -2108,7 +2255,7 @@ following two functions.
a queue (device) has been associated with the bio and
before submission.
- wbc_account_io(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @page, @bytes)
Should be called for each data segment being written out.
While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most