diff options
author | 2025-05-27 20:59:53 -0700 | |
---|---|---|
committer | 2025-05-27 20:59:53 -0700 | |
commit | 3b66e6b3c098646f75c54e269dd963e2281c555c (patch) | |
tree | 79568dd6b16cac32b86ff646ddaaafd17dbb826e /Documentation/admin-guide | |
parent | Merge tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (diff) | |
parent | sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier (diff) | |
download | linux-rng-3b66e6b3c098646f75c54e269dd963e2281c555c.tar.xz linux-rng-3b66e6b3c098646f75c54e269dd963e2281c555c.zip |
Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup rstat shared the tracking tree across all controllers with the
rationale being that a cgroup which is using one resource is likely
to be using other resources at the same time (ie. if something is
allocating memory, it's probably consuming CPU cycles).
However, this turned out to not scale very well especially with memcg
using rstat for internal operations which made memcg stat read and
flush patterns substantially different from other controllers. JP
Kobryn split the rstat tree per controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly.
Convert them to use a notifier chain instead so that other usages can
be added easily. The two of the patches which implement this are
mislabeled as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates
- Documentation updates
* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
sched_ext: Introduce cgroup_lifetime_notifier
cgroup: Minor reorganization of cgroup_create()
cgroup, docs: cpu controller's interaction with various scheduling policies
cgroup, docs: convert space indentation to tab indentation
cgroup: avoid per-cpu allocation of size zero rstat cpu locks
cgroup, docs: be specific about bandwidth control of rt processes
cgroup: document the rstat per-cpu initialization
cgroup: helper for checking rstat participation of css
cgroup: use subsystem-specific rstat locks to avoid contention
cgroup: use separate rstat trees for each subsystem
cgroup: compare css to cgroup::self in helper for distingushing css
cgroup: warn on rstat usage by early init subsystems
cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
cgroup/rstat: Improve cgroup_rstat_push_children() documentation
cgroup: fix goto ordering in cgroup_init()
cgroup: fix pointer check in css_rstat_init()
cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
cgroup/cpuset: Always use cpu_active_mask
...
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 79 |
1 files changed, 57 insertions, 22 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 9e7de8e70048..1edc26622594 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1076,7 +1076,7 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 cpu controller doesn't yet fully support the control of +WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group scheduling of realtime processes, the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system @@ -1095,19 +1095,34 @@ realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy and the underlying scheduler. From the point of view of the cpu controller, +processes can be categorized as follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +For details on when a process is under the fair-class scheduler or a BPF scheduler, +check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`. + +For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. - It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1125,6 +1140,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1137,6 +1156,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1149,43 +1172,55 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. Shows pressure stall information for CPU. See :ref:`Documentation/accounting/psi.rst <psi>` for details. + This file accounts for all the processes in the cgroup. + cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. + + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping + + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1197,7 +1232,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ |