From 4ec22e9c5a90e3809dd52014d5d239af8831a520 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:35 -0500 Subject: cpuset: Enable cpuset controller in default hierarchy Given the fact that thread mode had been merged into 4.14, it is now time to enable cpuset to be used in the default hierarchy (cgroup v2) as it is clearly threaded. The cpuset controller had experienced feature creep since its introduction more than a decade ago. Besides the core cpus and mems control files to limit cpus and memory nodes, there are a bunch of additional features that can be controlled from the userspace. Some of the features are of doubtful usefulness and may not be actively used. This patch enables cpuset controller in the default hierarchy with a minimal set of features, namely just the cpus and mems and their effective_* counterparts. We can certainly add more features to the default hierarchy in the future if there is a real user need for them later on. Alternatively, with the unified hierarchy, it may make more sense to move some of those additional cpuset features, if desired, to memory controller or maybe to the cpu controller instead of staying with cpuset. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++-- 1 file changed, 104 insertions(+), 5 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 476722b7b636..01b70f69304e 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/. 5-3-3-2. IO Latency Interface Files 5-4. PID 5-4-1. PID Interface Files - 5-5. Device - 5-6. RDMA - 5-6-1. RDMA Interface Files - 5-7. Misc - 5-7-1. perf_event + 5-5. Cpuset + 5.5-1. Cpuset Interface Files + 5-6. Device + 5-7. RDMA + 5-7-1. 
RDMA Interface Files + 5-8. Misc + 5-8-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -1610,6 +1612,103 @@ through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated. +Cpuset +------ + +The "cpuset" controller provides a mechanism for constraining +the CPU and memory node placement of tasks to only the resources +specified in the cpuset interface files in a task's current cgroup. +This is especially valuable on large NUMA systems where placing jobs +on properly sized subsets of the systems with careful processor and +memory placement to reduce cross-node memory access and contention +can improve overall system performance. + +The "cpuset" controller is hierarchical. That means the controller +cannot use CPUs or memory nodes not allowed in its parent. + + +Cpuset Interface Files +~~~~~~~~~~~~~~~~~~~~~~ + + cpuset.cpus + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested CPUs to be used by tasks within this + cgroup. The actual list of CPUs to be granted, however, is + subjected to constraints imposed by its parent and can differ + from the requested CPUs. + + The CPU numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.cpus + 0-4,6,8-10 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.cpus" or all the available CPUs if none is found. + + The value of "cpuset.cpus" stays constant until the next update + and won't be affected by any CPU hotplug events. + + cpuset.cpus.effective + A read-only multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the onlined CPUs that are actually granted to this + cgroup by its parent. These CPUs are allowed to be used by + tasks within the current cgroup. 
+ + If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows + all the CPUs from the parent cgroup that can be available to + be used by this cgroup. Otherwise, it should be a subset of + "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" + can be granted. In this case, it will be treated just like an + empty "cpuset.cpus". + + Its value will be affected by CPU hotplug events. + + cpuset.mems + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested memory nodes to be used by tasks within + this cgroup. The actual list of memory nodes granted, however, + is subjected to constraints imposed by its parent and can differ + from the requested memory nodes. + + The memory node numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.mems + 0-1,3 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.mems" or all the available memory nodes if none + is found. + + The value of "cpuset.mems" stays constant until the next update + and won't be affected by any memory nodes hotplug events. + + cpuset.mems.effective + A read-only multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the onlined memory nodes that are actually granted to + this cgroup by its parent. These memory nodes are allowed to + be used by tasks within the current cgroup. + + If "cpuset.mems" is empty, it shows all the memory nodes from the + parent cgroup that will be available to be used by this cgroup. + Otherwise, it should be a subset of "cpuset.mems" unless none of + the memory nodes listed in "cpuset.mems" can be granted. In this + case, it will be treated just like an empty "cpuset.mems". + + Its value will be affected by memory nodes hotplug events. 
+ + Device controller ----------------- -- cgit v1.2.3-59-g8ed1b From 5776ceccd4de2a53dec740422a409e9e588c5a70 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:43 -0500 Subject: cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Because of the fact that setting the "cpuset.sched.partition" in a direct child of root can remove CPUs from the root's effective CPU list, it makes sense to know what CPUs are left in the root cgroup for scheduling purpose. So the "cpuset.cpus.effective" control file is now exposed in the v2 cgroup root. For consistency, the "cpuset.mems.effective" control file is exposed as well. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 4 ++-- kernel/cgroup/cpuset.c | 2 -- 2 files changed, 2 insertions(+), 4 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 01b70f69304e..595b0757ad2b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1653,7 +1653,7 @@ Cpuset Interface Files and won't be affected by any CPU hotplug events. cpuset.cpus.effective - A read-only multiple values file which exists on non-root + A read-only multiple values file which exists on all cpuset-enabled cgroups. It lists the onlined CPUs that are actually granted to this @@ -1693,7 +1693,7 @@ Cpuset Interface Files and won't be affected by any memory nodes hotplug events. cpuset.mems.effective - A read-only multiple values file which exists on non-root + A read-only multiple values file which exists on all cpuset-enabled cgroups. 
 It lists the onlined memory nodes that are actually granted to diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 3960de7a75cc..fc1a809cd5bb 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2574,14 +2574,12 @@ static struct cftype dfl_files[] = { .name = "cpus.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_CPULIST, - .flags = CFTYPE_NOT_ON_ROOT, }, { .name = "mems.effective", .seq_show = cpuset_common_seq_show, .private = FILE_EFFECTIVE_MEMLIST, - .flags = CFTYPE_NOT_ON_ROOT, }, { -- cgit v1.2.3-59-g8ed1b From 90e92f2d557ee3b0883a3bee76150b9026dba192 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 8 Nov 2018 10:08:45 -0500 Subject: cpuset: Add documentation about the new "cpuset.sched.partition" flag The cgroup-v2.rst file is updated to document the purpose of the new "cpuset.sched.partition" flag and its usage. Signed-off-by: Waiman Long Acked-by: Peter Zijlstra (Intel) Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 73 +++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 595b0757ad2b..f83a5231bbe3 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1708,6 +1708,79 @@ Cpuset Interface Files Its value will be affected by memory nodes hotplug events. + cpuset.sched.partition + A read-write single value file which exists on non-root + cpuset-enabled cgroups. This flag is owned by the parent cgroup + and is not delegatable. + + It accepts only the following input values when written to. 
+ + "root" or "1" - a paritition root + "member" or "0" - a non-root member of a partition + + When set to be a partition root, the current cgroup is the + root of a new partition or scheduling domain that comprises + itself and all its descendants except those that are separate + partition roots themselves and their descendants. The root + cgroup is always a partition root. + + There are constraints on where a partition root can be set. + It can only be set in a cgroup if all the following conditions + are true. + + 1) The "cpuset.cpus" is not empty and the list of CPUs are + exclusive, i.e. they are not shared by any of its siblings. + 2) The parent cgroup is a partition root. + 3) The "cpuset.cpus" is also a proper subset of the parent's + "cpuset.cpus.effective". + 4) There is no child cgroups with cpuset enabled. This is for + eliminating corner cases that have to be handled if such a + condition is allowed. + + Setting it to partition root will take the CPUs away from the + effective CPUs of the parent cgroup. Once it is set, this + file cannot be reverted back to "member" if there are any child + cgroups with cpuset enabled. + + A parent partition cannot distribute all its CPUs to its + child partitions. There must be at least one cpu left in the + parent partition. + + Once becoming a partition root, changes to "cpuset.cpus" is + generally allowed as long as the first condition above is true, + the change will not take away all the CPUs from the parent + partition and the new "cpuset.cpus" value is a superset of its + children's "cpuset.cpus" values. + + Sometimes, external factors like changes to ancestors' + "cpuset.cpus" or cpu hotplug can cause the state of the partition + root to change. On read, the "cpuset.sched.partition" file + can show the following values. 
+ + "member" Non-root member of a partition + "root" Partition root + "root invalid" Invalid partition root + + It is a partition root if the first 2 partition root conditions + above are true and at least one CPU from "cpuset.cpus" is + granted by the parent cgroup. + + A partition root can become invalid if none of CPUs requested + in "cpuset.cpus" can be granted by the parent cgroup or the + parent cgroup is no longer a partition root itself. In this + case, it is not a real partition even though the restriction + of the first partition root condition above will still apply. + The cpu affinity of all the tasks in the cgroup will then be + associated with CPUs in the nearest ancestor partition. + + An invalid partition root can be transitioned back to a + real partition root if at least one of the requested CPUs + can now be granted by its parent. In this case, the cpu + affinity of all the tasks in the formerly invalid partition + will be associated to the CPUs of the newly formed partition. + Changing the partition state of an invalid partition root to + "member" is always allowed even if child cpusets are present. + Device controller ----------------- -- cgit v1.2.3-59-g8ed1b From b1e3aeb11c5e86ee0988a038c4e7682d6beaa977 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 13 Nov 2018 12:03:33 -0800 Subject: cpuset: Minor cgroup2 interface updates * Rename the partition file from "cpuset.sched.partition" to "cpuset.cpus.partition". * When writing to the partition file, drop "0" and "1" and only accept "member" and "root". 
Signed-off-by: Tejun Heo Cc: Peter Zijlstra (Intel) Cc: Waiman Long --- Documentation/admin-guide/cgroup-v2.rst | 6 +++--- kernel/cgroup/cpuset.c | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f83a5231bbe3..07e06136a550 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1708,15 +1708,15 @@ Cpuset Interface Files Its value will be affected by memory nodes hotplug events. - cpuset.sched.partition + cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable. It accepts only the following input values when written to. - "root" or "1" - a paritition root - "member" or "0" - a non-root member of a partition + "root" - a paritition root + "member" - a non-root member of a partition When set to be a partition root, the current cgroup is the root of a new partition or scheduling domain that comprises diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index b897314bab53..1151e93d71b6 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2468,11 +2468,11 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, buf = strstrip(buf); /* - * Convert "root"/"1" to 1, and convert "member"/"0" to 0. + * Convert "root" to ENABLED, and convert "member" to DISABLED. 
*/ - if (!strcmp(buf, "root") || !strcmp(buf, "1")) + if (!strcmp(buf, "root")) val = PRS_ENABLED; - else if (!strcmp(buf, "member") || !strcmp(buf, "0")) + else if (!strcmp(buf, "member")) val = PRS_DISABLED; else return -EINVAL; @@ -2631,7 +2631,7 @@ static struct cftype dfl_files[] = { }, { - .name = "sched.partition", + .name = "cpus.partition", .seq_show = sched_partition_show, .write = sched_partition_write, .private = FILE_PARTITION_ROOT, -- cgit v1.2.3-59-g8ed1b From 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 28 Dec 2018 10:31:07 -0800 Subject: cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param It can be useful to inhibit all cgroup1 hierarchies especially during transition and for debugging. cgroup_no_v1 can block hierarchies with controllers which leaves out the named hierarchies. Expand it to cover the named hierarchies so that "cgroup_no_v1=all,named" disables all cgroup1 hierarchies. Signed-off-by: Tejun Heo Suggested-by: Marcin Pawlowski Signed-off-by: Tejun Heo --- Documentation/admin-guide/kernel-parameters.txt | 8 ++++++-- kernel/cgroup/cgroup-v1.c | 14 +++++++++++++- 2 files changed, 19 insertions(+), 3 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 81d1d5a74728..8b7652449228 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -486,10 +486,14 @@ cut the overhead, others just disable the usage. So only cgroup_disable=memory is actually worthy} - cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1 - Format: { controller[,controller...] | "all" } + cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1 + Format: { { controller | "all" | "named" } + [,{ controller | "all" | "named" }...] 
} Like cgroup_disable, but only applies to cgroup v1; the blacklisted controllers remain available in cgroup2. + "all" blacklists all controllers and "named" disables + named mounts. Specifying both "all" and "named" disables + all v1 hierarchies. cgroup.memory= [KNL] Pass options to the cgroup memory controller. Format: diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 51063e7a93c2..583b969b0c0e 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -27,6 +27,9 @@ /* Controllers blocked by the commandline in v1 */ static u16 cgroup_no_v1_mask; +/* disable named v1 mounts */ +static bool cgroup_no_v1_named; + /* * pidlist destructions need to be flushed on cgroup destruction. Use a * separate workqueue as flush domain. @@ -963,6 +966,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) } if (!strncmp(token, "name=", 5)) { const char *name = token + 5; + + /* blocked by boot param? */ + if (cgroup_no_v1_named) + return -ENOENT; /* Can't specify an empty name */ if (!strlen(name)) return -EINVAL; @@ -1292,7 +1299,12 @@ static int __init cgroup_no_v1(char *str) if (!strcmp(token, "all")) { cgroup_no_v1_mask = U16_MAX; - break; + continue; + } + + if (!strcmp(token, "named")) { + cgroup_no_v1_named = true; + continue; } for_each_subsys(ss, i) { -- cgit v1.2.3-59-g8ed1b
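
Note on the cgroup_no_v1 parsing change in the last patch: because "all" now uses "continue" instead of "break", tokens that follow it (such as "named") are still processed. The following is a minimal userspace sketch of that token loop, not the kernel code itself. The struct no_v1_opts type, the demo_controllers table, and the parse_no_v1() helper are hypothetical stand-ins for the kernel's static variables and for_each_subsys() iteration, and the sketch splits tokens with strchr() rather than reproducing the kernel's own tokenizing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cgroup_no_v1_mask and cgroup_no_v1_named. */
struct no_v1_opts {
	uint16_t mask;	/* one bit per blocked v1 controller */
	bool named;	/* true when named v1 mounts are blocked */
};

/* Hypothetical controller table; the kernel walks its real subsystems. */
static const char *const demo_controllers[] = {
	"cpuset", "cpu", "memory", "pids",
};

/* Parse a comma-separated cgroup_no_v1= style argument into *opts. */
static void parse_no_v1(const char *arg, struct no_v1_opts *opts)
{
	char buf[256];
	char *token, *next;
	size_t i;

	assert(arg != NULL && opts != NULL);

	opts->mask = 0;
	opts->named = false;

	/* Work on a private, NUL-terminated copy so it can be split in place. */
	strncpy(buf, arg, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (token = buf; token != NULL; token = next) {
		char *comma = strchr(token, ',');

		if (comma != NULL) {
			*comma = '\0';
			next = comma + 1;
		} else {
			next = NULL;
		}

		if (*token == '\0')
			continue;	/* skip empty tokens such as "a,,b" */

		if (strcmp(token, "all") == 0) {
			opts->mask = UINT16_MAX;
			continue;	/* keep going so a later "named" is seen */
		}

		if (strcmp(token, "named") == 0) {
			opts->named = true;
			continue;
		}

		/* Otherwise treat the token as a controller name. */
		for (i = 0; i < sizeof(demo_controllers) / sizeof(demo_controllers[0]); i++) {
			if (strcmp(token, demo_controllers[i]) == 0)
				opts->mask |= (uint16_t)(1u << i);
		}
	}
}
```

With this sketch, parsing "all,named" sets every mask bit and the named flag, which corresponds to the "disables all v1 hierarchies" case described in kernel-parameters.txt. With the pre-patch "break", the "named" token after "all" would never have been reached.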