<feed xmlns='http://www.w3.org/2005/Atom'>
<title>wireguard-linux/kernel/cgroup, branch update-toolchain</title>
<subtitle>WireGuard for the Linux kernel</subtitle>
<id>https://git.zx2c4.com/wireguard-linux/atom/kernel/cgroup?h=update-toolchain</id>
<link rel='self' href='https://git.zx2c4.com/wireguard-linux/atom/kernel/cgroup?h=update-toolchain'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/'/>
<updated>2024-07-15T04:04:03Z</updated>
<entry>
<title>Merge branch 'for-6.10-fixes' into for-6.11</title>
<updated>2024-07-15T04:04:03Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2024-07-15T04:04:03Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=9283ff5be1510a35356656a6c1efe14f765c936a'/>
<id>urn:sha1:9283ff5be1510a35356656a6c1efe14f765c936a</id>
<content type='text'>
</content>
</entry>
<entry>
<title>cgroup/misc: Introduce misc.events.local</title>
<updated>2024-07-12T16:45:23Z</updated>
<author>
<name>Xiu Jianfeng</name>
<email>xiujianfeng@huawei.com</email>
</author>
<published>2024-07-11T10:14:57Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=6a26f9c68901797261bc145975a02f85be0c1d8f'/>
<id>urn:sha1:6a26f9c68901797261bc145975a02f85be0c1d8f</id>
<content type='text'>
Currently the event counting provided by misc.events is hierarchical,
it's not practical if user is only concerned with events of a
specified cgroup. Therefore, introduce misc.events.local collect events
specific to the given cgroup.

This is analogous to memory.events.local and pids.events.local.

Signed-off-by: Xiu Jianfeng &lt;xiujianfeng@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/rstat: add force idle show helper</title>
<updated>2024-07-05T18:32:08Z</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2024-07-04T14:01:19Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=b824766504e49f3fdcbb8c722e70996a78c3636e'/>
<id>urn:sha1:b824766504e49f3fdcbb8c722e70996a78c3636e</id>
<content type='text'>
In the function cgroup_base_stat_cputime_show, there are five
instances of #ifdef, which makes the code not concise.
To address this, add the function cgroup_force_idle_show
to make the code more succinct.

Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Protect css-&gt;cgroup write under css_set_lock</title>
<updated>2024-07-03T18:59:06Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2024-07-03T18:52:29Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=57b56d16800e8961278ecff0dc755d46c4575092'/>
<id>urn:sha1:57b56d16800e8961278ecff0dc755d46c4575092</id>
<content type='text'>
The writing of css-&gt;cgroup associated with the cgroup root in
rebind_subsystems() is currently protected only by cgroup_mutex.
However, the reading of css-&gt;cgroup in both proc_cpuset_show() and
proc_cgroup_show() is protected just by css_set_lock. That makes the
readers susceptible to racing problems like data tearing or caching.
It is also a problem that can be reported by KCSAN.

This can be fixed by using READ_ONCE() and WRITE_ONCE() to access
css-&gt;cgroup. Alternatively, the writing of css-&gt;cgroup can be moved
under css_set_lock as well which is done by this patch.

Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/misc: Introduce misc.peak</title>
<updated>2024-07-03T18:08:43Z</updated>
<author>
<name>Xiu Jianfeng</name>
<email>xiujianfeng@huawei.com</email>
</author>
<published>2024-07-03T00:36:46Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=1028f391d5f9d4248e2f49193e6de2516ad630f8'/>
<id>urn:sha1:1028f391d5f9d4248e2f49193e6de2516ad630f8</id>
<content type='text'>
Introduce misc.peak to record the historical maximum usage of the
resource, as in some scenarios the value of misc.max could be
adjusted based on the peak usage of the resource.

Signed-off-by: Xiu Jianfeng &lt;xiujianfeng@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Prevent UAF in proc_cpuset_show()</title>
<updated>2024-06-28T17:10:31Z</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2024-06-28T01:36:04Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=1be59c97c83ccd67a519d8a49486b3a8a73ca28a'/>
<id>urn:sha1:1be59c97c83ccd67a519d8a49486b3a8a73ca28a</id>
<content type='text'>
An UAF can happen when /proc/cpuset is read as reported in [1].

This can be reproduced by the following methods:
1.add an mdelay(1000) before acquiring the cgroup_lock In the
 cgroup_path_ns function.
2.$cat /proc/&lt;pid&gt;/cpuset   repeatly.
3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
$umount /sys/fs/cgroup/cpuset/   repeatly.

The race that cause this bug can be shown as below:

(umount)		|	(cat /proc/&lt;pid&gt;/cpuset)
css_release		|	proc_cpuset_show
css_release_work_fn	|	css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn	|	cgroup_path_ns(css-&gt;cgroup, ...);
cgroup_destroy_root	|	mutex_lock(&amp;cgroup_mutex);
rebind_subsystems	|
cgroup_free_root 	|
			|	// cgrp was freed, UAF
			|	cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgrp
will point to &amp;cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
&amp;cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgrp will be rebound to &amp;cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, there are cases
where the cgroup_root allocated by setting up the root for cgroup v1
is cached. This could lead to a Use-After-Free (UAF) if it is
subsequently freed. The descendant cgroups of cgroup v1 can only be
freed after the css is released. However, the css of the root will never
be released, yet the cgroup_root should be freed when it is unmounted.
This means that obtaining a reference to the css of the root does
not guarantee that css.cgrp-&gt;root will not be freed.

Fix this problem by using rcu_read_lock in proc_cpuset_show().
As cgroup_root is kfree_rcu after commit d23b5c577715
("cgroup: Make operations on the cgroup root_list RCU safe"),
css-&gt;cgroup won't be freed during the critical section.
To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to
replace task_get_css with task_css.

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus</title>
<updated>2024-06-19T17:37:38Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2024-06-17T14:39:44Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=737bb142a00d53de3743ae389732721b3b9f0191'/>
<id>urn:sha1:737bb142a00d53de3743ae389732721b3b9f0191</id>
<content type='text'>
The "cpuset.cpus.exclusive.effective" value is currently limited to a
subset of its "cpuset.cpus". This makes the exclusive CPUs distribution
hierarchy subsumed within the larger "cpuset.cpus" hierarchy. We have to
decide on what CPUs are used locally and what CPUs can be passed down as
exclusive CPUs down the hierarchy and combine them into "cpuset.cpus".

The advantage of the current scheme is to have only one hierarchy to
worry about. However, it make it harder to use as all the "cpuset.cpus"
values have to be properly set along the way down to the designated remote
partition root. It also makes it more cumbersome to find out what CPUs
can be used locally.

Make creation of remote partition simpler by breaking the
dependency of "cpuset.cpus.exclusive" on "cpuset.cpus" and make
them independent entities. Now we have two separate hierarchies -
one for setting "cpuset.cpus.effective" and the other one for setting
"cpuset.cpus.exclusive.effective". We may not need to set "cpuset.cpus"
when we activate a partition root anymore.

Also update Documentation/admin-guide/cgroup-v2.rst and cpuset.c comment
to document this change.

Suggested-by: Petr Malat &lt;oss@malat.biz&gt;
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition</title>
<updated>2024-06-19T17:37:37Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2024-06-17T14:39:43Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=fe8cd2736e75c8ca3aed1ef181a834e41dc5310f'/>
<id>urn:sha1:fe8cd2736e75c8ca3aed1ef181a834e41dc5310f</id>
<content type='text'>
The CS_CPU_EXCLUSIVE flag is currently set whenever cpuset.cpus.exclusive
is set to make sure that the exclusivity test will be run to ensure its
exclusiveness. At the same time, this flag can be changed whenever the
partition root state is changed. For example, the CS_CPU_EXCLUSIVE flag
will be reset whenever a partition root becomes invalid. This makes
using CS_CPU_EXCLUSIVE to ensure exclusiveness a bit fragile.

The current scheme also makes setting up a cpuset.cpus.exclusive
hierarchy to enable remote partition harder as cpuset.cpus.exclusive
cannot overlap with any cpuset.cpus of sibling cpusets if their
cpuset.cpus.exclusive aren't set.

Solve these issues by deferring the setting of CS_CPU_EXCLUSIVE flag
until the cpuset become a valid partition root while adding new checks
in validate_change() to ensure that cpuset.cpus.exclusive of sibling
cpusets cannot overlap.

An additional check is also added to validate_change() to make sure that
cpuset.cpus of one cpuset cannot be a subset of cpuset.cpus.exclusive
of a sibling cpuset to avoid the problem that none of those CPUs will
be available when these exclusive CPUs are extracted out to a newly
enabled partition root. The Documentation/admin-guide/cgroup-v2.rst
file is updated to document the new constraints.

Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Fix remote root partition creation problem</title>
<updated>2024-06-19T17:37:37Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2024-06-17T14:39:41Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=ccac8e8de99cbcf5e7f53251ebce917bf7bcc29c'/>
<id>urn:sha1:ccac8e8de99cbcf5e7f53251ebce917bf7bcc29c</id>
<content type='text'>
Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, doesn't
take into account this new behavior and hence will fail to create the
sched domain needed for a remote root (non-isolated) partition.

There are two issues related to remote partition support. First of
all, generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is non-zero. The
later condition isn't quite correct for remote partitions as nr_subparts
just shows the number of local child partitions underneath it. There
can be no local child partition under top_cpuset even if there are
remote partitions further down the hierarchy. Fix that by checking
for subpartitions_cpus which contains exclusive CPUs allocated to both
local and remote partitions.

Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough as remote partition does not need to
have a partition root parent. Fix this problem by breaking csa[] array
generation loop of generate_sched_domains() into v1 and v2 specific parts
and checking a cpuset's exclusive_cpus before skipping its subtree in
the v2 case.

Also simplify generate_sched_domains() for cgroup v2 as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes other than a different
way to create an isolated partition are supported.

Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()</title>
<updated>2024-06-19T17:18:41Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2024-06-17T14:31:52Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/wireguard-linux/commit/?id=6fe960147e08e931dda1ef325dc69320e544aa73'/>
<id>urn:sha1:6fe960147e08e931dda1ef325dc69320e544aa73</id>
<content type='text'>
cgroup_exit() needs to do this only if the exiting task is a leader and it
is not the last live thread.  The patch doesn't use delay_group_leader(),
atomic_read(signal-&gt;live) matches the code css_task_iter_advance() more.

cgroup_release() can now check list_empty(task-&gt;cg_list) before it takes
css_set_lock and calls ss_set_skip_task_iters().

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
