aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/cgroups
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/cgroups')
-rw-r--r--Documentation/cgroups/00-INDEX18
-rw-r--r--Documentation/cgroups/cgroups.txt36
-rw-r--r--Documentation/cgroups/cpuacct.txt18
-rw-r--r--Documentation/cgroups/cpusets.txt12
-rw-r--r--Documentation/cgroups/devices.txt2
-rw-r--r--Documentation/cgroups/memcg_test.txt22
-rw-r--r--Documentation/cgroups/memory.txt55
-rw-r--r--Documentation/cgroups/resource_counter.txt27
8 files changed, 143 insertions, 47 deletions
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
new file mode 100644
index 000000000000..3f58fa3d6d00
--- /dev/null
+++ b/Documentation/cgroups/00-INDEX
@@ -0,0 +1,18 @@
+00-INDEX
+ - this file
+cgroups.txt
+ - Control Groups definition, implementation details, examples and API.
+cpuacct.txt
+ - CPU Accounting Controller; account CPU usage for groups of tasks.
+cpusets.txt
+ - documents the cpusets feature; assign CPUs and Mem to a set of tasks.
+devices.txt
+ - Device Whitelist Controller; description, interface and security.
+freezer-subsystem.txt
+ - checkpointing; rationale to not use signals, interface.
+memcg_test.txt
+ - Memory Resource Controller; implementation details.
+memory.txt
+ - Memory Resource Controller; design, accounting, interface, testing.
+resource_counter.txt
+ - Resource Counter API.
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 93feb8444489..6eb1a97e88ce 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -56,7 +56,7 @@ hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.
-At any one time there may be multiple active hierachies of task
+At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.
User level code may create and destroy cgroups by name in an
@@ -124,10 +124,10 @@ following lines:
/ \
Prof (15%) students (5%)
-Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
into NFS network class.
-At the same time firefox/lynx will share an appropriate CPU/Memory class
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).
With the ability to classify tasks differently for different resources
@@ -325,7 +325,7 @@ and then start a subshell 'sh' in that cgroup:
Creating, modifying, using the cgroups can be done through the cgroup
virtual filesystem.
-To mount a cgroup hierarchy will all available subsystems, type:
+To mount a cgroup hierarchy with all available subsystems, type:
# mount -t cgroup xxx /dev/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
@@ -333,12 +333,23 @@ The "xxx" is not interpreted by the cgroup code, but will appear in
To mount a cgroup hierarchy with just the cpuset and numtasks
subsystems, type:
-# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
To change the set of subsystems bound to a mounted hierarchy, just
remount with different options:
+# mount -o remount,cpuset,ns hier1 /dev/cgroup
-# mount -o remount,cpuset,ns /dev/cgroup
+Now memory is removed from the hierarchy and ns is added.
+
+Note this will add ns to the hierarchy but won't remove memory or
+cpuset, because the new options are appended to the old ones:
+# mount -o remount,ns /dev/cgroup
+
+To Specify a hierarchy's release_agent:
+# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+ xxx /dev/cgroup
+
+Note that specifying 'release_agent' more than once will return failure.
Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
@@ -349,6 +360,11 @@ Then under /dev/cgroup you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /dev/cgroup
is the cgroup that holds the whole system.
+If you want to change the value of release_agent:
+# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+
+It can also be changed via remount.
+
If you want to create a new cgroup under /dev/cgroup:
# cd /dev/cgroup
# mkdir my_cgroup
@@ -476,11 +492,13 @@ cgroup->parent is still valid. (Note - can also be called for a
newly-created cgroup if an error occurs after this subsystem's
create() method has been called for the new cgroup).
-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
Called before checking the reference count on each subsystem. This may
be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called multiple times against a cgroup.
int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct task_struct *task)
@@ -521,7 +539,7 @@ always handled well.
void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
(cgroup_mutex held by caller)
-Called at the end of cgroup_clone() to do any paramater
+Called at the end of cgroup_clone() to do any parameter
initialization which might be required before a task could attach. For
example in cpusets, no task may attach before 'cpus' and 'mems' are set
up.
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index bb775fbe43d7..8b930946c52a 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -30,3 +30,21 @@ The above steps create a new group g1 and move the current shell
process (bash) into it. CPU time consumed by this bash and its children
can be obtained from g1/cpuacct.usage and the same is accumulated in
/cgroups/cpuacct.usage also.
+
+cpuacct.stat file lists a few statistics which further divide the
+CPU time obtained by the cgroup into user and system times. Currently
+the following statistics are supported:
+
+user: Time spent by tasks of the cgroup in user mode.
+system: Time spent by tasks of the cgroup in kernel mode.
+
+user and system are in USER_HZ unit.
+
+cpuacct controller uses percpu_counter interface to collect user and
+system times. This has two side effects:
+
+- It is theoretically possible to see wrong values for user and system times.
+ This is because percpu_counter_read() on 32bit systems isn't safe
+ against concurrent writes.
+- It is possible to see slightly outdated values for user and system times
+ due to the batch processing nature of percpu_counter.
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 0611e9528c7c..f9ca389dddf4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -131,7 +131,7 @@ Cpusets extends these two mechanisms as follows:
- The hierarchy of cpusets can be mounted at /dev/cpuset, for
browsing and manipulation from user space.
- A cpuset may be marked exclusive, which ensures that no other
- cpuset (except direct ancestors and descendents) may contain
+ cpuset (except direct ancestors and descendants) may contain
any overlapping CPUs or Memory Nodes.
- You can list all the tasks (by pid) attached to any cpuset.
@@ -226,7 +226,7 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
--------------------------------
If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendent, may share any of the same CPUs or
+a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.
A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
@@ -427,7 +427,7 @@ child cpusets have this flag enabled.
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
-the particulars of this flag setting in descendent cpusets. Even if
+the particulars of this flag setting in descendant cpusets. Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.
@@ -531,9 +531,9 @@ be idle.
Of course it takes some searching cost to find movable tasks and/or
idle CPUs, the scheduler might not search all CPUs in the domain
-everytime. In fact, in some architectures, the searching ranges on
+every time. In fact, in some architectures, the searching ranges on
events are limited in the same socket or node where the CPU locates,
-while the load balance on tick searchs all.
+while the load balance on tick searches all.
For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
is idle while CPU X and the siblings are busy, scheduler can't migrate
@@ -601,7 +601,7 @@ its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
-was MPOL_BIND bound to the new cpuset (even though its numa placement,
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the tasks
memory placement, as above, the next time that the kernel attempts
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 7cc6e6a60672..57ca4c89fe5c 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -42,7 +42,7 @@ suffice, but we can decide the best way to adequately restrict
movement as people get some experience with this. We may just want
to require CAP_SYS_ADMIN, which at least is a separate bit from
CAP_MKNOD. We may want to just refuse moving to a cgroup which
-isn't a descendent of the current one. Or we may want to use
+isn't a descendant of the current one. Or we may want to use
CAP_MAC_ADMIN, since we really are trying to lock down root.
CAP_SYS_ADMIN is needed to modify the whitelist or move another
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 523a9c16c400..72db89ed0609 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,5 +1,5 @@
Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/1/19
+Last Updated: 2009/1/20
Base Kernel Version: based on 2.6.29-rc2.
Because VM is getting complex (one of reasons is memcg...), memcg's behavior
@@ -356,7 +356,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
(Shell-B)
# move all tasks in /cgroup/test to /cgroup
# /sbin/swapoff -a
- # rmdir /test/cgroup
+ # rmdir /cgroup/test
# kill malloc task.
Of course, tmpfs v.s. swapoff test should be tested, too.
+
+ 9.8 OOM-Killer
+ Out-of-memory caused by memcg's limit will kill tasks under
+ the memcg. When hierarchy is used, a task under hierarchy
+ will be killed by the kernel.
+ In this case, panic_on_oom shouldn't be invoked and tasks
+ in other groups shouldn't be killed.
+
+ It's not difficult to cause OOM under memcg as following.
+ Case A) when you can swapoff
+ #swapoff -a
+ #echo 50M > /memory.limit_in_bytes
+ run 51M of malloc
+
+ Case B) when you use mem+swap limitation.
+ #echo 50M > memory.limit_in_bytes
+ #echo 50M > memory.memsw.limit_in_bytes
+ run 51M of malloc
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index e1501964df1e..1a608877b14e 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -6,15 +6,14 @@ used here with the memory controller that is used in hardware.
Salient features
-a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
+a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
+ Swap Cache memory pages.
b. The infrastructure allows easy addition of other types of memory to control
c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
-NOTE: Swap Cache (unmapped) is not accounted now.
-
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
@@ -290,34 +289,44 @@ will be charged as a new owner of it.
moved to the parent. If you want to avoid that, force_empty will be useful.
5.2 stat file
- memory.stat file includes following statistics (now)
- cache - # of pages from page-cache and shmem.
- rss - # of pages from anonymous memory.
- pgpgin - # of event of charging
- pgpgout - # of event of uncharging
- active_anon - # of pages on active lru of anon, shmem.
- inactive_anon - # of pages on active lru of anon, shmem
- active_file - # of pages on active lru of file-cache
- inactive_file - # of pages on inactive lru of file cache
- unevictable - # of pages cannot be reclaimed.(mlocked etc)
-
- Below is depend on CONFIG_DEBUG_VM.
- inactive_ratio - VM inernal parameter. (see mm/page_alloc.c)
- recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
- recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
- recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
- recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
-
- Memo:
+
+memory.stat file includes following statistics
+
+cache - # of bytes of page cache memory.
+rss - # of bytes of anonymous and swap cache memory.
+pgpgin - # of pages paged in (equivalent to # of charging events).
+pgpgout - # of pages paged out (equivalent to # of uncharging events).
+active_anon - # of bytes of anonymous and swap cache memory on active
+ lru list.
+inactive_anon - # of bytes of anonymous memory and swap cache memory on
+ inactive lru list.
+active_file - # of bytes of file-backed memory on active lru list.
+inactive_file - # of bytes of file-backed memory on inactive lru list.
+unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
+
+The following additional stats are dependent on CONFIG_DEBUG_VM.
+
+inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
+recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
+recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
+recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
+recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
+
+Memo:
recent_rotated means recent frequency of lru rotation.
recent_scanned means recent # of scans to lru.
showing for better debug please see the code for meanings.
+Note:
+ Only anonymous and swap cache memory is listed as part of 'rss' stat.
+ This should not be confused with the true 'resident set size' or the
+ amount of physical memory used by the cgroup. Per-cgroup rss
+ accounting is not done yet.
5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
- Following cgroup's swapiness can't be changed.
+ Following cgroups' swapiness can't be changed.
- root cgroup (uses /proc/sys/vm/swappiness).
- a cgroup which uses hierarchy and it has child cgroup.
- a cgroup which uses hierarchy and not the root of hierarchy.
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index f196ac1d7d25..95b24d766eab 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -47,13 +47,18 @@ to work with it.
2. Basic accounting routines
- a. void res_counter_init(struct res_counter *rc)
+ a. void res_counter_init(struct res_counter *rc,
+ struct res_counter *rc_parent)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
- b. int res_counter_charge[_locked]
- (struct res_counter *rc, unsigned long val)
+ The struct res_counter *parent can be used to define a hierarchical
+ child -> parent relationship directly in the res_counter structure,
+ NULL can be used to define no relationship.
+
+ c. int res_counter_charge(struct res_counter *rc, unsigned long val,
+ struct res_counter **limit_fail_at)
When a resource is about to be allocated it has to be accounted
with the appropriate resource counter (controller should determine
@@ -67,15 +72,25 @@ to work with it.
* if the charging is performed first, then it should be uncharged
on error path (if the one is called).
- c. void res_counter_uncharge[_locked]
+ If the charging fails and a hierarchical dependency exists, the
+ limit_fail_at parameter is set to the particular res_counter element
+ where the charging failed.
+
+ d. int res_counter_charge_locked
+ (struct res_counter *rc, unsigned long val)
+
+ The same as res_counter_charge(), but it must not acquire/release the
+ res_counter->lock internally (it must be called with res_counter->lock
+ held).
+
+ e. void res_counter_uncharge[_locked]
(struct res_counter *rc, unsigned long val)
When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
"uncharging".
- The _locked routines imply that the res_counter->lock is taken.
-
+ The _locked routines imply that the res_counter->lock is taken.
2.1 Other accounting routines