From 3e32cb2e0a12b6915056ff04601cf1bb9b44f967 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Dec 2014 15:42:31 -0800 Subject: mm: memcontrol: lockless page counters Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144-threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark: vanilla: 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% ) 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% ) 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% ) 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% ) 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% ) stalled-cycles-frontend stalled-cycles-backend 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% ) 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% ) 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% ) 132.474343877 seconds time elapsed ( +- 0.21% ) lockless: 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% ) 832,850 context-switches # 0.068 K/sec ( +- 0.54% ) 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% ) 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% ) 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% ) stalled-cycles-frontend stalled-cycles-backend 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% ) 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% ) 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% ) 91.369330729 seconds time elapsed ( +- 0.45% ) On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API: - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do() - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel() - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves - res_counter_memparse_write_strategy() is replaced by page_counter_limit(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitely requested hard upper limits. - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim upto 2MB (4MB) before the limit would have been reached. This should be acceptable. [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse] [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Acked-by: Vladimir Davydov Cc: Tejun Heo Cc: David Rientjes Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 02ab997a1ed2..f624727ab404 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -52,9 +52,9 @@ Brief summary of control files. tasks # attach a task(thread) and show list of threads cgroup.procs # show list of processes cgroup.event_control # an interface for event_fd() - memory.usage_in_bytes # show current res_counter usage for memory + memory.usage_in_bytes # show current usage for memory (See 5.5 for details) - memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap + memory.memsw.usage_in_bytes # show current usage for memory+Swap (See 5.5 for details) memory.limit_in_bytes # set/show limit of memory usage memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage -- cgit v1.2.3-59-g8ed1b From 71f87bee38edddb21d97895fa938744cf3f477bb Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Dec 2014 15:42:34 -0800 Subject: mm: hugetlb_cgroup: convert to lockless page counters Abandon the spinlock-protected byte counters in favor of the unlocked page counters in the hugetlb controller as well. Signed-off-by: Johannes Weiner Reviewed-by: Vladimir Davydov Acked-by: Michal Hocko Cc: Tejun Heo Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/hugetlb.txt | 2 +- include/linux/hugetlb_cgroup.h | 1 - init/Kconfig | 3 +- mm/hugetlb_cgroup.c | 103 +++++++++++++++++++++----------------- 4 files changed, 61 insertions(+), 48 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt index a9faaca1f029..106245c3aecc 100644 --- a/Documentation/cgroups/hugetlb.txt +++ b/Documentation/cgroups/hugetlb.txt @@ -29,7 +29,7 @@ Brief summary of control files hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb..usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb + hugetlb..usage_in_bytes # show current usage for "hugepagesize" hugetlb hugetlb..failcnt # show the number of allocation failure due to HugeTLB limit For a system supporting two hugepage size (16M and 16G) the control diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 0129f89cf98d..bcc853eccc85 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -16,7 +16,6 @@ #define _LINUX_HUGETLB_CGROUP_H #include -#include struct hugetlb_cgroup; /* diff --git a/init/Kconfig b/init/Kconfig index fd9e88791ba4..a60d1442d1df 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1051,7 +1051,8 @@ config MEMCG_KMEM config CGROUP_HUGETLB bool "HugeTLB Resource Controller for Control Groups" - depends on RESOURCE_COUNTERS && HUGETLB_PAGE + depends on HUGETLB_PAGE + select PAGE_COUNTER default n help Provides a cgroup Resource Controller for HugeTLB pages. diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index a67c26e0f360..037e1c00a5b7 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -14,6 +14,7 @@ */ #include +#include #include #include #include @@ -23,7 +24,7 @@ struct hugetlb_cgroup { /* * the counter to account for hugepages from hugetlb. */ - struct res_counter hugepage[HUGE_MAX_HSTATE]; + struct page_counter hugepage[HUGE_MAX_HSTATE]; }; #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) @@ -60,7 +61,7 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx; for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0) + if (page_counter_read(&h_cg->hugepage[idx])) return true; } return false; @@ -79,12 +80,12 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) if (parent_h_cgroup) { for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) - res_counter_init(&h_cgroup->hugepage[idx], - &parent_h_cgroup->hugepage[idx]); + page_counter_init(&h_cgroup->hugepage[idx], + &parent_h_cgroup->hugepage[idx]); } else { root_h_cgroup = h_cgroup; for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) - res_counter_init(&h_cgroup->hugepage[idx], NULL); + page_counter_init(&h_cgroup->hugepage[idx], NULL); } return &h_cgroup->css; } @@ -108,9 +109,8 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, struct page *page) { - int csize; - struct res_counter *counter; - struct res_counter *fail_res; + unsigned int nr_pages; + struct page_counter *counter; struct hugetlb_cgroup *page_hcg; struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); @@ -123,15 +123,15 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, if (!page_hcg || page_hcg != h_cg) goto out; - csize = PAGE_SIZE << compound_order(page); + nr_pages = 1 << compound_order(page); if (!parent) { parent = root_h_cgroup; /* root has no limit */ - res_counter_charge_nofail(&parent->hugepage[idx], - csize, &fail_res); + page_counter_charge(&parent->hugepage[idx], nr_pages); } counter = &h_cg->hugepage[idx]; - res_counter_uncharge_until(counter, counter->parent, csize); + /* Take the pages off the local counter */ + page_counter_cancel(counter, nr_pages); set_hugetlb_cgroup(page, parent); out: @@ -166,9 +166,8 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup **ptr) { int ret = 0; - struct res_counter *fail_res; + struct page_counter *counter; struct hugetlb_cgroup *h_cg = NULL; - unsigned long csize = nr_pages * PAGE_SIZE; if (hugetlb_cgroup_disabled()) goto done; @@ -187,7 +186,7 @@ again: } rcu_read_unlock(); - ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res); + ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); css_put(&h_cg->css); done: *ptr = h_cg; @@ -213,7 +212,6 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) { struct hugetlb_cgroup *h_cg; - unsigned long csize = nr_pages * PAGE_SIZE; if (hugetlb_cgroup_disabled()) return; @@ -222,61 +220,76 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, if (unlikely(!h_cg)) return; set_hugetlb_cgroup(page, NULL); - res_counter_uncharge(&h_cg->hugepage[idx], csize); + page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); return; } void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg) { - unsigned long csize = nr_pages * PAGE_SIZE; - if (hugetlb_cgroup_disabled() || !h_cg) return; if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return; - res_counter_uncharge(&h_cg->hugepage[idx], csize); + page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); return; } +enum { + RES_USAGE, + RES_LIMIT, + RES_MAX_USAGE, + RES_FAILCNT, +}; + static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { - int idx, name; + struct page_counter *counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); - idx = MEMFILE_IDX(cft->private); - name = MEMFILE_ATTR(cft->private); + counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; - return res_counter_read_u64(&h_cg->hugepage[idx], name); + switch (MEMFILE_ATTR(cft->private)) { + case RES_USAGE: + return (u64)page_counter_read(counter) * PAGE_SIZE; + case RES_LIMIT: + return (u64)counter->limit * PAGE_SIZE; + case RES_MAX_USAGE: + return (u64)counter->watermark * PAGE_SIZE; + case RES_FAILCNT: + return counter->failcnt; + default: + BUG(); + } } +static DEFINE_MUTEX(hugetlb_limit_mutex); + static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - int idx, name, ret; - unsigned long long val; + int ret, idx; + unsigned long nr_pages; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); + if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ + return -EINVAL; + buf = strstrip(buf); + ret = page_counter_memparse(buf, &nr_pages); + if (ret) + return ret; + idx = MEMFILE_IDX(of_cft(of)->private); - name = MEMFILE_ATTR(of_cft(of)->private); - switch (name) { + switch (MEMFILE_ATTR(of_cft(of)->private)) { case RES_LIMIT: - if (hugetlb_cgroup_is_root(h_cg)) { - /* Can't set limit on root */ - ret = -EINVAL; - break; - } - /* This function does all necessary parse...reuse it */ - ret = res_counter_memparse_write_strategy(buf, &val); - if (ret) - break; - val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx])); - ret = res_counter_set_limit(&h_cg->hugepage[idx], val); + mutex_lock(&hugetlb_limit_mutex); + ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages); + mutex_unlock(&hugetlb_limit_mutex); break; default: ret = -EINVAL; @@ -288,18 +301,18 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - int idx, name, ret = 0; + int ret = 0; + struct page_counter *counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); - idx = MEMFILE_IDX(of_cft(of)->private); - name = MEMFILE_ATTR(of_cft(of)->private); + counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; - switch (name) { + switch (MEMFILE_ATTR(of_cft(of)->private)) { case RES_MAX_USAGE: - res_counter_reset_max(&h_cg->hugepage[idx]); + page_counter_reset_watermark(counter); break; case RES_FAILCNT: - res_counter_reset_failcnt(&h_cg->hugepage[idx]); + counter->failcnt = 0; break; default: ret = -EINVAL; -- cgit v1.2.3-59-g8ed1b From 5b1efc027c0b51ca3e76f4e00c83358f8349f543 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Dec 2014 15:42:37 -0800 Subject: kernel: res_counter: remove the unused API All memory accounting and limiting has been switched over to the lockless page counters. Bye, res_counter! [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt] [mhocko@suse.cz: ditch the last remainings of res_counter] Signed-off-by: Johannes Weiner Acked-by: Vladimir Davydov Acked-by: Michal Hocko Cc: Tejun Heo Cc: David Rientjes Cc: Paul Bolle Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 17 ++- Documentation/cgroups/resource_counter.txt | 197 ------------------------- include/linux/res_counter.h | 223 ----------------------------- init/Kconfig | 6 - kernel/Makefile | 1 - kernel/res_counter.c | 211 --------------------------- 6 files changed, 8 insertions(+), 647 deletions(-) delete mode 100644 Documentation/cgroups/resource_counter.txt delete mode 100644 include/linux/res_counter.h delete mode 100644 kernel/res_counter.c (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index f624727ab404..67613ff0270c 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -116,16 +116,16 @@ The memory controller is the first controller developed. 2.1. Design -The core of the design is a counter called the res_counter. The res_counter -tracks the current memory usage and limit of the group of processes associated -with the controller. Each cgroup has a memory controller specific data -structure (mem_cgroup) associated with it. +The core of the design is a counter called the page_counter. The +page_counter tracks the current memory usage and limit of the group of +processes associated with the controller. Each cgroup has a memory controller +specific data structure (mem_cgroup) associated with it. 2.2. Accounting +--------------------+ - | mem_cgroup | - | (res_counter) | + | mem_cgroup | + | (page_counter) | +--------------------+ / ^ \ / | \ @@ -352,9 +352,8 @@ set: 0. Configuration a. Enable CONFIG_CGROUPS -b. Enable CONFIG_RESOURCE_COUNTERS -c. Enable CONFIG_MEMCG -d. Enable CONFIG_MEMCG_SWAP (to use swap extension) +b. Enable CONFIG_MEMCG +c. Enable CONFIG_MEMCG_SWAP (to use swap extension) d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt deleted file mode 100644 index 762ca54eb929..000000000000 --- a/Documentation/cgroups/resource_counter.txt +++ /dev/null @@ -1,197 +0,0 @@ - - The Resource Counter - -The resource counter, declared at include/linux/res_counter.h, -is supposed to facilitate the resource management by controllers -by providing common stuff for accounting. - -This "stuff" includes the res_counter structure and routines -to work with it. - - - -1. Crucial parts of the res_counter structure - - a. unsigned long long usage - - The usage value shows the amount of a resource that is consumed - by a group at a given time. The units of measurement should be - determined by the controller that uses this counter. E.g. it can - be bytes, items or any other unit the controller operates on. - - b. unsigned long long max_usage - - The maximal value of the usage over time. - - This value is useful when gathering statistical information about - the particular group, as it shows the actual resource requirements - for a particular group, not just some usage snapshot. - - c. unsigned long long limit - - The maximal allowed amount of resource to consume by the group. In - case the group requests for more resources, so that the usage value - would exceed the limit, the resource allocation is rejected (see - the next section). - - d. unsigned long long failcnt - - The failcnt stands for "failures counter". This is the number of - resource allocation attempts that failed. - - c. spinlock_t lock - - Protects changes of the above values. - - - -2. Basic accounting routines - - a. void res_counter_init(struct res_counter *rc, - struct res_counter *rc_parent) - - Initializes the resource counter. As usual, should be the first - routine called for a new counter. - - The struct res_counter *parent can be used to define a hierarchical - child -> parent relationship directly in the res_counter structure, - NULL can be used to define no relationship. - - c. int res_counter_charge(struct res_counter *rc, unsigned long val, - struct res_counter **limit_fail_at) - - When a resource is about to be allocated it has to be accounted - with the appropriate resource counter (controller should determine - which one to use on its own). This operation is called "charging". - - This is not very important which operation - resource allocation - or charging - is performed first, but - * if the allocation is performed first, this may create a - temporary resource over-usage by the time resource counter is - charged; - * if the charging is performed first, then it should be uncharged - on error path (if the one is called). - - If the charging fails and a hierarchical dependency exists, the - limit_fail_at parameter is set to the particular res_counter element - where the charging failed. - - d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val) - - When a resource is released (freed) it should be de-accounted - from the resource counter it was accounted to. This is called - "uncharging". The return value of this function indicate the amount - of charges still present in the counter. - - The _locked routines imply that the res_counter->lock is taken. - - e. u64 res_counter_uncharge_until - (struct res_counter *rc, struct res_counter *top, - unsigned long val) - - Almost same as res_counter_uncharge() but propagation of uncharge - stops when rc == top. This is useful when kill a res_counter in - child cgroup. - - 2.1 Other accounting routines - - There are more routines that may help you with common needs, like - checking whether the limit is reached or resetting the max_usage - value. They are all declared in include/linux/res_counter.h. - - - -3. Analyzing the resource counter registrations - - a. If the failcnt value constantly grows, this means that the counter's - limit is too tight. Either the group is misbehaving and consumes too - many resources, or the configuration is not suitable for the group - and the limit should be increased. - - b. The max_usage value can be used to quickly tune the group. One may - set the limits to maximal values and either load the container with - a common pattern or leave one for a while. After this the max_usage - value shows the amount of memory the container would require during - its common activity. - - Setting the limit a bit above this value gives a pretty good - configuration that works in most of the cases. - - c. If the max_usage is much less than the limit, but the failcnt value - is growing, then the group tries to allocate a big chunk of resource - at once. - - d. If the max_usage is much less than the limit, but the failcnt value - is 0, then this group is given too high limit, that it does not - require. It is better to lower the limit a bit leaving more resource - for other groups. - - - -4. Communication with the control groups subsystem (cgroups) - -All the resource controllers that are using cgroups and resource counters -should provide files (in the cgroup filesystem) to work with the resource -counter fields. They are recommended to adhere to the following rules: - - a. File names - - Field name File name - --------------------------------------------------- - usage usage_in_ - max_usage max_usage_in_ - limit limit_in_ - failcnt failcnt - lock no file :) - - b. Reading from file should show the corresponding field value in the - appropriate format. - - c. Writing to file - - Field Expected behavior - ---------------------------------- - usage prohibited - max_usage reset to usage - limit set the limit - failcnt reset to zero - - - -5. Usage example - - a. Declare a task group (take a look at cgroups subsystem for this) and - fold a res_counter into it - - struct my_group { - struct res_counter res; - - - } - - b. Put hooks in resource allocation/release paths - - int alloc_something(...) - { - if (res_counter_charge(res_counter_ptr, amount) < 0) - return -ENOMEM; - - - } - - void release_something(...) - { - res_counter_uncharge(res_counter_ptr, amount); - - - } - - In order to keep the usage value self-consistent, both the - "res_counter_ptr" and the "amount" in release_something() should be - the same as they were in the alloc_something() when the releasing - resource was allocated. - - c. Provide the way to read res_counter values and set them (the cgroups - still can help with it). - - c. Compile and run :) diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h deleted file mode 100644 index 56b7bc32db4f..000000000000 --- a/include/linux/res_counter.h +++ /dev/null @@ -1,223 +0,0 @@ -#ifndef __RES_COUNTER_H__ -#define __RES_COUNTER_H__ - -/* - * Resource Counters - * Contain common data types and routines for resource accounting - * - * Copyright 2007 OpenVZ SWsoft Inc - * - * Author: Pavel Emelianov - * - * See Documentation/cgroups/resource_counter.txt for more - * info about what this counter is. - */ - -#include -#include - -/* - * The core object. the cgroup that wishes to account for some - * resource may include this counter into its structures and use - * the helpers described beyond - */ - -struct res_counter { - /* - * the current resource consumption level - */ - unsigned long long usage; - /* - * the maximal value of the usage from the counter creation - */ - unsigned long long max_usage; - /* - * the limit that usage cannot exceed - */ - unsigned long long limit; - /* - * the limit that usage can be exceed - */ - unsigned long long soft_limit; - /* - * the number of unsuccessful attempts to consume the resource - */ - unsigned long long failcnt; - /* - * the lock to protect all of the above. - * the routines below consider this to be IRQ-safe - */ - spinlock_t lock; - /* - * Parent counter, used for hierarchial resource accounting - */ - struct res_counter *parent; -}; - -#define RES_COUNTER_MAX ULLONG_MAX - -/** - * Helpers to interact with userspace - * res_counter_read_u64() - returns the value of the specified member. - * res_counter_read/_write - put/get the specified fields from the - * res_counter struct to/from the user - * - * @counter: the counter in question - * @member: the field to work with (see RES_xxx below) - * @buf: the buffer to opeate on,... - * @nbytes: its size... - * @pos: and the offset. - */ - -u64 res_counter_read_u64(struct res_counter *counter, int member); - -ssize_t res_counter_read(struct res_counter *counter, int member, - const char __user *buf, size_t nbytes, loff_t *pos, - int (*read_strategy)(unsigned long long val, char *s)); - -int res_counter_memparse_write_strategy(const char *buf, - unsigned long long *res); - -/* - * the field descriptors. one for each member of res_counter - */ - -enum { - RES_USAGE, - RES_MAX_USAGE, - RES_LIMIT, - RES_FAILCNT, - RES_SOFT_LIMIT, -}; - -/* - * helpers for accounting - */ - -void res_counter_init(struct res_counter *counter, struct res_counter *parent); - -/* - * charge - try to consume more resource. - * - * @counter: the counter - * @val: the amount of the resource. each controller defines its own - * units, e.g. numbers, bytes, Kbytes, etc - * - * returns 0 on success and <0 if the counter->usage will exceed the - * counter->limit - * - * charge_nofail works the same, except that it charges the resource - * counter unconditionally, and returns < 0 if the after the current - * charge we are over limit. - */ - -int __must_check res_counter_charge(struct res_counter *counter, - unsigned long val, struct res_counter **limit_fail_at); -int res_counter_charge_nofail(struct res_counter *counter, - unsigned long val, struct res_counter **limit_fail_at); - -/* - * uncharge - tell that some portion of the resource is released - * - * @counter: the counter - * @val: the amount of the resource - * - * these calls check for usage underflow and show a warning on the console - * - * returns the total charges still present in @counter. - */ - -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); - -u64 res_counter_uncharge_until(struct res_counter *counter, - struct res_counter *top, - unsigned long val); -/** - * res_counter_margin - calculate chargeable space of a counter - * @cnt: the counter - * - * Returns the difference between the hard limit and the current usage - * of resource counter @cnt. - */ -static inline unsigned long long res_counter_margin(struct res_counter *cnt) -{ - unsigned long long margin; - unsigned long flags; - - spin_lock_irqsave(&cnt->lock, flags); - if (cnt->limit > cnt->usage) - margin = cnt->limit - cnt->usage; - else - margin = 0; - spin_unlock_irqrestore(&cnt->lock, flags); - return margin; -} - -/** - * Get the difference between the usage and the soft limit - * @cnt: The counter - * - * Returns 0 if usage is less than or equal to soft limit - * The difference between usage and soft limit, otherwise. - */ -static inline unsigned long long -res_counter_soft_limit_excess(struct res_counter *cnt) -{ - unsigned long long excess; - unsigned long flags; - - spin_lock_irqsave(&cnt->lock, flags); - if (cnt->usage <= cnt->soft_limit) - excess = 0; - else - excess = cnt->usage - cnt->soft_limit; - spin_unlock_irqrestore(&cnt->lock, flags); - return excess; -} - -static inline void res_counter_reset_max(struct res_counter *cnt) -{ - unsigned long flags; - - spin_lock_irqsave(&cnt->lock, flags); - cnt->max_usage = cnt->usage; - spin_unlock_irqrestore(&cnt->lock, flags); -} - -static inline void res_counter_reset_failcnt(struct res_counter *cnt) -{ - unsigned long flags; - - spin_lock_irqsave(&cnt->lock, flags); - cnt->failcnt = 0; - spin_unlock_irqrestore(&cnt->lock, flags); -} - -static inline int res_counter_set_limit(struct res_counter *cnt, - unsigned long long limit) -{ - unsigned long flags; - int ret = -EBUSY; - - spin_lock_irqsave(&cnt->lock, flags); - if (cnt->usage <= limit) { - cnt->limit = limit; - ret = 0; - } - spin_unlock_irqrestore(&cnt->lock, flags); - return ret; -} - -static inline int -res_counter_set_soft_limit(struct res_counter *cnt, - unsigned long long soft_limit) -{ - unsigned long flags; - - spin_lock_irqsave(&cnt->lock, flags); - cnt->soft_limit = soft_limit; - spin_unlock_irqrestore(&cnt->lock, flags); - return 0; -} - -#endif diff --git a/init/Kconfig b/init/Kconfig index a60d1442d1df..1761c72bc1a0 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -972,12 +972,6 @@ config CGROUP_CPUACCT Provides a simple Resource Controller for monitoring the total CPU consumed by the tasks in a cgroup. -config RESOURCE_COUNTERS - bool "Resource counters" - help - This option enables controller independent resource accounting - infrastructure that works with cgroups. - config PAGE_COUNTER bool diff --git a/kernel/Makefile b/kernel/Makefile index 17ea6d4a9a24..a59481a3fa6c 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -57,7 +57,6 @@ obj-$(CONFIG_UTS_NS) += utsname.o obj-$(CONFIG_USER_NS) += user_namespace.o obj-$(CONFIG_PID_NS) += pid_namespace.o obj-$(CONFIG_IKCONFIG) += configs.o -obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o obj-$(CONFIG_SMP) += stop_machine.o obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o obj-$(CONFIG_AUDIT) += audit.o auditfilter.o diff --git a/kernel/res_counter.c b/kernel/res_counter.c deleted file mode 100644 index e791130f85a7..000000000000 --- a/kernel/res_counter.c +++ /dev/null @@ -1,211 +0,0 @@ -/* - * resource cgroups - * - * Copyright 2007 OpenVZ SWsoft Inc - * - * Author: Pavel Emelianov - * - */ - -#include -#include -#include -#include -#include -#include - -void res_counter_init(struct res_counter *counter, struct res_counter *parent) -{ - spin_lock_init(&counter->lock); - counter->limit = RES_COUNTER_MAX; - counter->soft_limit = RES_COUNTER_MAX; - counter->parent = parent; -} - -static u64 res_counter_uncharge_locked(struct res_counter *counter, - unsigned long val) -{ - if (WARN_ON(counter->usage < val)) - val = counter->usage; - - counter->usage -= val; - return counter->usage; -} - -static int res_counter_charge_locked(struct res_counter *counter, - unsigned long val, bool force) -{ - int ret = 0; - - if (counter->usage + val > counter->limit) { - counter->failcnt++; - ret = -ENOMEM; - if (!force) - return ret; - } - - counter->usage += val; - if (counter->usage > counter->max_usage) - counter->max_usage = counter->usage; - return ret; -} - -static int __res_counter_charge(struct res_counter *counter, unsigned long val, - struct res_counter **limit_fail_at, bool force) -{ - int ret, r; - unsigned long flags; - struct res_counter *c, *u; - - r = ret = 0; - *limit_fail_at = NULL; - local_irq_save(flags); - for (c = counter; c != NULL; c = c->parent) { - spin_lock(&c->lock); - r = res_counter_charge_locked(c, val, force); - spin_unlock(&c->lock); - if (r < 0 && !ret) { - ret = r; - *limit_fail_at = c; - if (!force) - break; - } - } - - if (ret < 0 && !force) { - for (u = counter; u != c; u = u->parent) { - spin_lock(&u->lock); - res_counter_uncharge_locked(u, val); - spin_unlock(&u->lock); - } - } - local_irq_restore(flags); - - return ret; -} - -int res_counter_charge(struct res_counter *counter, unsigned long val, - struct res_counter **limit_fail_at) -{ - return __res_counter_charge(counter, val, limit_fail_at, false); -} - -int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, - struct res_counter **limit_fail_at) -{ - return __res_counter_charge(counter, val, limit_fail_at, true); -} - -u64 res_counter_uncharge_until(struct res_counter *counter, - struct res_counter *top, - unsigned long val) -{ - unsigned long flags; - struct res_counter *c; - u64 ret = 0; - - local_irq_save(flags); - for (c = counter; c != top; c = c->parent) { - u64 r; - spin_lock(&c->lock); - r = res_counter_uncharge_locked(c, val); - if (c == counter) - ret = r; - spin_unlock(&c->lock); - } - local_irq_restore(flags); - return ret; -} - -u64 res_counter_uncharge(struct res_counter *counter, unsigned long val) -{ - return res_counter_uncharge_until(counter, NULL, val); -} - -static inline unsigned long long * -res_counter_member(struct res_counter *counter, int member) -{ - switch (member) { - case RES_USAGE: - return &counter->usage; - case RES_MAX_USAGE: - return &counter->max_usage; - case RES_LIMIT: - return &counter->limit; - case RES_FAILCNT: - return &counter->failcnt; - case RES_SOFT_LIMIT: - return &counter->soft_limit; - }; - - BUG(); - return NULL; -} - -ssize_t res_counter_read(struct res_counter *counter, int member, - const char __user *userbuf, size_t nbytes, loff_t *pos, - int (*read_strategy)(unsigned long long val, char *st_buf)) -{ - unsigned long long *val; - char buf[64], *s; - - s = buf; - val = res_counter_member(counter, member); - if (read_strategy) - s += read_strategy(*val, s); - else - s += sprintf(s, "%llu\n", *val); - return simple_read_from_buffer((void __user *)userbuf, nbytes, - pos, buf, s - buf); -} - -#if BITS_PER_LONG == 32 -u64 res_counter_read_u64(struct res_counter *counter, int member) -{ - unsigned long flags; - u64 ret; - - spin_lock_irqsave(&counter->lock, flags); - ret = *res_counter_member(counter, member); - spin_unlock_irqrestore(&counter->lock, flags); - - return ret; -} -#else -u64 res_counter_read_u64(struct res_counter *counter, int member) -{ - return *res_counter_member(counter, member); -} -#endif - -int res_counter_memparse_write_strategy(const char *buf, - unsigned long long *resp) -{ - char *end; - unsigned long long res; - - /* return RES_COUNTER_MAX(unlimited) if "-1" is specified */ - if (*buf == '-') { - int rc = kstrtoull(buf + 1, 10, &res); - - if (rc) - return rc; - if (res != 1) - return -EINVAL; - *resp = RES_COUNTER_MAX; - return 0; - } - - res = memparse(buf, &end); - if (*end != '\0') - return -EINVAL; - - if (PAGE_ALIGN(res) >= res) - res = PAGE_ALIGN(res); - else - res = RES_COUNTER_MAX; - - *resp = res; - - return 0; -} -- cgit v1.2.3-59-g8ed1b From 1306a85aed3ec3db98945aafb7dfbe5648a1203c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Dec 2014 15:44:52 -0800 Subject: mm: embed the memcg pointer directly into struct page Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Acked-by: Vladimir Davydov Acked-by: David S. Miller Acked-by: KAMEZAWA Hiroyuki Cc: "Kirill A. Shutemov" Cc: Michal Hocko Cc: Vladimir Davydov Cc: Tejun Heo Cc: Joonsoo Kim Acked-by: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 5 + include/linux/memcontrol.h | 6 +- include/linux/mm_types.h | 5 + include/linux/mmzone.h | 12 -- include/linux/page_cgroup.h | 53 ------- init/main.c | 7 - mm/memcontrol.c | 124 +++++---------- mm/page_alloc.c | 2 - mm/page_cgroup.c | 319 --------------------------------------- 9 files changed, 46 insertions(+), 487 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 67613ff0270c..46b2b5080317 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -1,5 +1,10 @@ Memory Resource Controller +NOTE: This document is hopelessly outdated and it asks for a complete + rewrite. It still contains a useful information so we are keeping it + here but make sure to check the current code if you need a deeper + understanding. + NOTE: The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index de018766be45..c4d080875164 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -25,7 +25,6 @@ #include struct mem_cgroup; -struct page_cgroup; struct page; struct mm_struct; struct kmem_cache; @@ -466,8 +465,6 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) * memcg_kmem_uncharge_pages: uncharge pages from memcg * @page: pointer to struct page being freed * @order: allocation order. - * - * there is no need to specify memcg here, since it is embedded in page_cgroup */ static inline void memcg_kmem_uncharge_pages(struct page *page, int order) @@ -484,8 +481,7 @@ memcg_kmem_uncharge_pages(struct page *page, int order) * * Needs to be called after memcg_kmem_newpage_charge, regardless of success or * failure of the allocation. if @page is NULL, this function will revert the - * charges. Otherwise, it will commit the memcg given by @memcg to the - * corresponding page_cgroup. + * charges. Otherwise, it will commit @page to @memcg. */ static inline void memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 004e9d17b47e..bf9f57529dcf 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -22,6 +22,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mem_cgroup; #define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) #define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \ @@ -167,6 +168,10 @@ struct page { struct page *first_page; /* Compound tail pages */ }; +#ifdef CONFIG_MEMCG + struct mem_cgroup *mem_cgroup; +#endif + /* * On machines where all RAM is mapped into kernel address space, * we can simply calculate the virtual address. On machines with diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ffe66e381c04..3879d7664dfc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -722,9 +722,6 @@ typedef struct pglist_data { int nr_zones; #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ struct page *node_mem_map; -#ifdef CONFIG_MEMCG - struct page_cgroup *node_page_cgroup; -#endif #endif #ifndef CONFIG_NO_BOOTMEM struct bootmem_data *bdata; @@ -1078,7 +1075,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn) #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK) struct page; -struct page_cgroup; struct mem_section { /* * This is, logically, a pointer to an array of struct @@ -1096,14 +1092,6 @@ struct mem_section { /* See declaration of similar field in struct zone */ unsigned long *pageblock_flags; -#ifdef CONFIG_MEMCG - /* - * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use - * section. (see memcontrol.h/page_cgroup.h about this.) - */ - struct page_cgroup *page_cgroup; - unsigned long pad; -#endif /* * WARNING: mem_section must be a power-of-2 in size for the * calculation and use of SECTION_ROOT_MASK to make sense. diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 1289be6b436c..65be35785c86 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -1,59 +1,6 @@ #ifndef __LINUX_PAGE_CGROUP_H #define __LINUX_PAGE_CGROUP_H -struct pglist_data; - -#ifdef CONFIG_MEMCG -struct mem_cgroup; - -/* - * Page Cgroup can be considered as an extended mem_map. - * A page_cgroup page is associated with every page descriptor. The - * page_cgroup helps us identify information about the cgroup - * All page cgroups are allocated at boot or memory hotplug event, - * then the page cgroup for pfn always exists. - */ -struct page_cgroup { - struct mem_cgroup *mem_cgroup; -}; - -extern void pgdat_page_cgroup_init(struct pglist_data *pgdat); - -#ifdef CONFIG_SPARSEMEM -static inline void page_cgroup_init_flatmem(void) -{ -} -extern void page_cgroup_init(void); -#else -extern void page_cgroup_init_flatmem(void); -static inline void page_cgroup_init(void) -{ -} -#endif - -struct page_cgroup *lookup_page_cgroup(struct page *page); - -#else /* !CONFIG_MEMCG */ -struct page_cgroup; - -static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat) -{ -} - -static inline struct page_cgroup *lookup_page_cgroup(struct page *page) -{ - return NULL; -} - -static inline void page_cgroup_init(void) -{ -} - -static inline void page_cgroup_init_flatmem(void) -{ -} -#endif /* CONFIG_MEMCG */ - #include #ifdef CONFIG_MEMCG_SWAP diff --git a/init/main.c b/init/main.c index 321d0ceb26d3..d2e4ead4891f 100644 --- a/init/main.c +++ b/init/main.c @@ -51,7 +51,6 @@ #include #include #include -#include #include #include #include @@ -485,11 +484,6 @@ void __init __weak thread_info_cache_init(void) */ static void __init mm_init(void) { - /* - * page_cgroup requires contiguous pages, - * bigger than MAX_ORDER unless SPARSEMEM. - */ - page_cgroup_init_flatmem(); mem_init(); kmem_cache_init(); percpu_init_late(); @@ -627,7 +621,6 @@ asmlinkage __visible void __init start_kernel(void) initrd_start = 0; } #endif - page_cgroup_init(); debug_objects_mem_init(); kmemleak_init(); setup_per_cpu_pageset(); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 78cb3b05a9fa..b864067791dc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1274,7 +1274,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone) { struct mem_cgroup_per_zone *mz; struct mem_cgroup *memcg; - struct page_cgroup *pc; struct lruvec *lruvec; if (mem_cgroup_disabled()) { @@ -1282,8 +1281,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone) goto out; } - pc = lookup_page_cgroup(page); - memcg = pc->mem_cgroup; + memcg = page->mem_cgroup; /* * Swapcache readahead pages are added to the LRU - and * possibly migrated - before they are charged. @@ -2020,16 +2018,13 @@ struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, unsigned long *flags) { struct mem_cgroup *memcg; - struct page_cgroup *pc; rcu_read_lock(); if (mem_cgroup_disabled()) return NULL; - - pc = lookup_page_cgroup(page); again: - memcg = pc->mem_cgroup; + memcg = page->mem_cgroup; if (unlikely(!memcg)) return NULL; @@ -2038,7 +2033,7 @@ again: return memcg; spin_lock_irqsave(&memcg->move_lock, *flags); - if (memcg != pc->mem_cgroup) { + if (memcg != page->mem_cgroup) { spin_unlock_irqrestore(&memcg->move_lock, *flags); goto again; } @@ -2405,15 +2400,12 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id) struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) { struct mem_cgroup *memcg; - struct page_cgroup *pc; unsigned short id; swp_entry_t ent; VM_BUG_ON_PAGE(!PageLocked(page), page); - pc = lookup_page_cgroup(page); - memcg = pc->mem_cgroup; - + memcg = page->mem_cgroup; if (memcg) { if (!css_tryget_online(&memcg->css)) memcg = NULL; @@ -2463,10 +2455,9 @@ static void unlock_page_lru(struct page *page, int isolated) static void commit_charge(struct page *page, struct mem_cgroup *memcg, bool lrucare) { - struct page_cgroup *pc = lookup_page_cgroup(page); int isolated; - VM_BUG_ON_PAGE(pc->mem_cgroup, page); + VM_BUG_ON_PAGE(page->mem_cgroup, page); /* * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page @@ -2477,7 +2468,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg, /* * Nobody should be changing or seriously looking at - * pc->mem_cgroup at this point: + * page->mem_cgroup at this point: * * - the page is uncharged * @@ -2489,7 +2480,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg, * - a page cache insertion, a swapin fault, or a migration * have the page locked */ - pc->mem_cgroup = memcg; + page->mem_cgroup = memcg; if (lrucare) unlock_page_lru(page, isolated); @@ -2972,8 +2963,6 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) { - struct page_cgroup *pc; - VM_BUG_ON(mem_cgroup_is_root(memcg)); /* The page allocation failed. Revert */ @@ -2981,14 +2970,12 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, memcg_uncharge_kmem(memcg, 1 << order); return; } - pc = lookup_page_cgroup(page); - pc->mem_cgroup = memcg; + page->mem_cgroup = memcg; } void __memcg_kmem_uncharge_pages(struct page *page, int order) { - struct page_cgroup *pc = lookup_page_cgroup(page); - struct mem_cgroup *memcg = pc->mem_cgroup; + struct mem_cgroup *memcg = page->mem_cgroup; if (!memcg) return; @@ -2996,7 +2983,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order) VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); memcg_uncharge_kmem(memcg, 1 << order); - pc->mem_cgroup = NULL; + page->mem_cgroup = NULL; } #else static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) @@ -3014,16 +3001,15 @@ static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) */ void mem_cgroup_split_huge_fixup(struct page *head) { - struct page_cgroup *pc = lookup_page_cgroup(head); int i; if (mem_cgroup_disabled()) return; for (i = 1; i < HPAGE_PMD_NR; i++) - pc[i].mem_cgroup = pc[0].mem_cgroup; + head[i].mem_cgroup = head->mem_cgroup; - __this_cpu_sub(pc[0].mem_cgroup->stat->count[MEM_CGROUP_STAT_RSS_HUGE], + __this_cpu_sub(head->mem_cgroup->stat->count[MEM_CGROUP_STAT_RSS_HUGE], HPAGE_PMD_NR); } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ @@ -3032,7 +3018,6 @@ void mem_cgroup_split_huge_fixup(struct page *head) * mem_cgroup_move_account - move account of the page * @page: the page * @nr_pages: number of regular pages (>1 for huge pages) - * @pc: page_cgroup of the page. * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * @@ -3045,7 +3030,6 @@ void mem_cgroup_split_huge_fixup(struct page *head) */ static int mem_cgroup_move_account(struct page *page, unsigned int nr_pages, - struct page_cgroup *pc, struct mem_cgroup *from, struct mem_cgroup *to) { @@ -3065,7 +3049,7 @@ static int mem_cgroup_move_account(struct page *page, goto out; /* - * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup + * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup * of its source page while we change it: page migration takes * both pages off the LRU, but page cache replacement doesn't. */ @@ -3073,7 +3057,7 @@ static int mem_cgroup_move_account(struct page *page, goto out; ret = -EINVAL; - if (pc->mem_cgroup != from) + if (page->mem_cgroup != from) goto out_unlock; spin_lock_irqsave(&from->move_lock, flags); @@ -3093,13 +3077,13 @@ static int mem_cgroup_move_account(struct page *page, } /* - * It is safe to change pc->mem_cgroup here because the page + * It is safe to change page->mem_cgroup here because the page * is referenced, charged, and isolated - we can't race with * uncharging, charging, migration, or LRU putback. */ /* caller should have done css_get */ - pc->mem_cgroup = to; + page->mem_cgroup = to; spin_unlock_irqrestore(&from->move_lock, flags); ret = 0; @@ -3174,36 +3158,17 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry, #endif #ifdef CONFIG_DEBUG_VM -static struct page_cgroup *lookup_page_cgroup_used(struct page *page) -{ - struct page_cgroup *pc; - - pc = lookup_page_cgroup(page); - /* - * Can be NULL while feeding pages into the page allocator for - * the first time, i.e. during boot or memory hotplug; - * or when mem_cgroup_disabled(). - */ - if (likely(pc) && pc->mem_cgroup) - return pc; - return NULL; -} - bool mem_cgroup_bad_page_check(struct page *page) { if (mem_cgroup_disabled()) return false; - return lookup_page_cgroup_used(page) != NULL; + return page->mem_cgroup != NULL; } void mem_cgroup_print_bad_page(struct page *page) { - struct page_cgroup *pc; - - pc = lookup_page_cgroup_used(page); - if (pc) - pr_alert("pc:%p pc->mem_cgroup:%p\n", pc, pc->mem_cgroup); + pr_alert("page->mem_cgroup:%p\n", page->mem_cgroup); } #endif @@ -5123,7 +5088,6 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, unsigned long addr, pte_t ptent, union mc_target *target) { struct page *page = NULL; - struct page_cgroup *pc; enum mc_target_type ret = MC_TARGET_NONE; swp_entry_t ent = { .val = 0 }; @@ -5137,13 +5101,12 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, if (!page && !ent.val) return ret; if (page) { - pc = lookup_page_cgroup(page); /* * Do only loose check w/o serialization. - * mem_cgroup_move_account() checks the pc is valid or + * mem_cgroup_move_account() checks the page is valid or * not under LRU exclusion. */ - if (pc->mem_cgroup == mc.from) { + if (page->mem_cgroup == mc.from) { ret = MC_TARGET_PAGE; if (target) target->page = page; @@ -5171,15 +5134,13 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, unsigned long addr, pmd_t pmd, union mc_target *target) { struct page *page = NULL; - struct page_cgroup *pc; enum mc_target_type ret = MC_TARGET_NONE; page = pmd_page(pmd); VM_BUG_ON_PAGE(!page || !PageHead(page), page); if (!move_anon()) return ret; - pc = lookup_page_cgroup(page); - if (pc->mem_cgroup == mc.from) { + if (page->mem_cgroup == mc.from) { ret = MC_TARGET_PAGE; if (target) { get_page(page); @@ -5378,7 +5339,6 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, enum mc_target_type target_type; union mc_target target; struct page *page; - struct page_cgroup *pc; /* * We don't take compound_lock() here but no race with splitting thp @@ -5399,9 +5359,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, if (target_type == MC_TARGET_PAGE) { page = target.page; if (!isolate_lru_page(page)) { - pc = lookup_page_cgroup(page); if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, - pc, mc.from, mc.to)) { + mc.from, mc.to)) { mc.precharge -= HPAGE_PMD_NR; mc.moved_charge += HPAGE_PMD_NR; } @@ -5429,9 +5388,7 @@ retry: page = target.page; if (isolate_lru_page(page)) goto put; - pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, 1, pc, - mc.from, mc.to)) { + if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) { mc.precharge--; /* we uncharge from mc.from later. */ mc.moved_charge++; @@ -5619,7 +5576,6 @@ static void __init enable_swap_cgroup(void) void mem_cgroup_swapout(struct page *page, swp_entry_t entry) { struct mem_cgroup *memcg; - struct page_cgroup *pc; unsigned short oldid; VM_BUG_ON_PAGE(PageLRU(page), page); @@ -5628,8 +5584,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) if (!do_swap_account) return; - pc = lookup_page_cgroup(page); - memcg = pc->mem_cgroup; + memcg = page->mem_cgroup; /* Readahead page, never charged */ if (!memcg) @@ -5639,7 +5594,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(oldid, page); mem_cgroup_swap_statistics(memcg, true); - pc->mem_cgroup = NULL; + page->mem_cgroup = NULL; if (!mem_cgroup_is_root(memcg)) page_counter_uncharge(&memcg->memory, 1); @@ -5706,7 +5661,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm, goto out; if (PageSwapCache(page)) { - struct page_cgroup *pc = lookup_page_cgroup(page); /* * Every swap fault against a single page tries to charge the * page, bail as early as possible. shmem_unuse() encounters @@ -5714,7 +5668,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm, * the page lock, which serializes swap cache removal, which * in turn serializes uncharging. */ - if (pc->mem_cgroup) + if (page->mem_cgroup) goto out; } @@ -5867,7 +5821,6 @@ static void uncharge_list(struct list_head *page_list) next = page_list->next; do { unsigned int nr_pages = 1; - struct page_cgroup *pc; page = list_entry(next, struct page, lru); next = page->lru.next; @@ -5875,23 +5828,22 @@ static void uncharge_list(struct list_head *page_list) VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); - pc = lookup_page_cgroup(page); - if (!pc->mem_cgroup) + if (!page->mem_cgroup) continue; /* * Nobody should be changing or seriously looking at - * pc->mem_cgroup at this point, we have fully + * page->mem_cgroup at this point, we have fully * exclusive access to the page. */ - if (memcg != pc->mem_cgroup) { + if (memcg != page->mem_cgroup) { if (memcg) { uncharge_batch(memcg, pgpgout, nr_anon, nr_file, nr_huge, page); pgpgout = nr_anon = nr_file = nr_huge = 0; } - memcg = pc->mem_cgroup; + memcg = page->mem_cgroup; } if (PageTransHuge(page)) { @@ -5905,7 +5857,7 @@ static void uncharge_list(struct list_head *page_list) else nr_file += nr_pages; - pc->mem_cgroup = NULL; + page->mem_cgroup = NULL; pgpgout++; } while (next != page_list); @@ -5924,14 +5876,11 @@ static void uncharge_list(struct list_head *page_list) */ void mem_cgroup_uncharge(struct page *page) { - struct page_cgroup *pc; - if (mem_cgroup_disabled()) return; /* Don't touch page->lru of any random page, pre-check: */ - pc = lookup_page_cgroup(page); - if (!pc->mem_cgroup) + if (!page->mem_cgroup) return; INIT_LIST_HEAD(&page->lru); @@ -5968,7 +5917,6 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, bool lrucare) { struct mem_cgroup *memcg; - struct page_cgroup *pc; int isolated; VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); @@ -5983,8 +5931,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, return; /* Page cache replacement: new page already charged? */ - pc = lookup_page_cgroup(newpage); - if (pc->mem_cgroup) + if (newpage->mem_cgroup) return; /* @@ -5993,15 +5940,14 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, * uncharged page when the PFN walker finds a page that * reclaim just put back on the LRU but has not released yet. */ - pc = lookup_page_cgroup(oldpage); - memcg = pc->mem_cgroup; + memcg = oldpage->mem_cgroup; if (!memcg) return; if (lrucare) lock_page_lru(oldpage, &isolated); - pc->mem_cgroup = NULL; + oldpage->mem_cgroup = NULL; if (lrucare) unlock_page_lru(oldpage, isolated); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 97b6966816e5..22cfdeffbf69 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -48,7 +48,6 @@ #include #include #include -#include #include #include #include @@ -4853,7 +4852,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, #endif init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); - pgdat_page_cgroup_init(pgdat); for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c index 5331c2bd85a2..f0f31c1d4d0c 100644 --- a/mm/page_cgroup.c +++ b/mm/page_cgroup.c @@ -1,326 +1,7 @@ #include -#include -#include -#include #include -#include -#include -#include #include -#include #include -#include - -static unsigned long total_usage; - -#if !defined(CONFIG_SPARSEMEM) - - -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat) -{ - pgdat->node_page_cgroup = NULL; -} - -struct page_cgroup *lookup_page_cgroup(struct page *page) -{ - unsigned long pfn = page_to_pfn(page); - unsigned long offset; - struct page_cgroup *base; - - base = NODE_DATA(page_to_nid(page))->node_page_cgroup; -#ifdef CONFIG_DEBUG_VM - /* - * The sanity checks the page allocator does upon freeing a - * page can reach here before the page_cgroup arrays are - * allocated when feeding a range of pages to the allocator - * for the first time during bootup or memory hotplug. - */ - if (unlikely(!base)) - return NULL; -#endif - offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn; - return base + offset; -} - -static int __init alloc_node_page_cgroup(int nid) -{ - struct page_cgroup *base; - unsigned long table_size; - unsigned long nr_pages; - - nr_pages = NODE_DATA(nid)->node_spanned_pages; - if (!nr_pages) - return 0; - - table_size = sizeof(struct page_cgroup) * nr_pages; - - base = memblock_virt_alloc_try_nid_nopanic( - table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), - BOOTMEM_ALLOC_ACCESSIBLE, nid); - if (!base) - return -ENOMEM; - NODE_DATA(nid)->node_page_cgroup = base; - total_usage += table_size; - return 0; -} - -void __init page_cgroup_init_flatmem(void) -{ - - int nid, fail; - - if (mem_cgroup_disabled()) - return; - - for_each_online_node(nid) { - fail = alloc_node_page_cgroup(nid); - if (fail) - goto fail; - } - printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage); - printk(KERN_INFO "please try 'cgroup_disable=memory' option if you" - " don't want memory cgroups\n"); - return; -fail: - printk(KERN_CRIT "allocation of page_cgroup failed.\n"); - printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n"); - panic("Out of memory"); -} - -#else /* CONFIG_FLAT_NODE_MEM_MAP */ - -struct page_cgroup *lookup_page_cgroup(struct page *page) -{ - unsigned long pfn = page_to_pfn(page); - struct mem_section *section = __pfn_to_section(pfn); -#ifdef CONFIG_DEBUG_VM - /* - * The sanity checks the page allocator does upon freeing a - * page can reach here before the page_cgroup arrays are - * allocated when feeding a range of pages to the allocator - * for the first time during bootup or memory hotplug. - */ - if (!section->page_cgroup) - return NULL; -#endif - return section->page_cgroup + pfn; -} - -static void *__meminit alloc_page_cgroup(size_t size, int nid) -{ - gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN; - void *addr = NULL; - - addr = alloc_pages_exact_nid(nid, size, flags); - if (addr) { - kmemleak_alloc(addr, size, 1, flags); - return addr; - } - - if (node_state(nid, N_HIGH_MEMORY)) - addr = vzalloc_node(size, nid); - else - addr = vzalloc(size); - - return addr; -} - -static int __meminit init_section_page_cgroup(unsigned long pfn, int nid) -{ - struct mem_section *section; - struct page_cgroup *base; - unsigned long table_size; - - section = __pfn_to_section(pfn); - - if (section->page_cgroup) - return 0; - - table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION; - base = alloc_page_cgroup(table_size, nid); - - /* - * The value stored in section->page_cgroup is (base - pfn) - * and it does not point to the memory block allocated above, - * causing kmemleak false positives. - */ - kmemleak_not_leak(base); - - if (!base) { - printk(KERN_ERR "page cgroup allocation failure\n"); - return -ENOMEM; - } - - /* - * The passed "pfn" may not be aligned to SECTION. For the calculation - * we need to apply a mask. - */ - pfn &= PAGE_SECTION_MASK; - section->page_cgroup = base - pfn; - total_usage += table_size; - return 0; -} -#ifdef CONFIG_MEMORY_HOTPLUG -static void free_page_cgroup(void *addr) -{ - if (is_vmalloc_addr(addr)) { - vfree(addr); - } else { - struct page *page = virt_to_page(addr); - size_t table_size = - sizeof(struct page_cgroup) * PAGES_PER_SECTION; - - BUG_ON(PageReserved(page)); - kmemleak_free(addr); - free_pages_exact(addr, table_size); - } -} - -static void __free_page_cgroup(unsigned long pfn) -{ - struct mem_section *ms; - struct page_cgroup *base; - - ms = __pfn_to_section(pfn); - if (!ms || !ms->page_cgroup) - return; - base = ms->page_cgroup + pfn; - free_page_cgroup(base); - ms->page_cgroup = NULL; -} - -static int __meminit online_page_cgroup(unsigned long start_pfn, - unsigned long nr_pages, - int nid) -{ - unsigned long start, end, pfn; - int fail = 0; - - start = SECTION_ALIGN_DOWN(start_pfn); - end = SECTION_ALIGN_UP(start_pfn + nr_pages); - - if (nid == -1) { - /* - * In this case, "nid" already exists and contains valid memory. - * "start_pfn" passed to us is a pfn which is an arg for - * online__pages(), and start_pfn should exist. - */ - nid = pfn_to_nid(start_pfn); - VM_BUG_ON(!node_state(nid, N_ONLINE)); - } - - for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) { - if (!pfn_present(pfn)) - continue; - fail = init_section_page_cgroup(pfn, nid); - } - if (!fail) - return 0; - - /* rollback */ - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) - __free_page_cgroup(pfn); - - return -ENOMEM; -} - -static int __meminit offline_page_cgroup(unsigned long start_pfn, - unsigned long nr_pages, int nid) -{ - unsigned long start, end, pfn; - - start = SECTION_ALIGN_DOWN(start_pfn); - end = SECTION_ALIGN_UP(start_pfn + nr_pages); - - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) - __free_page_cgroup(pfn); - return 0; - -} - -static int __meminit page_cgroup_callback(struct notifier_block *self, - unsigned long action, void *arg) -{ - struct memory_notify *mn = arg; - int ret = 0; - switch (action) { - case MEM_GOING_ONLINE: - ret = online_page_cgroup(mn->start_pfn, - mn->nr_pages, mn->status_change_nid); - break; - case MEM_OFFLINE: - offline_page_cgroup(mn->start_pfn, - mn->nr_pages, mn->status_change_nid); - break; - case MEM_CANCEL_ONLINE: - offline_page_cgroup(mn->start_pfn, - mn->nr_pages, mn->status_change_nid); - break; - case MEM_GOING_OFFLINE: - break; - case MEM_ONLINE: - case MEM_CANCEL_OFFLINE: - break; - } - - return notifier_from_errno(ret); -} - -#endif - -void __init page_cgroup_init(void) -{ - unsigned long pfn; - int nid; - - if (mem_cgroup_disabled()) - return; - - for_each_node_state(nid, N_MEMORY) { - unsigned long start_pfn, end_pfn; - - start_pfn = node_start_pfn(nid); - end_pfn = node_end_pfn(nid); - /* - * start_pfn and end_pfn may not be aligned to SECTION and the - * page->flags of out of node pages are not initialized. So we - * scan [start_pfn, the biggest section's pfn < end_pfn) here. - */ - for (pfn = start_pfn; - pfn < end_pfn; - pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) { - - if (!pfn_valid(pfn)) - continue; - /* - * Nodes's pfns can be overlapping. - * We know some arch can have a nodes layout such as - * -------------pfn--------------> - * N0 | N1 | N2 | N0 | N1 | N2|.... - */ - if (pfn_to_nid(pfn) != nid) - continue; - if (init_section_page_cgroup(pfn, nid)) - goto oom; - } - } - hotplug_memory_notifier(page_cgroup_callback, 0); - printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage); - printk(KERN_INFO "please try 'cgroup_disable=memory' option if you " - "don't want memory cgroups\n"); - return; -oom: - printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n"); - panic("Out of memory"); -} - -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat) -{ - return; -} - -#endif - #ifdef CONFIG_MEMCG_SWAP -- cgit v1.2.3-59-g8ed1b From 9e3961a0979817c612b10b2da4f3045ec9faa779 Mon Sep 17 00:00:00 2001 From: Prarit Bhargava Date: Wed, 10 Dec 2014 15:45:50 -0800 Subject: kernel: add panic_on_warn There have been several times where I have had to rebuild a kernel to cause a panic when hitting a WARN() in the code in order to get a crash dump from a system. Sometimes this is easy to do, other times (such as in the case of a remote admin) it is not trivial to send new images to the user. A much easier method would be a switch to change the WARN() over to a panic. This makes debugging easier in that I can now test the actual image the WARN() was seen on and I do not have to engage in remote debugging. This patch adds a panic_on_warn kernel parameter and /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common() path. The function will still print out the location of the warning. An example of the panic_on_warn output: The first line below is from the WARN_ON() to output the WARN_ON()'s location. After that the panic() output is displayed. WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]() Kernel panic - not syncing: panic_on_warn set ... CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df Call Trace: [] dump_stack+0x46/0x58 [] panic+0xd0/0x204 [] ? init_dummy+0x1f/0x30 [dummy_module] [] warn_slowpath_common+0xd0/0xd0 [] ? dummy_greetings+0x40/0x40 [dummy_module] [] warn_slowpath_null+0x1a/0x20 [] init_dummy+0x1f/0x30 [dummy_module] [] do_one_initcall+0xd4/0x210 [] ? __vunmap+0xc2/0x110 [] load_module+0x16a9/0x1b30 [] ? store_uevent+0x70/0x70 [] ? copy_module_from_fd.isra.44+0x129/0x180 [] SyS_finit_module+0xa6/0xd0 [] system_call_fastpath+0x12/0x17 Successfully tested by me. hpa said: There is another very valid use for this: many operators would rather a machine shuts down than being potentially compromised either functionally or security-wise. Signed-off-by: Prarit Bhargava Cc: Jonathan Corbet Cc: Rusty Russell Cc: "H. Peter Anvin" Cc: Andi Kleen Cc: Masami Hiramatsu Acked-by: Yasuaki Ishimatsu Cc: Fabian Frederick Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kdump/kdump.txt | 7 +++++++ Documentation/kernel-parameters.txt | 3 +++ Documentation/sysctl/kernel.txt | 40 ++++++++++++++++++++++++------------- include/linux/kernel.h | 1 + include/uapi/linux/sysctl.h | 1 + kernel/panic.c | 13 ++++++++++++ kernel/sysctl.c | 9 +++++++++ kernel/sysctl_binary.c | 1 + 8 files changed, 61 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 6c0b9f27e465..bc4bd5a44b88 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL: http://people.redhat.com/~anderson/ +Trigger Kdump on WARN() +======================= + +The kernel parameter, panic_on_warn, calls panic() in all WARN() paths. This +will cause a kdump to occur at the panic() call. In cases where a user wants +to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1 +to achieve the same behaviour. Contact ======= diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 838f3776c924..d6eb3636fe5a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted. timeout < 0: reboot immediately Format: + panic_on_warn panic() instead of WARN(). Useful to cause kdump + on a WARN(). + crash_kexec_post_notifiers Run kdump after running panic-notifiers and dumping kmsg. This only for the users who doubt kdump always diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 57baff5bdb80..b5d0c8501a18 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -54,8 +54,9 @@ show up in /proc/sys/kernel: - overflowuid - panic - panic_on_oops -- panic_on_unrecovered_nmi - panic_on_stackoverflow +- panic_on_unrecovered_nmi +- panic_on_warn - pid_max - powersave-nap [ PPC only ] - printk @@ -527,19 +528,6 @@ the recommended setting is 60. ============================================================== -panic_on_unrecovered_nmi: - -The default Linux behaviour on an NMI of either memory or unknown is -to continue operation. For many environments such as scientific -computing it is preferable that the box is taken out and the error -dealt with than an uncorrected parity/ECC error get propagated. - -A small number of systems do generate NMI's for bizarre random reasons -such as power management so the default is off. That sysctl works like -the existing panic controls already in that directory. - -============================================================== - panic_on_oops: Controls the kernel's behaviour when an oops or BUG is encountered. @@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. ============================================================== +panic_on_unrecovered_nmi: + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMI's for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. + +============================================================== + +panic_on_warn: + +Calls panic() in the WARN() path when set to 1. This is useful to avoid +a kernel rebuild when attempting to kdump at the location of a WARN(). + +0: only WARN(), default behaviour. + +1: call panic() after printing out WARN() location. + +============================================================== + perf_cpu_time_max_percent: Hints to the kernel how much CPU time it should be allowed to diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 446d76a87ba1..233ea8107038 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -427,6 +427,7 @@ extern int panic_timeout; extern int panic_on_oops; extern int panic_on_unrecovered_nmi; extern int panic_on_io_nmi; +extern int panic_on_warn; extern int sysctl_panic_on_stackoverflow; /* * Only to be used by arch init code. If the user over-wrote the default diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 43aaba1cc037..0956373b56db 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -153,6 +153,7 @@ enum KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */ KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */ KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */ + KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */ }; diff --git a/kernel/panic.c b/kernel/panic.c index cf80672b7924..4d8d6f906dec 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -33,6 +33,7 @@ static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); static bool crash_kexec_post_notifiers; +int panic_on_warn __read_mostly; int panic_timeout = CONFIG_PANIC_TIMEOUT; EXPORT_SYMBOL_GPL(panic_timeout); @@ -428,6 +429,17 @@ static void warn_slowpath_common(const char *file, int line, void *caller, if (args) vprintk(args->fmt, args->args); + if (panic_on_warn) { + /* + * This thread may hit another WARN() in the panic path. + * Resetting this prevents additional WARN() from panicking the + * system on this thread. Other threads are blocked by the + * panic_mutex in panic(). + */ + panic_on_warn = 0; + panic("panic_on_warn set ...\n"); + } + print_modules(); dump_stack(); print_oops_end_marker(); @@ -485,6 +497,7 @@ EXPORT_SYMBOL(__stack_chk_fail); core_param(panic, panic_timeout, int, 0644); core_param(pause_on_oops, pause_on_oops, int, 0644); +core_param(panic_on_warn, panic_on_warn, int, 0644); static int __init setup_crash_kexec_post_notifiers(char *s) { diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 15f2511a1b7c..7c54ff79afd7 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif + { + .procname = "panic_on_warn", + .data = &panic_on_warn, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, { } }; diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 9a4f750a2963..7e7746a42a62 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = { { CTL_INT, KERN_COMPAT_LOG, "compat-log" }, { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" }, { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" }, + { CTL_INT, KERN_PANIC_ON_WARN, "panic_on_warn" }, {} }; -- cgit v1.2.3-59-g8ed1b From 222a12fca6048249d9007f2a4c5fbcea532e8522 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 10 Dec 2014 15:53:13 -0800 Subject: rtc: omap: add support for pmic_power_en Add new property "ti,system-power-controller" to register the RTC as a power-off handler. Some RTC IP revisions can control an external PMIC via the pmic_power_en pin, which can be configured to transition to OFF on ALARM2 events and back to ON on subsequent ALARM (wakealarm) events. This is based on earlier work by Colin Foe-Parker and AnilKumar Ch. [1] [1] https://www.mail-archive.com/linux-omap@vger.kernel.org/msg82127.html [akpm@linux-foundation.org: add comment] Signed-off-by: Johan Hovold Cc: Colin Foe-Parker Cc: AnilKumar Ch Cc: Alessandro Zummo Cc: Tony Lindgren Cc: Benot Cousson Cc: Lokesh Vutla Cc: Guenter Roeck Cc: Sekhar Nori Cc: Tero Kristo Cc: Keerthy J Tested-by: Felipe Balbi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/devicetree/bindings/rtc/rtc-omap.txt | 9 +- drivers/rtc/rtc-omap.c | 100 +++++++++++++++++++++ 2 files changed, 108 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/rtc/rtc-omap.txt b/Documentation/devicetree/bindings/rtc/rtc-omap.txt index 5a0f02d34d95..750efd40c72e 100644 --- a/Documentation/devicetree/bindings/rtc/rtc-omap.txt +++ b/Documentation/devicetree/bindings/rtc/rtc-omap.txt @@ -5,11 +5,17 @@ Required properties: - "ti,da830-rtc" - for RTC IP used similar to that on DA8xx SoC family. - "ti,am3352-rtc" - for RTC IP used similar to that on AM335x SoC family. This RTC IP has special WAKE-EN Register to enable - Wakeup generation for event Alarm. + Wakeup generation for event Alarm. It can also be + used to control an external PMIC via the + pmic_power_en pin. - reg: Address range of rtc register set - interrupts: rtc timer, alarm interrupts in order - interrupt-parent: phandle for the interrupt controller +Optional properties: +- ti,system-power-controller: whether the rtc is controlling the system power + through pmic_power_en + Example: rtc@1c23000 { @@ -18,4 +24,5 @@ rtc@1c23000 { interrupts = <19 19>; interrupt-parent = <&intc>; + ti,system-power-controller; }; diff --git a/drivers/rtc/rtc-omap.c b/drivers/rtc/rtc-omap.c index c508e45ca3ce..e83f51ae7f63 100644 --- a/drivers/rtc/rtc-omap.c +++ b/drivers/rtc/rtc-omap.c @@ -68,6 +68,15 @@ #define OMAP_RTC_IRQWAKEEN 0x7c +#define OMAP_RTC_ALARM2_SECONDS_REG 0x80 +#define OMAP_RTC_ALARM2_MINUTES_REG 0x84 +#define OMAP_RTC_ALARM2_HOURS_REG 0x88 +#define OMAP_RTC_ALARM2_DAYS_REG 0x8c +#define OMAP_RTC_ALARM2_MONTHS_REG 0x90 +#define OMAP_RTC_ALARM2_YEARS_REG 0x94 + +#define OMAP_RTC_PMIC_REG 0x98 + /* OMAP_RTC_CTRL_REG bit fields: */ #define OMAP_RTC_CTRL_SPLIT BIT(7) #define OMAP_RTC_CTRL_DISABLE BIT(6) @@ -80,6 +89,7 @@ /* OMAP_RTC_STATUS_REG bit fields: */ #define OMAP_RTC_STATUS_POWER_UP BIT(7) +#define OMAP_RTC_STATUS_ALARM2 BIT(7) #define OMAP_RTC_STATUS_ALARM BIT(6) #define OMAP_RTC_STATUS_1D_EVENT BIT(5) #define OMAP_RTC_STATUS_1H_EVENT BIT(4) @@ -89,6 +99,7 @@ #define OMAP_RTC_STATUS_BUSY BIT(0) /* OMAP_RTC_INTERRUPTS_REG bit fields: */ +#define OMAP_RTC_INTERRUPTS_IT_ALARM2 BIT(4) #define OMAP_RTC_INTERRUPTS_IT_ALARM BIT(3) #define OMAP_RTC_INTERRUPTS_IT_TIMER BIT(2) @@ -98,6 +109,9 @@ /* OMAP_RTC_IRQWAKEEN bit fields: */ #define OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN BIT(1) +/* OMAP_RTC_PMIC bit fields: */ +#define OMAP_RTC_PMIC_POWER_EN_EN BIT(16) + /* OMAP_RTC_KICKER values */ #define KICK0_VALUE 0x83e70b13 #define KICK1_VALUE 0x95a4f1e0 @@ -106,6 +120,7 @@ struct omap_rtc_device_type { bool has_32kclk_en; bool has_kicker; bool has_irqwakeen; + bool has_pmic_mode; bool has_power_up_reset; }; @@ -115,6 +130,7 @@ struct omap_rtc { int irq_alarm; int irq_timer; u8 interrupts_reg; + bool is_pmic_controller; const struct omap_rtc_device_type *type; }; @@ -345,6 +361,70 @@ static int omap_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm) return 0; } +static struct omap_rtc *omap_rtc_power_off_rtc; + +/* + * omap_rtc_poweroff: RTC-controlled power off + * + * The RTC can be used to control an external PMIC via the pmic_power_en pin, + * which can be configured to transition to OFF on ALARM2 events. + * + * Notes: + * The two-second alarm offset is the shortest offset possible as the alarm + * registers must be set before the next timer update and the offset + * calculation is too heavy for everything to be done within a single access + * period (~15 us). + * + * Called with local interrupts disabled. + */ +static void omap_rtc_power_off(void) +{ + struct omap_rtc *rtc = omap_rtc_power_off_rtc; + struct rtc_time tm; + unsigned long now; + u32 val; + + /* enable pmic_power_en control */ + val = rtc_readl(rtc, OMAP_RTC_PMIC_REG); + rtc_writel(rtc, OMAP_RTC_PMIC_REG, val | OMAP_RTC_PMIC_POWER_EN_EN); + + /* set alarm two seconds from now */ + omap_rtc_read_time_raw(rtc, &tm); + bcd2tm(&tm); + rtc_tm_to_time(&tm, &now); + rtc_time_to_tm(now + 2, &tm); + + if (tm2bcd(&tm) < 0) { + dev_err(&rtc->rtc->dev, "power off failed\n"); + return; + } + + rtc_wait_not_busy(rtc); + + rtc_write(rtc, OMAP_RTC_ALARM2_SECONDS_REG, tm.tm_sec); + rtc_write(rtc, OMAP_RTC_ALARM2_MINUTES_REG, tm.tm_min); + rtc_write(rtc, OMAP_RTC_ALARM2_HOURS_REG, tm.tm_hour); + rtc_write(rtc, OMAP_RTC_ALARM2_DAYS_REG, tm.tm_mday); + rtc_write(rtc, OMAP_RTC_ALARM2_MONTHS_REG, tm.tm_mon); + rtc_write(rtc, OMAP_RTC_ALARM2_YEARS_REG, tm.tm_year); + + /* + * enable ALARM2 interrupt + * + * NOTE: this fails on AM3352 if rtc_write (writeb) is used + */ + val = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); + rtc_writel(rtc, OMAP_RTC_INTERRUPTS_REG, + val | OMAP_RTC_INTERRUPTS_IT_ALARM2); + + /* + * Wait for alarm to trigger (within two seconds) and external PMIC to + * power off the system. Add a 500 ms margin for external latencies + * (e.g. debounce circuits). + */ + mdelay(2500); +} + static struct rtc_class_ops omap_rtc_ops = { .read_time = omap_rtc_read_time, .set_time = omap_rtc_set_time, @@ -361,6 +441,7 @@ static const struct omap_rtc_device_type omap_rtc_am3352_type = { .has_32kclk_en = true, .has_kicker = true, .has_irqwakeen = true, + .has_pmic_mode = true, }; static const struct omap_rtc_device_type omap_rtc_da830_type = { @@ -412,6 +493,9 @@ static int __init omap_rtc_probe(struct platform_device *pdev) of_id = of_match_device(omap_rtc_of_match, &pdev->dev); if (of_id) { rtc->type = of_id->data; + rtc->is_pmic_controller = rtc->type->has_pmic_mode && + of_property_read_bool(pdev->dev.of_node, + "ti,system-power-controller"); } else { id_entry = platform_get_device_id(pdev); rtc->type = (void *)id_entry->driver_data; @@ -460,6 +544,9 @@ static int __init omap_rtc_probe(struct platform_device *pdev) mask = OMAP_RTC_STATUS_ALARM; + if (rtc->type->has_pmic_mode) + mask |= OMAP_RTC_STATUS_ALARM2; + if (rtc->type->has_power_up_reset) { mask |= OMAP_RTC_STATUS_POWER_UP; if (reg & OMAP_RTC_STATUS_POWER_UP) @@ -520,6 +607,13 @@ static int __init omap_rtc_probe(struct platform_device *pdev) goto err; } + if (rtc->is_pmic_controller) { + if (!pm_power_off) { + omap_rtc_power_off_rtc = rtc; + pm_power_off = omap_rtc_power_off; + } + } + return 0; err: @@ -536,6 +630,12 @@ static int __exit omap_rtc_remove(struct platform_device *pdev) { struct omap_rtc *rtc = platform_get_drvdata(pdev); + if (pm_power_off == omap_rtc_power_off && + omap_rtc_power_off_rtc == rtc) { + pm_power_off = NULL; + omap_rtc_power_off_rtc = NULL; + } + device_init_wakeup(&pdev->dev, 0); /* leave rtc running, but disable irqs */ -- cgit v1.2.3-59-g8ed1b From dd01a1c526d200b25b421460550f67d5a9c6874b Mon Sep 17 00:00:00 2001 From: Tomas Novotny Date: Wed, 10 Dec 2014 15:53:59 -0800 Subject: of: add vendor prefix for Pericom Technology Signed-off-by: Tomas Novotny Cc: Alessandro Zummo Cc: Grant Likely Cc: Rob Herring Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/devicetree/bindings/vendor-prefixes.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index 0d354625299c..2417cb0b493b 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -115,6 +115,7 @@ nxp NXP Semiconductors onnn ON Semiconductor Corp. opencores OpenCores.org panasonic Panasonic Corporation +pericom Pericom Technology Inc. phytec PHYTEC Messtechnik GmbH picochip Picochip Ltd plathome Plat'Home Co., Ltd. -- cgit v1.2.3-59-g8ed1b From 094d3ee3ce8c30756446518cc1895b6bbbca12ef Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 10 Dec 2014 15:54:14 -0800 Subject: rtc: omap: drop vendor-prefix from power-controller dt property Drop the vendor-prefix from the "ti,system-power-controller" device-tree property name. It has been agreed to make "system-power-controller" a standard property and to drop the vendor-prefix that is currently used by several drivers. Note that drivers that have used ",system-power-controller" in a released kernel will need to support both versions. Signed-off-by: Johan Hovold Cc: Tony Lindgren Cc: Benot Cousson Cc: Alessandro Zummo Cc: Felipe Balbi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/devicetree/bindings/rtc/rtc-omap.txt | 4 ++-- arch/arm/boot/dts/am335x-boneblack.dts | 2 +- drivers/rtc/rtc-omap.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/rtc/rtc-omap.txt b/Documentation/devicetree/bindings/rtc/rtc-omap.txt index 750efd40c72e..4ba4dbd34289 100644 --- a/Documentation/devicetree/bindings/rtc/rtc-omap.txt +++ b/Documentation/devicetree/bindings/rtc/rtc-omap.txt @@ -13,7 +13,7 @@ Required properties: - interrupt-parent: phandle for the interrupt controller Optional properties: -- ti,system-power-controller: whether the rtc is controlling the system power +- system-power-controller: whether the rtc is controlling the system power through pmic_power_en Example: @@ -24,5 +24,5 @@ rtc@1c23000 { interrupts = <19 19>; interrupt-parent = <&intc>; - ti,system-power-controller; + system-power-controller; }; diff --git a/arch/arm/boot/dts/am335x-boneblack.dts b/arch/arm/boot/dts/am335x-boneblack.dts index 520b63605eb6..5c42d259fa68 100644 --- a/arch/arm/boot/dts/am335x-boneblack.dts +++ b/arch/arm/boot/dts/am335x-boneblack.dts @@ -82,5 +82,5 @@ }; &rtc { - ti,system-power-controller; + system-power-controller; }; diff --git a/drivers/rtc/rtc-omap.c b/drivers/rtc/rtc-omap.c index fba13e53c278..4f1c6ca97211 100644 --- a/drivers/rtc/rtc-omap.c +++ b/drivers/rtc/rtc-omap.c @@ -502,7 +502,7 @@ static int __init omap_rtc_probe(struct platform_device *pdev) rtc->type = of_id->data; rtc->is_pmic_controller = rtc->type->has_pmic_mode && of_property_read_bool(pdev->dev.of_node, - "ti,system-power-controller"); + "system-power-controller"); } else { id_entry = platform_get_device_id(pdev); rtc->type = (void *)id_entry->driver_data; -- cgit v1.2.3-59-g8ed1b