memcg: reclaim memory from nodes in round-robin order

Presently, memory cgroup's direct reclaim frees memory from the current node, which causes problems. When a set of threads works cooperatively, they tend to operate on the same node, so if they hit the memcg limit they reclaim memory from themselves and damage their active working set.

For example, assume a 2-node system with Node 0 and Node 1, and a memcg with a 1G limit. After some work, file cache remains and the usage is

  Node 0: 1M
  Node 1: 998M

If an application then runs on Node 0, it will eat its own foot before freeing the unnecessary file cache.

This patch adds round-robin node selection for NUMA and applies equal pressure to each node. When using cpuset's spread memory feature, this will work very well.

But yes, a better algorithm is needed.

[akpm@linux-foundation.org: comment editing]
[kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Ying Han <yinghan@google.com> 2011-05-26 16:25:33 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2011-05-26 17:12:35 -0700
commit: 889976dbcb1218119fdd950fb7819084e37d7d37 (patch)
tree: 7508706ddb6bcbe0f673aca3744f30f281b17734 /mm/vmscan.c
parent: MAINTAINERS: add mm/page_cgroup.c into memcg subsystem (diff)
1 files changed, 9 insertions, 1 deletions
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 884ae08c16cc..b0875871820d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2226,6 +2226,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	int nid;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2242,7 +2243,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.gfp_mask = sc.gfp_mask,
 	};
 
-	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+	/*
+	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
+	 * take care of from where we get pages. So the node where we start the
+	 * scan does not need to be the current node.
+	 */
+	nid = mem_cgroup_select_victim_node(mem_cont);
+
+	zonelist = NODE_DATA(nid)->node_zonelists;
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
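
The hunk above only shows the call site; the body of mem_cgroup_select_victim_node() is not part of this diff. As a rough illustration of the round-robin idea described in the commit message, here is a minimal userspace sketch in C. Everything in it is an assumption made for illustration: struct group_state, MAX_NODES, node_has_pages and select_victim_node() are hypothetical names, not the kernel's data structures or its actual selection logic.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8	/* hypothetical upper bound on NUMA nodes */

/* Hypothetical per-group state; NOT the kernel's struct mem_cgroup. */
struct group_state {
	bool node_has_pages[MAX_NODES];	/* nodes that hold pages of this group */
	int last_scanned_node;		/* -1 before the first reclaim pass */
};

/*
 * Pick the node to start reclaim from: the next node after the one
 * scanned last time that holds pages, wrapping around at MAX_NODES.
 * Falls back to current_node when no node holds any pages.
 */
static int select_victim_node(struct group_state *g, int current_node)
{
	int i;

	assert(current_node >= 0 && current_node < MAX_NODES);

	for (i = 1; i <= MAX_NODES; i++) {
		int nid = (g->last_scanned_node + i) % MAX_NODES;

		if (g->node_has_pages[nid]) {
			g->last_scanned_node = nid;
			return nid;
		}
	}
	return current_node;
}
```

In this sketch, a group with pages on Node 0 and Node 1 gets 0, 1, 0, ... from successive calls, whichever node the caller runs on. Applied to the commit message's example, reclaim would start on Node 1, where the 998M of file cache sits, on every second pass instead of always starting on Node 0.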