aboutsummaryrefslogtreecommitdiffstats
path: root/mm (follow)
AgeCommit message (Collapse)AuthorFilesLines
2009-09-22page-allocator: maintain rolling count of pages to free from the PCPMel Gorman1-9/+15
When round-robin freeing pages from the PCP lists, empty lists may be encountered. In the event one of the lists has more pages than another, there may be numerous checks for list_empty() which is undesirable. This patch maintains a count of pages to free which is incremented when empty lists are encountered. The intention is that more pages will then be freed from fuller lists than the empty ones reducing the number of empty list checks in the free path. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22page-allocator: split per-cpu list into one-list-per-migrate-typeMel Gorman1-47/+59
The following two patches remove searching in the page allocator fast-path by maintaining multiple free-lists in the per-cpu structure. At the time the search was introduced, increasing the per-cpu structures would waste a lot of memory as per-cpu structures were statically allocated at compile-time. This is no longer the case. The patches are as follows. They are based on mmotm-2009-08-27. Patch 1 adds multiple lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. Patch 2 notes that the pcpu drain path check empty lists multiple times. The patch reduces the number of checks by maintaining a count of free lists encountered. Lists containing pages will then free multiple pages in batch The patches were tested with kernbench, netperf udp/tcp, hackbench and sysbench. The netperf tests were not bound to any CPU in particular and were run such that the results should be 99% confidence that the reported results are within 1% of the estimated mean. sysbench was run with a postgres background and read-only tests. Similar to netperf, it was run multiple times so that it's 99% confidence results are within 1%. The patches were tested on x86, x86-64 and ppc64 as x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.34% to 2.28% gain netperf-tcp - 0.45% to 1.22% gain hackbench - Small variances, very close to noise sysbench - Very small gains x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.83% to 10.42% gains netperf-tcp - No conclusive until buffer >= PAGE_SIZE 4096 +15.83% 8192 + 0.34% (not significant) 16384 + 1% hackbench - Small gains, very close to noise sysbench - 0.79% to 1.6% gain ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation) kernbench - No significant difference, variance well within noise netperf-udp - 2-3% gain for almost all buffer sizes tested netperf-tcp - losses on small buffers, gains on larger buffers possibly indicates some bad caching effect. hackbench - No significant difference sysbench - 2-4% gain This patch: Currently the per-cpu page allocator searches the PCP list for pages of the correct migrate-type to reduce the possibility of pages being inappropriate placed from a fragmentation perspective. This search is potentially expensive in a fast-path and undesirable. Splitting the per-cpu list into multiple lists increases the size of a per-cpu structure and this was potentially a major problem at the time the search was introduced. These problem has been mitigated as now only the necessary number of structures is allocated for the running system. This patch replaces a list search in the per-cpu allocator with one list per migrate type. The potential snag with this approach is when bulk freeing pages. We round-robin free pages based on migrate type which has little bearing on the cache hotness of the page and potentially checks empty lists repeatedly in the event the majority of PCP pages are of one type. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22oom: oom_kill doesn't kill vfork parent (or child)KOSAKI Motohiro1-16/+1
Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, "No available memory (MPOL_BIND)"); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22oom: make oom_score to per-process valueKOSAKI Motohiro1-6/+29
oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22oom: move oom_adj value from task_struct to signal_structKOSAKI Motohiro1-19/+15
Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm/vmscan: remove page_queue_congested() commentVincent Li1-1/+0
Commit 084f71ae5c(kill page_queue_congested()) removed page_queue_congested(). Remove the page_queue_congested() comment in vmscan pageout() too. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: do batched scans for mem_cgroupWu Fengguang2-10/+12
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in which case shrink_list() _still_ calls isolate_pages() with the much larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan rate by up to 32 times. For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4. So when shrink_zone() expects to scan 4 pages in the active/inactive list, the active list will be scanned 4 pages, while the inactive list will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break the balance between the two lists. It can further impact the scan of anon active list, due to the anon active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone(): inactive anon list over scanned => inactive_anon_is_low() == TRUE => shrink_active_list() => active anon list over scanned So the end result may be - anon inactive => over scanned - anon active => over scanned (maybe not as much) - file inactive => over scanned - file active => under scanned (relatively) The accesses to nr_saved_scan are not lock protected and so not 100% accurate, however we can tolerate small errors and the resulted small imbalanced scan rates between zones. Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm/vmscan: rename zone_nr_pages() to zone_nr_lru_pages()Vincent Li1-7/+7
The name `zone_nr_pages' can be mis-read as zone's (total) number pages, but it actually returns zone's LRU list number pages. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: also use alloc_large_system_hash() for the PID hash tableJan Beulich1-3/+10
This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: replace various uses of num_physpages by totalram_pagesJan Beulich3-4/+4
Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22memory hotplug: fix updating of num_physpages for hot plugged memoryJan Beulich1-2/+4
Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. In line with that, the memory hotplug code should update num_physpages in a way that it retains its original (post-boot) meaning; in particular, decreasing the value should at best be done with great care - this patch doesn't try to ever decrease this value at all as it doesn't really seem meaningful to do so. Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zoneMel Gorman1-1/+11
After anti-fragmentation was merged, a bug was reported whereby devices that depended on high-order atomic allocations were failing. The solution was to preserve a property in the buddy allocator which tended to keep the minimum number of free pages in the zone at the lower physical addresses and contiguous. To preserve this property, MIGRATE_RESERVE was introduced and a number of pageblocks at the start of a zone would be marked "reserve", the number of which depended on min_free_kbytes. Anti-fragmentation works by avoiding the mixing of page migratetypes within the same pageblock. One way of helping this is to increase min_free_kbytes because it becomes less like that it will be necessary to place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept there in large contiguous blocks instead of helping anti-fragmentation as much as it should. With the page-allocator tracepoint patches applied, it was found during anti-fragmentation tests that the number of fragmentation-related events were far higher than expected even with min_free_kbytes at higher values. This patch limits the number of MIGRATE_RESERVE blocks that exist per zone to two. For example, with a sufficient min_free_kbytes, 4MB of memory will be kept aside on an x86-64 and remain more or less free and contiguous for the systems uptime. This should be sufficient for devices depending on high-order atomic allocations while helping fragmentation control when min_free_kbytes is tuned appropriately. As side-effect of this patch is that the reserve variable is converted to int as unsigned long was the wrong type to use when ensuring that only the required number of reserve blocks are created. With the patches applied, fragmentation-related events as measured by the page allocator tracepoints were significantly reduced when running some fragmentation stress-tests on systems with min_free_kbytes tuned to a value appropriate for hugepage allocations at runtime. On x86, the events recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by 99.83%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: document is_page_cache_freeable()Johannes Weiner1-0/+5
Enlighten the reader of this code about what reference count makes a page cache page freeable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: return boolean from page_has_private()Johannes Weiner2-2/+2
Make page_has_private() return a true boolean value and remove the double negations from the two callsites using it for arithmetic. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: return boolean from page_is_file_cache()Johannes Weiner3-5/+5
page_is_file_cache() has been used for both boolean checks and LRU arithmetic, which was always a bit weird. Now that page_lru_base_type() exists for LRU arithmetic, make page_is_file_cache() a real predicate function and adjust the boolean-using callsites to drop those pesky double negations. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: introduce page_lru_base_type()Johannes Weiner2-5/+5
Instead of abusing page_is_file_cache() for LRU list index arithmetic, add another helper with a more appropriate name and convert the non-boolean users of page_is_file_cache() accordingly. This new helper gives the LRU base type a page is supposed to live on, inactive anon or inactive file. [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: drop unneeded double negationsJohannes Weiner3-6/+6
Remove double negations where the operand is already boolean. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: remove broken 'kzalloc' mempoolSage Weil1-7/+0
The kzalloc mempool zeros items when they are initially allocated, but does not rezero used items that are returned to the pool. Consequently mempool_alloc()s may return non-zeroed memory. Since there are/were only two in-tree users for mempool_create_kzalloc_pool(), and 'fixing' this in a way that will re-zero used (but not new) items before first use is non-trivial, just remove it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: includecheck fix for mm/nommu.cJaswinder Singh Rajput1-2/+0
Fix the following 'make includecheck' warning: mm/nommu.c: internal.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: includecheck fix for mm/shmem.cJaswinder Singh Rajput1-1/+0
Fix the following 'make includecheck' warning: mm/shmem.c: linux/vfs.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: add_to_swap_cache() does not return -EEXISTDaisuke Nishimura2-37/+40
After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"), only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare() or get_swap_page() would call add_to_swap_cache(). So add_to_swap_cache() doesn't return -EEXIST any more. Even though it doesn't return -EEXIST, it's not good behavior conceptually to call swapcache_prepare() in the -EEXIST case, because it means clearing SWAP_HAS_CACHE flag while the entry is on swap cache. This patch removes redundant codes and comments from callers of it, and adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: add_to_swap_cache() must not sleepDaisuke Nishimura1-24/+46
After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"), read_swap_cache_async() will busy-wait while a entry doesn't exist in swap cache but it has SWAP_HAS_CACHE flag. Such entries can exist on add/delete path of swap cache. On add path, add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is cleared) soon after __delete_from_swap_cache() is called. So, the busy-wait works well in most cases. But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and read_swap_cache_async() tries to swap-in the same entry on the same cpu. This patch calls radix_tree_preload() before swapcache_prepare() and divides add_to_swap_cache() into two part: radix_tree_preload() part and radix_tree_insert() part(define it as __add_to_swap_cache()). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22tracing, page-allocator: add trace event for page traffic related to the buddy listsMel Gorman1-0/+3
The page allocation trace event reports that a page was successfully allocated but it does not specify where it came from. When analysing performance, it can be important to distinguish between pages coming from the per-cpu allocator and pages coming from the buddy lists as the latter requires the zone lock to the taken and more data structures to be examined. This patch adds a trace event for __rmqueue reporting when a page is being allocated from the buddy lists. It distinguishes between being called to refill the per-cpu lists or whether it is a high-order allocation. Similarly, this patch adds an event to catch when the PCP lists are being drained a little and pages are going back to the buddy lists. This is trickier to draw conclusions from but high activity on those events could explain why there were a large number of cache misses on a page-allocator-intensive workload. The coalescing and splitting of buddies involves a lot of writing of page metadata and cache line bounces not to mention the acquisition of an interrupt-safe lock necessary to enter this path. [akpm@linux-foundation.org: fix build] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypesMel Gorman1-0/+4
Fragmentation avoidance depends on being able to use free pages from lists of the appropriate migrate type. In the event this is not possible, __rmqueue_fallback() selects a different list and in some circumstances change the migratetype of the pageblock. Simplistically, the more times this event occurs, the more likely that fragmentation will be a problem later for hugepage allocation at least but there are other considerations such as the order of page being split to satisfy the allocation. This patch adds a trace event for __rmqueue_fallback() that reports what page is being used for the fallback, the orders of relevant pages, the desired migratetype and the migratetype of the lists being used, whether the pageblock changed type and whether this event is important with respect to fragmentation avoidance or not. This information can be used to help analyse fragmentation avoidance and help decide whether min_free_kbytes should be increased or not. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22tracing, page-allocator: add trace events for page allocation and page freeingMel Gorman1-1/+6
This patch adds trace events for the allocation and freeing of pages, including the freeing of pagevecs. Using the events, it will be known what struct page and pfns are being allocated and freed and what the call site was in many cases. The page alloc tracepoints be used as an indicator as to whether the workload was heavily dependant on the page allocator or not. You can make a guess based on vmstat but you can't get a per-process breakdown. Depending on the call path, the call_site for page allocation may be __get_free_pages() instead of a useful callsite. Instead of passing down a return address similar to slab debugging, the user should enable the stacktrace and seg-addr options to get a proper stack trace. The pagevec free tracepoint has a different usecase. It can be used to get a idea of how many pages are being dumped off the LRU and whether it is kswapd doing the work or a process doing direct reclaim. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22page-allocator: remove dead function free_cold_page()Mel Gorman1-5/+0
The function free_cold_page() has no callers so delete it. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22kcore: fix vread/vwrite to be aware of holesKAMEZAWA Hiroyuki1-23/+176
vread/vwrite access vmalloc area without checking there is a page or not. In most case, this works well. In old ages, the caller of get_vm_ara() is only IOREMAP and there is no memory hole within vm_struct's [addr...addr + size - PAGE_SIZE] ( -PAGE_SIZE is for a guard page.) After per-cpu-alloc patch, it uses get_vm_area() for reserve continuous virtual address but remap _later_. There tend to be a hole in valid vmalloc area in vm_struct lists. Then, skip the hole (not mapped page) is necessary. This patch updates vread/vwrite() for avoiding memory hole. Routines which access vmalloc area without knowing for which addr is used are - /proc/kcore - /dev/kmem kcore checks IOREMAP, /dev/kmem doesn't. After this patch, IOREMAP is checked and /dev/kmem will avoid to read/write it. Fixes to /proc/kcore will be in the next patch in series. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmalloc: unmap vmalloc area after hiding itKAMEZAWA Hiroyuki1-5/+9
vmap area should be purged after vm_struct is removed from the list because vread/vwrite etc...believes the range is valid while it's on vm_struct list. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22page-allocator: change migratetype for all pageblocks within a high-order page during __rmqueue_fallbackMel Gorman1-2/+14
When there are no pages of a target migratetype free, the page allocator selects a high-order block of another migratetype to allocate from. When the order of the page taken is greater than pageblock_order, all pageblocks within that high-order page should change migratetype so that pages are later freed to the correct free-lists. The current behaviour is that pageblocks change migratetype if the order being split matches the pageblock_order. When pageblock_order < MAX_ORDER-1, ownership is not changing correct and pages are being later freed to the incorrect list and this impacts fragmentation avoidance. This patch changes all pageblocks within the high-order page being split to the correct migratetype. Without the patch, allocation success rates for hugepages under stress were about 59% of physical memory on x86-64. With the patch applied, this goes up to 65%. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: kmem_cache_create(): make it easier to catch NULL cache namesBenjamin Herrenschmidt1-0/+3
Right now, if you inadvertently pass NULL to kmem_cache_create() at boot time, it crashes much later after boot somewhere deep inside sysfs which makes it very non obvious to figure out what's going on. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: mremap use err from ksm_madviseHugh Dickins1-3/+5
mremap move's use of ksm_madvise() was assuming -ENOMEM on failure, because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says -ENOMEM (letting madvise convert that to -EAGAIN), and can also say -ERESTARTSYS when signalled: so pass the error from ksm_madvise. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: unmerge is an origin of OOMsHugh Dickins3-3/+5
Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: clean up obsolete referencesHugh Dickins2-2/+13
A few cleanups, given the munlock fix: the comment on ksm_test_exit() no longer applies, and it can be made private to ksm.c; there's no more reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: remove VM_MERGEABLE_FLAGSHugh Dickins1-4/+2
KSM originally stood for Kernel Shared Memory: but the kernel has long supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is something else. So we switched to saying "merge" instead of "share". But Chris Wright points out that this is confusing where mmap.c merges adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by is_mergeable_vma() to let vmas be merged despite flags being different. Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use that directly, with a comment on it in is_mergeable_vma(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: add some documentationHugh Dickins1-0/+1
Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: sysfs and defaultsHugh Dickins1-7/+19
At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y to provide the /sys/kernel/mm/ksm files to tune and activate it. Make KSM depend on SYSFS? Could do, but it might be better to provide some defaults so that KSM works out-of-the-box, ready for testers to madvise MADV_MERGEABLE, even without SYSFS. Though anyone serious is likely to want to retune the numbers to their taste once they have experience; and whether these settings ever reach 2.6.32 can be discussed along the way. Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix deadlock with munlock in exit_mmapAndrea Arcangeli3-13/+3
Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Justin M. Forbes <jforbes@redhat.com> Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix oom deadlockHugh Dickins3-48/+110
There's a now-obvious deadlock in KSM's out-of-memory handling: imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex, trying to allocate a page to break KSM in an mm which becomes the OOM victim (quite likely in the unmerge case): it's killed and goes to exit, and hangs there waiting to acquire ksm_thread_mutex. Clearly we must not require ksm_thread_mutex in __ksm_exit, simple though that made everything else: perhaps use mmap_sem somehow? And part of the answer lies in the comments on unmerge_ksm_pages: __ksm_exit should also leave all the rmap_item removal to ksmd. But there's a fundamental problem, that KSM relies upon mmap_sem to guarantee the consistency of the mm it's dealing with, yet exit_mmap tears down an mm without taking mmap_sem. And bumping mm_users won't help at all, that just ensures that the pages the OOM killer assumes are on their way to being freed will not be freed. The best answer seems to be, to move the ksm_exit callout from just before exit_mmap, to the middle of exit_mmap: after the mm's pages have been freed (if the mmu_gather is flushed), but before its page tables and vma structures have been freed; and down_write,up_write mmap_sem there to serialize with KSM's own reliance on mmap_sem. But KSM then needs to be careful, whenever it downs mmap_sem, to check that the mm is not already exiting: there's a danger of using find_vma on a layout that's being torn apart, or writing into page tables which have been freed for reuse; and even do_anonymous_page and __do_fault need to check they're not being called by break_ksm to reinstate a pte after zap_pte_range has zapped that page table. Though it might be clearer to add an exiting flag, set while holding mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating a zapped pte. All we need is to check whether mm_users is 0 - but must remember that ksmd may detect that before __ksm_exit is reached. So, ksm_test_exit(mm) added to comment such checks on mm->mm_users. __ksm_exit now has to leave clearing up the rmap_items to ksmd, that needs ksm_thread_mutex; but shift the exiting mm just after the ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise mm_count to hold the mm_struct, ksmd's exit processing (exactly like its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it, similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd). But also give __ksm_exit a fast path: when there's no complication (no rmap_items attached to mm and it's not at the ksm_scan cursor), it can safely do all the exiting work itself. This is not just an optimization: when ksmd is not running, the raised mm_count would otherwise leak mm_structs. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: distribute remove_mm_from_listsHugh Dickins1-55/+42
Do some housekeeping in ksm.c, to help make the next patch easier to understand: remove the function remove_mm_from_lists, distributing its code to its callsites scan_get_next_rmap_item and __ksm_exit. That turns out to be a win in scan_get_next_rmap_item: move its remove_trailing_rmap_items and cursor advancement up, and it becomes simpler than before. __ksm_exit becomes messier, but will change again; and moving its remove_trailing_rmap_items up lets us strengthen the unstable tree item's age condition in remove_rmap_item_from_tree. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix endless loop on oomHugh Dickins1-23/+85
break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should only be a problem for ksmd when a memory control group imposes limits (normally the OOM killer will kill others with an mm until it succeeds); but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we do need to route the error (or kill) back to the caller (or sighandling). Test signal_pending in unmerge_ksm_pages, which could be a lengthy procedure if it has to spill into swap: returning -ERESTARTSYS so that trivial signals will restart but fatals will terminate (is that right? we do different things in different places in mm, none exactly this). unmerge_and_remove_all_rmap_items was forgetting to lock when going down the mm_list: fix that. Whether it's successful or not, reset ksm_scan cursor to head; but only if it's successful, reset seqnr (shown in full_scans) - page counts will have gone down to zero. This patch leaves a significant OOM deadlock, but it's a good step on the way, and that deadlock is fixed in a subsequent patch. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: five little cleanupsHugh Dickins1-68/+44
1. We don't use __break_cow entry point now: merge it into break_cow. 2. remove_all_slot_rmap_items is just a special case of remove_trailing_rmap_items: use the latter instead. 3. Extend comment on unmerge_ksm_pages and rmap_items. 4. try_to_merge_two_pages should use try_to_merge_with_ksm_page instead of duplicating its code; and so swap them around. 5. Comment on cmp_and_merge_page described last year's: update it. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: keep quiet while list emptyHugh Dickins1-6/+22
ksm_scan_thread already sleeps in wait_event_interruptible until setting ksm_run activates it; but if there's nothing on its list to look at, i.e. nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking up system time and full_scans: ksmd_should_run added to check that too. And move the mutex_lock out around it: the new counts showed that when ksm_run is stopped, a little work often got done afterwards, because it had been read before taking the mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: break cow once unsharedHugh Dickins1-0/+8
We kept agreeing not to bother about the unswappable shared KSM pages which later become unshared by others: observation suggests they're not a significant proportion. But they are disadvantageous, and it is easier to break COW to replace them by swappable pages, than offer statistics to show that they don't matter; then we can stop worrying about them. Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on this pass: give them a good chance of getting into the unstable tree on the next pass, or back into the stable, by computing checksum now. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: pages_unshared and pages_volatileHugh Dickins1-1/+51
The pages_shared and pages_sharing counts give a good picture of how successful KSM is at sharing; but no clue to how much wasted work it's doing to get there. Add pages_unshared (count of unique pages waiting in the unstable tree, hoping to find a mate) and pages_volatile. pages_volatile is harder to define. It includes those pages changing too fast to get into the unstable tree, but also whatever other edge conditions prevent a page getting into the trees: a high value may deserve investigation. Don't try to calculate it from the various conditions: it's the total of rmap_items less those accounted for. Also show full_scans: the number of completed scans of everything registered in the mm list. The locking for all these counts is simply ksm_thread_mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: move pages_sharing updatesHugh Dickins1-15/+9
The pages_shared count is incremented and decremented when adding a node to and removing a node from the stable tree: easy to understand. But the pages_sharing count was hard to follow, being adjusted in various places: increment and decrement it when adding to and removing from the stable tree. And the pages_sharing variable used to include the pages_shared, then those were subtracted when shown in the pages_sharing sysfs file: now keep it as an exclusive count of leaves hanging off the stable tree nodes, throughout. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: rename kernel_pages_allocatedHugh Dickins1-29/+28
We're not implementing swapping of KSM pages in its first release; but when that follows, "kernel_pages_allocated" will be a very poor name for the sysfs file showing number of nodes in the stable tree: rename that to "pages_shared" throughout. But we already have a "pages_shared", counting those page slots sharing the shared pages: first rename that to... "pages_sharing". What will become of "max_kernel_pages" when the pages shared can be swapped? I guess it will just be removed, so keep that name. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: change ksm nice level to be 5Izik Eidus1-1/+1
ksm should try not to disturb other tasks as much as possible. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: change copyright messageIzik Eidus1-1/+2
Adding Hugh Dickins into the authors list. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: prevent mremap move poisoningHugh Dickins1-0/+12
KSM's scan allows for user pages to be COWed or unmapped at any time, without requiring any notification. But its stable tree does assume that when it finds a KSM page where it placed a KSM page, then it is the same KSM page that it placed there. mremap move could break that assumption: if an area containing a KSM page was unmapped, then an area containing a different KSM page was moved with mremap into the place of the original, before KSM's scan came around to notice. That could then poison a node of the stable tree, so that memcmps would "lie" and upset the ordering of the tree. Probably noone will ever need mremap move on a VM_MERGEABLE area; except that prohibiting it would make trouble for schemes in which we try making everything VM_MERGEABLE e.g. for testing: an mremap which normally works would then fail mysteriously. There's no need to go to any trouble, such as re-sorting KSM's list of rmap_items to match the new layout: simply unmerge the area to COW all its KSM pages before moving, but leave VM_MERGEABLE on so that they're remerged later. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: Kernel SamePage MergingIzik Eidus1-5/+1484
Ksm is code that allows merging of identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only, then COWed when any application tries to change them. Whereas fork() allows sharing anonymous pages between parent and child, ksm can share anonymous pages between unrelated processes. Ksm works by walking over the memory pages of the applications it scans, in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to locate identical pages in an effective way. When ksm finds two identical pages, it marks them as readonly and merges them into a single page. After the pages have been marked as readonly and merged into one, Linux treats them as normal copy-on-write pages, copying to a fresh anonymous page if write access is required later. Ksm scans and merges anonymous pages only in those memory areas that have been registered with it by madvise(addr, length, MADV_MERGEABLE). The ksm scanner is controlled by sysfs files in /sys/kernel/mm/ksm/: max_kernel_pages - the maximum number of unswappable kernel pages which may be allocated by ksm (0 for unlimited). kernel_pages_allocated - how many ksm pages are currently allocated, sharing identical content between different processes (pages unswappable in this release). pages_shared - how many pages have been saved by sharing with ksm pages (kernel_pages_allocated being excluded from this count). pages_to_scan - how many pages ksm should scan before sleeping. sleep_millisecs - how many milliseconds ksm should sleep between scans. run - write 0 to disable ksm, read 0 while ksm is disabled (default), write 1 to run ksm, read 1 while ksm is running, write 2 to disable ksm and unmerge all its pages. Includes contributions by Andrea Arcangeli Chris Wright and Hugh Dickins. [hugh.dickins@tiscali.co.uk: fix rare page leak] Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>