From 58568d2a8215cb6f55caf2332017d7bdff954e1c Mon Sep 17 00:00:00 2001 From: Miao Xie Date: Tue, 16 Jun 2009 15:31:49 -0700 Subject: cpuset,mm: update tasks' mems_allowed in time Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Paul Menage Cc: Nick Piggin Cc: Yasunori Goto Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 17d5f539a9aa..7cc3179e3591 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1569,10 +1569,7 @@ nofail_alloc: /* We now go into synchronous reclaim */ cpuset_memory_pressure_bump(); - /* - * The task's cpuset might have expanded its set of allowable nodes - */ - cpuset_update_task_memory_state(); + p->flags |= PF_MEMALLOC; lockdep_set_current_reclaim_state(gfp_mask); -- cgit v1.2.3-59-g8ed1b From 6c0db4664b49417d80988953e69c323721353227 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 16 Jun 2009 15:31:50 -0700 Subject: mm: alloc_large_system_hash check order On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(), to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash() had better make its own check on the order. Signed-off-by: Hugh Dickins Cc: David Miller Cc: Mel Gorman Cc: Eric Dumazet Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7cc3179e3591..cbed869fd831 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4511,7 +4511,10 @@ void *__init alloc_large_system_hash(const char *tablename, table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); else { unsigned long order = get_order(size); - table = (void*) __get_free_pages(GFP_ATOMIC, order); + + if (order < MAX_ORDER) + table = (void *)__get_free_pages(GFP_ATOMIC, + order); /* * If bucketsize is not a power-of-two, we may free * some pages at the end of hash table. -- cgit v1.2.3-59-g8ed1b From d239171e4f6efd58d7e423853056b1b6a74f1446 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:52 -0700 Subject: page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask() The start of a large patch series to clean up and optimise the page allocator. The performance improvements are in a wide range depending on the exact machine but the results I've seen so fair are approximately; kernbench: 0 to 0.12% (elapsed time) 0.49% to 3.20% (sys time) aim9: -4% to 30% (for page_test and brk_test) tbench: -1% to 4% hackbench: -2.5% to 3.45% (mostly within the noise though) netperf-udp -1.34% to 4.06% (varies between machines a bit) netperf-tcp -0.44% to 5.22% (varies between machines a bit) I haven't sysbench figures at hand, but previously they were within the -0.5% to 2% range. On netperf, the client and server were bound to opposite number CPUs to maximise the problems with cache line bouncing of the struct pages so I expect different people to report different results for netperf depending on their exact machine and how they ran the test (different machines, same cpus client/server, shared cache but two threads client/server, different socket client/server etc). I also measured the vmlinux sizes for a single x86-based config with CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config is based on the Debian Lenny kernel config so I expect it to be reasonably typical. This patch: __alloc_pages_internal is the core page allocator function but essentially it is an alias of __alloc_pages_nodemask. Naming a publicly available and exported function "internal" is also a big ugly. This patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old nodemask function. Warning - This patch renames an exported symbol. No kernel driver is affected by external drivers calling __alloc_pages_internal() should change the call to __alloc_pages_nodemask() without any alteration of parameters. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Reviewed-by: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 12 ++---------- mm/page_alloc.c | 4 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3760e7c5de02..549ec5583103 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -172,24 +172,16 @@ static inline void arch_alloc_page(struct page *page, int order) { } #endif struct page * -__alloc_pages_internal(gfp_t gfp_mask, unsigned int order, +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, nodemask_t *nodemask); static inline struct page * __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist) { - return __alloc_pages_internal(gfp_mask, order, zonelist, NULL); + return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); } -static inline struct page * -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, - struct zonelist *zonelist, nodemask_t *nodemask) -{ - return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask); -} - - static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cbed869fd831..d58df9031503 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1458,7 +1458,7 @@ try_next_zone: * This is the 'heart' of the zoned buddy allocator. */ struct page * -__alloc_pages_internal(gfp_t gfp_mask, unsigned int order, +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, nodemask_t *nodemask) { const gfp_t wait = gfp_mask & __GFP_WAIT; @@ -1667,7 +1667,7 @@ nopage: got_pg: return page; } -EXPORT_SYMBOL(__alloc_pages_internal); +EXPORT_SYMBOL(__alloc_pages_nodemask); /* * Common helper functions. -- cgit v1.2.3-59-g8ed1b From b3c466ce512923298ae8c0121d3e9f397a3f1210 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:53 -0700 Subject: page allocator: do not sanity check order in the fast path No user of the allocator API should be passing in an order >= MAX_ORDER but we check for it on each and every allocation. Delete this check and make it a VM_BUG_ON check further down the call path. [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/] Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Reviewed-by: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 6 ------ mm/page_alloc.c | 3 +++ 2 files changed, 3 insertions(+), 6 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 549ec5583103..c2d3fe03b5d2 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -185,9 +185,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { - if (unlikely(order >= MAX_ORDER)) - return NULL; - /* Unknown node is current node */ if (nid < 0) nid = numa_node_id(); @@ -201,9 +198,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order); static inline struct page * alloc_pages(gfp_t gfp_mask, unsigned int order) { - if (unlikely(order >= MAX_ORDER)) - return NULL; - return alloc_pages_current(gfp_mask, order); } extern struct page *alloc_page_vma(gfp_t gfp_mask, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d58df9031503..bfbd95c0610f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1401,6 +1401,9 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, classzone_idx = zone_idx(preferred_zone); + if (WARN_ON_ONCE(order >= MAX_ORDER)) + return NULL; + zonelist_scan: /* * Scan zonelist, looking for a zone with enough free. -- cgit v1.2.3-59-g8ed1b From 7f82af9742a9346794ecc1515139daed480e7025 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:56 -0700 Subject: page allocator: check only once if the zonelist is suitable for the allocation It is possible with __GFP_THISNODE that no zones are suitable. This patch makes sure the check is only made once. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Reviewed-by: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index bfbd95c0610f..6be8fcb6f74f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1483,9 +1483,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (should_fail_alloc_page(gfp_mask, order)) return NULL; -restart: - z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */ - + /* the list of zones suitable for gfp_mask */ + z = zonelist->_zonerefs; if (unlikely(!z->zone)) { /* * Happens if we have an empty zonelist as a result of @@ -1494,6 +1493,7 @@ restart: return NULL; } +restart: page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); if (page) -- cgit v1.2.3-59-g8ed1b From 11e33f6a55ed7847d9c8ffe185ef87faf7806abe Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:57 -0700 Subject: page allocator: break up the allocator entry point into fast and slow paths The core of the page allocator is one giant function which allocates memory on the stack and makes calculations that may not be needed for every allocation. This patch breaks up the allocator path into fast and slow paths for clarity. Note the slow paths are still inlined but the entry is marked unlikely. If they were not inlined, it actally increases text size to generate the as there is only one call site. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 353 ++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 228 insertions(+), 125 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6be8fcb6f74f..512bf9a618c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1457,47 +1457,171 @@ try_next_zone: return page; } -/* - * This is the 'heart' of the zoned buddy allocator. - */ -struct page * -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, - struct zonelist *zonelist, nodemask_t *nodemask) +static inline int +should_alloc_retry(gfp_t gfp_mask, unsigned int order, + unsigned long pages_reclaimed) { - const gfp_t wait = gfp_mask & __GFP_WAIT; - enum zone_type high_zoneidx = gfp_zone(gfp_mask); - struct zoneref *z; - struct zone *zone; - struct page *page; - struct reclaim_state reclaim_state; - struct task_struct *p = current; - int do_retry; - int alloc_flags; - unsigned long did_some_progress; - unsigned long pages_reclaimed = 0; + /* Do not loop if specifically requested */ + if (gfp_mask & __GFP_NORETRY) + return 0; - lockdep_trace_alloc(gfp_mask); + /* + * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER + * means __GFP_NOFAIL, but that may not be true in other + * implementations. + */ + if (order <= PAGE_ALLOC_COSTLY_ORDER) + return 1; - might_sleep_if(wait); + /* + * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is + * specified, then we retry until we no longer reclaim any pages + * (above), or we've reclaimed an order of pages at least as + * large as the allocation's order. In both cases, if the + * allocation still fails, we stop retrying. + */ + if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order)) + return 1; - if (should_fail_alloc_page(gfp_mask, order)) - return NULL; + /* + * Don't let big-order allocations loop unless the caller + * explicitly requests that. + */ + if (gfp_mask & __GFP_NOFAIL) + return 1; - /* the list of zones suitable for gfp_mask */ - z = zonelist->_zonerefs; - if (unlikely(!z->zone)) { - /* - * Happens if we have an empty zonelist as a result of - * GFP_THISNODE being used on a memoryless node - */ + return 0; +} + +static inline struct page * +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, enum zone_type high_zoneidx, + nodemask_t *nodemask) +{ + struct page *page; + + /* Acquire the OOM killer lock for the zones in zonelist */ + if (!try_set_zone_oom(zonelist, gfp_mask)) { + schedule_timeout_uninterruptible(1); return NULL; } -restart: - page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, - zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); + /* + * Go through the zonelist yet one more time, keep very high watermark + * here, this is only to catch a parallel oom killing, we must fail if + * we're still under heavy pressure. + */ + page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, + order, zonelist, high_zoneidx, + ALLOC_WMARK_HIGH|ALLOC_CPUSET); if (page) - goto got_pg; + goto out; + + /* The OOM killer will not help higher order allocs */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto out; + + /* Exhausted what can be done so it's blamo time */ + out_of_memory(zonelist, gfp_mask, order); + +out: + clear_zonelist_oom(zonelist, gfp_mask); + return page; +} + +/* The really slow allocator path where we enter direct reclaim */ +static inline struct page * +__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, enum zone_type high_zoneidx, + nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress) +{ + struct page *page = NULL; + struct reclaim_state reclaim_state; + struct task_struct *p = current; + + cond_resched(); + + /* We now go into synchronous reclaim */ + cpuset_memory_pressure_bump(); + + /* + * The task's cpuset might have expanded its set of allowable nodes + */ + p->flags |= PF_MEMALLOC; + lockdep_set_current_reclaim_state(gfp_mask); + reclaim_state.reclaimed_slab = 0; + p->reclaim_state = &reclaim_state; + + *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask); + + p->reclaim_state = NULL; + lockdep_clear_current_reclaim_state(); + p->flags &= ~PF_MEMALLOC; + + cond_resched(); + + if (order != 0) + drain_all_pages(); + + if (likely(*did_some_progress)) + page = get_page_from_freelist(gfp_mask, nodemask, order, + zonelist, high_zoneidx, alloc_flags); + return page; +} + +static inline int +is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask) +{ + if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) + && !in_interrupt()) + return 1; + return 0; +} + +/* + * This is called in the allocator slow-path if the allocation request is of + * sufficient urgency to ignore watermarks and take other desperate measures + */ +static inline struct page * +__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, enum zone_type high_zoneidx, + nodemask_t *nodemask) +{ + struct page *page; + + do { + page = get_page_from_freelist(gfp_mask, nodemask, order, + zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); + + if (!page && gfp_mask & __GFP_NOFAIL) + congestion_wait(WRITE, HZ/50); + } while (!page && (gfp_mask & __GFP_NOFAIL)); + + return page; +} + +static inline +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, + enum zone_type high_zoneidx) +{ + struct zoneref *z; + struct zone *zone; + + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) + wakeup_kswapd(zone, order); +} + +static inline struct page * +__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, enum zone_type high_zoneidx, + nodemask_t *nodemask) +{ + const gfp_t wait = gfp_mask & __GFP_WAIT; + struct page *page = NULL; + int alloc_flags; + unsigned long pages_reclaimed = 0; + unsigned long did_some_progress; + struct task_struct *p = current; /* * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and @@ -1510,8 +1634,7 @@ restart: if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE) goto nopage; - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) - wakeup_kswapd(zone, order); + wake_all_kswapd(order, zonelist, high_zoneidx); /* * OK, we're below the kswapd watermark and have kicked background @@ -1531,6 +1654,7 @@ restart: if (wait) alloc_flags |= ALLOC_CPUSET; +restart: /* * Go through the zonelist again. Let __GFP_HIGH and allocations * coming from realtime tasks go deeper into reserves. @@ -1544,23 +1668,18 @@ restart: if (page) goto got_pg; - /* This allocation should allow future memory freeing. */ - rebalance: - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) - && !in_interrupt()) { + /* Allocate without watermarks if the context allows */ + if (is_allocation_high_priority(p, gfp_mask)) { + /* Do not dip into emergency reserves if specified */ if (!(gfp_mask & __GFP_NOMEMALLOC)) { -nofail_alloc: - /* go through the zonelist yet again, ignoring mins */ - page = get_page_from_freelist(gfp_mask, nodemask, order, - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); + page = __alloc_pages_high_priority(gfp_mask, order, + zonelist, high_zoneidx, nodemask); if (page) goto got_pg; - if (gfp_mask & __GFP_NOFAIL) { - congestion_wait(WRITE, HZ/50); - goto nofail_alloc; - } } + + /* Ensure no recursion into the allocator */ goto nopage; } @@ -1568,93 +1687,42 @@ nofail_alloc: if (!wait) goto nopage; - cond_resched(); - - /* We now go into synchronous reclaim */ - cpuset_memory_pressure_bump(); - - p->flags |= PF_MEMALLOC; - - lockdep_set_current_reclaim_state(gfp_mask); - reclaim_state.reclaimed_slab = 0; - p->reclaim_state = &reclaim_state; - - did_some_progress = try_to_free_pages(zonelist, order, - gfp_mask, nodemask); - - p->reclaim_state = NULL; - lockdep_clear_current_reclaim_state(); - p->flags &= ~PF_MEMALLOC; + /* Try direct reclaim and then allocating */ + page = __alloc_pages_direct_reclaim(gfp_mask, order, + zonelist, high_zoneidx, + nodemask, + alloc_flags, &did_some_progress); + if (page) + goto got_pg; - cond_resched(); + /* + * If we failed to make any progress reclaiming, then we are + * running out of options and have to consider going OOM + */ + if (!did_some_progress) { + if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { + page = __alloc_pages_may_oom(gfp_mask, order, + zonelist, high_zoneidx, + nodemask); + if (page) + goto got_pg; - if (order != 0) - drain_all_pages(); + /* + * The OOM killer does not trigger for high-order allocations + * but if no progress is being made, there are no other + * options and retrying is unlikely to help + */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto nopage; - if (likely(did_some_progress)) { - page = get_page_from_freelist(gfp_mask, nodemask, order, - zonelist, high_zoneidx, alloc_flags); - if (page) - goto got_pg; - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { - if (!try_set_zone_oom(zonelist, gfp_mask)) { - schedule_timeout_uninterruptible(1); goto restart; } - - /* - * Go through the zonelist yet one more time, keep - * very high watermark here, this is only to catch - * a parallel oom killing, we must fail if we're still - * under heavy pressure. - */ - page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, - order, zonelist, high_zoneidx, - ALLOC_WMARK_HIGH|ALLOC_CPUSET); - if (page) { - clear_zonelist_oom(zonelist, gfp_mask); - goto got_pg; - } - - /* The OOM killer will not help higher order allocs so fail */ - if (order > PAGE_ALLOC_COSTLY_ORDER) { - clear_zonelist_oom(zonelist, gfp_mask); - goto nopage; - } - - out_of_memory(zonelist, gfp_mask, order); - clear_zonelist_oom(zonelist, gfp_mask); - goto restart; } - /* - * Don't let big-order allocations loop unless the caller explicitly - * requests that. Wait for some write requests to complete then retry. - * - * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER - * means __GFP_NOFAIL, but that may not be true in other - * implementations. - * - * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is - * specified, then we retry until we no longer reclaim any pages - * (above), or we've reclaimed an order of pages at least as - * large as the allocation's order. In both cases, if the - * allocation still fails, we stop retrying. - */ + /* Check if we should retry the allocation */ pages_reclaimed += did_some_progress; - do_retry = 0; - if (!(gfp_mask & __GFP_NORETRY)) { - if (order <= PAGE_ALLOC_COSTLY_ORDER) { - do_retry = 1; - } else { - if (gfp_mask & __GFP_REPEAT && - pages_reclaimed < (1 << order)) - do_retry = 1; - } - if (gfp_mask & __GFP_NOFAIL) - do_retry = 1; - } - if (do_retry) { + if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) { + /* Wait for some write requests to complete then retry */ congestion_wait(WRITE, HZ/50); goto rebalance; } @@ -1669,6 +1737,41 @@ nopage: } got_pg: return page; + +} + +/* + * This is the 'heart' of the zoned buddy allocator. + */ +struct page * +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, nodemask_t *nodemask) +{ + enum zone_type high_zoneidx = gfp_zone(gfp_mask); + struct page *page; + + lockdep_trace_alloc(gfp_mask); + + might_sleep_if(gfp_mask & __GFP_WAIT); + + if (should_fail_alloc_page(gfp_mask, order)) + return NULL; + + /* + * Check the zones suitable for the gfp_mask contain at least one + * valid zone. It's possible to have an empty zonelist as a result + * of GFP_THISNODE and a memoryless node + */ + if (unlikely(!zonelist->_zonerefs->zone)) + return NULL; + + page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, + zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); + if (unlikely(!page)) + page = __alloc_pages_slowpath(gfp_mask, order, + zonelist, high_zoneidx, nodemask); + + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); -- cgit v1.2.3-59-g8ed1b From 49255c619fbd482d704289b5eb2795f8e3b7ff2e Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:58 -0700 Subject: page allocator: move check for disabled anti-fragmentation out of fastpath On low-memory systems, anti-fragmentation gets disabled as there is nothing it can do and it would just incur overhead shuffling pages between lists constantly. Currently the check is made in the free page fast path for every page. This patch moves it to a slow path. On machines with low memory, there will be small amount of additional overhead as pages get shuffled between lists but it should quickly settle. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 3 --- mm/page_alloc.c | 4 ++++ 2 files changed, 4 insertions(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a47c879e1304..0aa4445b0b8a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled; static inline int get_pageblock_migratetype(struct page *page) { - if (unlikely(page_group_by_mobility_disabled)) - return MIGRATE_UNMOVABLE; - return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 512bf9a618c7..b09859629e93 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -168,6 +168,10 @@ int page_group_by_mobility_disabled __read_mostly; static void set_pageblock_migratetype(struct page *page, int migratetype) { + + if (unlikely(page_group_by_mobility_disabled)) + migratetype = MIGRATE_UNMOVABLE; + set_pageblock_flags_group(page, (unsigned long)migratetype, PB_migrate, PB_migrate_end); } -- cgit v1.2.3-59-g8ed1b From 5117f45d11a9ee62d9b086f1312f3f31781ff155 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:31:59 -0700 Subject: page allocator: calculate the preferred zone for allocation only once get_page_from_freelist() can be called multiple times for an allocation. Part of this calculates the preferred_zone which is the first usable zone in the zonelist but the zone depends on the GFP flags specified at the beginning of the allocation call. This patch calculates preferred_zone once. It's safe to do this because if preferred_zone is NULL at the start of the call, no amount of direct reclaim or other actions will change the fact the allocation will fail. [akpm@linux-foundation.org: remove (void) casts] Signed-off-by: Mel Gorman Reviewed-by: KOSAKI Motohiro Reviewed-by: Pekka Enberg Cc: Christoph Lameter Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 54 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 32 insertions(+), 22 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b09859629e93..8fc6d1f42f0b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1388,26 +1388,21 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) */ static struct page * get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, - struct zonelist *zonelist, int high_zoneidx, int alloc_flags) + struct zonelist *zonelist, int high_zoneidx, int alloc_flags, + struct zone *preferred_zone) { struct zoneref *z; struct page *page = NULL; int classzone_idx; - struct zone *zone, *preferred_zone; + struct zone *zone; nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */ int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ - (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask, - &preferred_zone); - if (!preferred_zone) - return NULL; - - classzone_idx = zone_idx(preferred_zone); - if (WARN_ON_ONCE(order >= MAX_ORDER)) return NULL; + classzone_idx = zone_idx(preferred_zone); zonelist_scan: /* * Scan zonelist, looking for a zone with enough free. @@ -1500,7 +1495,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order, static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask) + nodemask_t *nodemask, struct zone *preferred_zone) { struct page *page; @@ -1517,7 +1512,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, - ALLOC_WMARK_HIGH|ALLOC_CPUSET); + ALLOC_WMARK_HIGH|ALLOC_CPUSET, + preferred_zone); if (page) goto out; @@ -1537,7 +1533,8 @@ out: static inline struct page * __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress) + nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, + unsigned long *did_some_progress) { struct page *page = NULL; struct reclaim_state reclaim_state; @@ -1569,7 +1566,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, if (likely(*did_some_progress)) page = get_page_from_freelist(gfp_mask, nodemask, order, - zonelist, high_zoneidx, alloc_flags); + zonelist, high_zoneidx, + alloc_flags, preferred_zone); return page; } @@ -1589,13 +1587,14 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask) static inline struct page * __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask) + nodemask_t *nodemask, struct zone *preferred_zone) { struct page *page; do { page = get_page_from_freelist(gfp_mask, nodemask, order, - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); + zonelist, high_zoneidx, ALLOC_NO_WATERMARKS, + preferred_zone); if (!page && gfp_mask & __GFP_NOFAIL) congestion_wait(WRITE, HZ/50); @@ -1618,7 +1617,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask) + nodemask_t *nodemask, struct zone *preferred_zone) { const gfp_t wait = gfp_mask & __GFP_WAIT; struct page *page = NULL; @@ -1668,7 +1667,8 @@ restart: * See also cpuset_zone_allowed() comment in kernel/cpuset.c. */ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, - high_zoneidx, alloc_flags); + high_zoneidx, alloc_flags, + preferred_zone); if (page) goto got_pg; @@ -1678,7 +1678,7 @@ rebalance: /* Do not dip into emergency reserves if specified */ if (!(gfp_mask & __GFP_NOMEMALLOC)) { page = __alloc_pages_high_priority(gfp_mask, order, - zonelist, high_zoneidx, nodemask); + zonelist, high_zoneidx, nodemask, preferred_zone); if (page) goto got_pg; } @@ -1695,7 +1695,8 @@ rebalance: page = __alloc_pages_direct_reclaim(gfp_mask, order, zonelist, high_zoneidx, nodemask, - alloc_flags, &did_some_progress); + alloc_flags, preferred_zone, + &did_some_progress); if (page) goto got_pg; @@ -1707,7 +1708,7 @@ rebalance: if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, - nodemask); + nodemask, preferred_zone); if (page) goto got_pg; @@ -1752,6 +1753,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, nodemask_t *nodemask) { enum zone_type high_zoneidx = gfp_zone(gfp_mask); + struct zone *preferred_zone; struct page *page; lockdep_trace_alloc(gfp_mask); @@ -1769,11 +1771,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; + /* The preferred zone is used for statistics later */ + first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone); + if (!preferred_zone) + return NULL; + + /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, - zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET); + zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET, + preferred_zone); if (unlikely(!page)) page = __alloc_pages_slowpath(gfp_mask, order, - zonelist, high_zoneidx, nodemask); + zonelist, high_zoneidx, nodemask, + preferred_zone); return page; } -- cgit v1.2.3-59-g8ed1b From 3dd2826698b6902aafd9441ce28ebb44735fd0d6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:00 -0700 Subject: page allocator: calculate the migratetype for allocation only once GFP mask is converted into a migratetype when deciding which pagelist to take a page from. However, it is happening multiple times per allocation, at least once per zone traversed. Calculate it once. Signed-off-by: Mel Gorman Cc: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 43 ++++++++++++++++++++++++++----------------- 1 file changed, 26 insertions(+), 17 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fc6d1f42f0b..d3be076ea9c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1061,13 +1061,13 @@ void split_page(struct page *page, unsigned int order) * or two. */ static struct page *buffered_rmqueue(struct zone *preferred_zone, - struct zone *zone, int order, gfp_t gfp_flags) + struct zone *zone, int order, gfp_t gfp_flags, + int migratetype) { unsigned long flags; struct page *page; int cold = !!(gfp_flags & __GFP_COLD); int cpu; - int migratetype = allocflags_to_migratetype(gfp_flags); again: cpu = get_cpu(); @@ -1389,7 +1389,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z) static struct page * get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, struct zonelist *zonelist, int high_zoneidx, int alloc_flags, - struct zone *preferred_zone) + struct zone *preferred_zone, int migratetype) { struct zoneref *z; struct page *page = NULL; @@ -1433,7 +1433,8 @@ zonelist_scan: } } - page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask); + page = buffered_rmqueue(preferred_zone, zone, order, + gfp_mask, migratetype); if (page) break; this_zone_full: @@ -1495,7 +1496,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order, static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask, struct zone *preferred_zone) + nodemask_t *nodemask, struct zone *preferred_zone, + int migratetype) { struct page *page; @@ -1513,7 +1515,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET, - preferred_zone); + preferred_zone, migratetype); if (page) goto out; @@ -1534,7 +1536,7 @@ static inline struct page * __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, - unsigned long *did_some_progress) + int migratetype, unsigned long *did_some_progress) { struct page *page = NULL; struct reclaim_state reclaim_state; @@ -1567,7 +1569,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, if (likely(*did_some_progress)) page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, - alloc_flags, preferred_zone); + alloc_flags, preferred_zone, + migratetype); return page; } @@ -1587,14 +1590,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask) static inline struct page * __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask, struct zone *preferred_zone) + nodemask_t *nodemask, struct zone *preferred_zone, + int migratetype) { struct page *page; do { page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, ALLOC_NO_WATERMARKS, - preferred_zone); + preferred_zone, migratetype); if (!page && gfp_mask & __GFP_NOFAIL) congestion_wait(WRITE, HZ/50); @@ -1617,7 +1621,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, - nodemask_t *nodemask, struct zone *preferred_zone) + nodemask_t *nodemask, struct zone *preferred_zone, + int migratetype) { const gfp_t wait = gfp_mask & __GFP_WAIT; struct page *page = NULL; @@ -1668,7 +1673,8 @@ restart: */ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags, - preferred_zone); + preferred_zone, + migratetype); if (page) goto got_pg; @@ -1678,7 +1684,8 @@ rebalance: /* Do not dip into emergency reserves if specified */ if (!(gfp_mask & __GFP_NOMEMALLOC)) { page = __alloc_pages_high_priority(gfp_mask, order, - zonelist, high_zoneidx, nodemask, preferred_zone); + zonelist, high_zoneidx, nodemask, preferred_zone, + migratetype); if (page) goto got_pg; } @@ -1696,7 +1703,7 @@ rebalance: zonelist, high_zoneidx, nodemask, alloc_flags, preferred_zone, - &did_some_progress); + migratetype, &did_some_progress); if (page) goto got_pg; @@ -1708,7 +1715,8 @@ rebalance: if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, - nodemask, preferred_zone); + nodemask, preferred_zone, + migratetype); if (page) goto got_pg; @@ -1755,6 +1763,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; struct page *page; + int migratetype = allocflags_to_migratetype(gfp_mask); lockdep_trace_alloc(gfp_mask); @@ -1779,11 +1788,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET, - preferred_zone); + preferred_zone, migratetype); if (unlikely(!page)) page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, - preferred_zone); + preferred_zone, migratetype); return page; } -- cgit v1.2.3-59-g8ed1b From 341ce06f69abfafa31b9468410a13dbd60e2b237 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 16 Jun 2009 15:32:02 -0700 Subject: page allocator: calculate the alloc_flags for allocation only once Factor out the mapping between GFP and alloc_flags only once. Once factored out, it only needs to be calculated once but some care must be taken. [neilb@suse.de says] As the test: - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) - && !in_interrupt()) { - if (!(gfp_mask & __GFP_NOMEMALLOC)) { has been replaced with a slightly weaker one: + if (alloc_flags & ALLOC_NO_WATERMARKS) { Without care, this would allow recursion into the allocator via direct reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set but TF_MEMDIE callers are now allowed to directly reclaim where they would have been prevented in the past. Signed-off-by: Peter Zijlstra Acked-by: Pekka Enberg Signed-off-by: Mel Gorman Cc: Neil Brown Cc: Christoph Lameter Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 94 ++++++++++++++++++++++++++++++--------------------------- 1 file changed, 50 insertions(+), 44 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d3be076ea9c5..ef870fb92f74 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1574,15 +1574,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, return page; } -static inline int -is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask) -{ - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) - && !in_interrupt()) - return 1; - return 0; -} - /* * This is called in the allocator slow-path if the allocation request is of * sufficient urgency to ignore watermarks and take other desperate measures @@ -1618,6 +1609,42 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist, wakeup_kswapd(zone, order); } +static inline int +gfp_to_alloc_flags(gfp_t gfp_mask) +{ + struct task_struct *p = current; + int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET; + const gfp_t wait = gfp_mask & __GFP_WAIT; + + /* + * The caller may dip into page reserves a bit more if the caller + * cannot run direct reclaim, or if the caller has realtime scheduling + * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will + * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). + */ + if (gfp_mask & __GFP_HIGH) + alloc_flags |= ALLOC_HIGH; + + if (!wait) { + alloc_flags |= ALLOC_HARDER; + /* + * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. + * See also cpuset_zone_allowed() comment in kernel/cpuset.c. + */ + alloc_flags &= ~ALLOC_CPUSET; + } else if (unlikely(rt_task(p))) + alloc_flags |= ALLOC_HARDER; + + if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) { + if (!in_interrupt() && + ((p->flags & PF_MEMALLOC) || + unlikely(test_thread_flag(TIF_MEMDIE)))) + alloc_flags |= ALLOC_NO_WATERMARKS; + } + + return alloc_flags; +} + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, @@ -1648,56 +1675,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * OK, we're below the kswapd watermark and have kicked background * reclaim. Now things get more complex, so set up alloc_flags according * to how we want to proceed. - * - * The caller may dip into page reserves a bit more if the caller - * cannot run direct reclaim, or if the caller has realtime scheduling - * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will - * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). */ - alloc_flags = ALLOC_WMARK_MIN; - if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) - alloc_flags |= ALLOC_HARDER; - if (gfp_mask & __GFP_HIGH) - alloc_flags |= ALLOC_HIGH; - if (wait) - alloc_flags |= ALLOC_CPUSET; + alloc_flags = gfp_to_alloc_flags(gfp_mask); restart: - /* - * Go through the zonelist again. Let __GFP_HIGH and allocations - * coming from realtime tasks go deeper into reserves. - * - * This is the last chance, in general, before the goto nopage. - * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. - * See also cpuset_zone_allowed() comment in kernel/cpuset.c. - */ + /* This is the last chance, in general, before the goto nopage. */ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, - high_zoneidx, alloc_flags, - preferred_zone, - migratetype); + high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, + preferred_zone, migratetype); if (page) goto got_pg; rebalance: /* Allocate without watermarks if the context allows */ - if (is_allocation_high_priority(p, gfp_mask)) { - /* Do not dip into emergency reserves if specified */ - if (!(gfp_mask & __GFP_NOMEMALLOC)) { - page = __alloc_pages_high_priority(gfp_mask, order, - zonelist, high_zoneidx, nodemask, preferred_zone, - migratetype); - if (page) - goto got_pg; - } - - /* Ensure no recursion into the allocator */ - goto nopage; + if (alloc_flags & ALLOC_NO_WATERMARKS) { + page = __alloc_pages_high_priority(gfp_mask, order, + zonelist, high_zoneidx, nodemask, + preferred_zone, migratetype); + if (page) + goto got_pg; } /* Atomic allocations - we can't balance anything */ if (!wait) goto nopage; + /* Avoid recursion of direct reclaim */ + if (p->flags & PF_MEMALLOC) + goto nopage; + /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, zonelist, high_zoneidx, -- cgit v1.2.3-59-g8ed1b From a56f57ff94c25d5d80def06f3ed8fe7f99147762 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:02 -0700 Subject: page allocator: remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these flags are equal to each other, we can eliminate a branch. [akpm@linux-foundation.org: Suggested the hack] Signed-off-by: Mel Gorman Reviewed-by: Pekka Enberg Reviewed-by: KOSAKI Motohiro Cc: Christoph Lameter Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ef870fb92f74..94f33e2b7f0b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1616,14 +1616,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask) int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET; const gfp_t wait = gfp_mask & __GFP_WAIT; + /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */ + BUILD_BUG_ON(__GFP_HIGH != ALLOC_HIGH); + /* * The caller may dip into page reserves a bit more if the caller * cannot run direct reclaim, or if the caller has realtime scheduling * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). */ - if (gfp_mask & __GFP_HIGH) - alloc_flags |= ALLOC_HIGH; + alloc_flags |= (gfp_mask & __GFP_HIGH); if (!wait) { alloc_flags |= ALLOC_HARDER; -- cgit v1.2.3-59-g8ed1b From 728ec980fb9fa2d65d9e05444079a53615985e7b Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:04 -0700 Subject: page allocator: inline __rmqueue_smallest() Inline __rmqueue_smallest by altering flow very slightly so that there is only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, this patch reduces text by 16 bytes. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 94f33e2b7f0b..04713f649fd4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -661,7 +661,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) * Go through the free lists for the given migratetype and remove * the smallest available page from the freelists */ -static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, +static inline +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, int migratetype) { unsigned int current_order; @@ -831,8 +832,7 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order, } } - /* Use MIGRATE_RESERVE rather than fail an allocation */ - return __rmqueue_smallest(zone, order, MIGRATE_RESERVE); + return NULL; } /* @@ -844,11 +844,23 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order, { struct page *page; +retry_reserve: page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) + if (unlikely(!page) && migratetype != MIGRATE_RESERVE) { page = __rmqueue_fallback(zone, order, migratetype); + /* + * Use MIGRATE_RESERVE rather than fail an allocation. goto + * is used because __rmqueue_smallest is an inline function + * and we want just one call site + */ + if (!page) { + migratetype = MIGRATE_RESERVE; + goto retry_reserve; + } + } + return page; } -- cgit v1.2.3-59-g8ed1b From 0a15c3e9f649f71464ac39e6378f1fde6f995322 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:05 -0700 Subject: page allocator: inline buffered_rmqueue() buffered_rmqueue() is in the fast path so inline it. Because it only has one call site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman Reviewed-by: KOSAKI Motohiro Cc: Christoph Lameter Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 04713f649fd4..c101921e6a64 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1072,7 +1072,8 @@ void split_page(struct page *page, unsigned int order) * we cheat by calling it from here, in the order > 0 path. Saves a branch * or two. */ -static struct page *buffered_rmqueue(struct zone *preferred_zone, +static inline +struct page *buffered_rmqueue(struct zone *preferred_zone, struct zone *zone, int order, gfp_t gfp_flags, int migratetype) { -- cgit v1.2.3-59-g8ed1b From 0ac3a4099b0171ff965836182bc688bb8ca01058 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:06 -0700 Subject: page allocator: inline __rmqueue_fallback() __rmqueue_fallback() is in the slow path but has only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman Cc: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c101921e6a64..91e29b3ed2b6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -771,8 +771,8 @@ static int move_freepages_block(struct zone *zone, struct page *page, } /* Remove an element from the buddy allocator from the fallback list */ -static struct page *__rmqueue_fallback(struct zone *zone, int order, - int start_migratetype) +static inline struct page * +__rmqueue_fallback(struct zone *zone, int order, int start_migratetype) { struct free_area * area; int current_order; -- cgit v1.2.3-59-g8ed1b From ed0ae21dc5fe3b9ad4cf1c7bb2bfd2ad596c481c Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:07 -0700 Subject: page allocator: do not call get_pageblock_migratetype() more than necessary get_pageblock_migratetype() is potentially called twice for every page free. Once, when being freed to the pcp lists and once when being freed back to buddy. When freeing from the pcp lists, it is known what the pageblock type was at the time of free so use it rather than rechecking. In low memory situations under memory pressure, this might skew anti-fragmentation slightly but the interference is minimal and decisions that are fragmenting memory are being made anyway. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 91e29b3ed2b6..8f334d339b08 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -452,16 +452,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, */ static inline void __free_one_page(struct page *page, - struct zone *zone, unsigned int order) + struct zone *zone, unsigned int order, + int migratetype) { unsigned long page_idx; int order_size = 1 << order; - int migratetype = get_pageblock_migratetype(page); if (unlikely(PageCompound(page))) if (unlikely(destroy_compound_page(page, order))) return; + VM_BUG_ON(migratetype == -1); + page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); VM_BUG_ON(page_idx & (order_size - 1)); @@ -530,17 +532,18 @@ static void free_pages_bulk(struct zone *zone, int count, page = list_entry(list->prev, struct page, lru); /* have to delete it as __free_one_page list manipulates */ list_del(&page->lru); - __free_one_page(page, zone, order); + __free_one_page(page, zone, order, page_private(page)); } spin_unlock(&zone->lock); } -static void free_one_page(struct zone *zone, struct page *page, int order) +static void free_one_page(struct zone *zone, struct page *page, int order, + int migratetype) { spin_lock(&zone->lock); zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); zone->pages_scanned = 0; - __free_one_page(page, zone, order); + __free_one_page(page, zone, order, migratetype); spin_unlock(&zone->lock); } @@ -565,7 +568,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) local_irq_save(flags); __count_vm_events(PGFREE, 1 << order); - free_one_page(page_zone(page), page, order); + free_one_page(page_zone(page), page, order, + get_pageblock_migratetype(page)); local_irq_restore(flags); } -- cgit v1.2.3-59-g8ed1b From da456f14d2f2d7350f2b9440af79c85a34c7eed5 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:08 -0700 Subject: page allocator: do not disable interrupts in free_page_mlock() free_page_mlock() tests and clears PG_mlocked using locked versions of the bit operations. If set, it disables interrupts to update counters and this happens on every page free even though interrupts are disabled very shortly afterwards a second time. This is wasteful. This patch splits what free_page_mlock() does. The bit check is still made. However, the update of counters is delayed until the interrupts are disabled and the non-lock version for clearing the bit is used. One potential weirdness with this split is that the counters do not get updated if the bad_page() check is triggered but a system showing bad pages is getting screwed already. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: Pekka Enberg Reviewed-by: KOSAKI Motohiro Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Acked-by: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 11 +++-------- mm/page_alloc.c | 8 +++++++- 2 files changed, 10 insertions(+), 9 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/internal.h b/mm/internal.h index 987bb03fbdd8..58ec1bc262c3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page) */ static inline void free_page_mlock(struct page *page) { - if (unlikely(TestClearPageMlocked(page))) { - unsigned long flags; - - local_irq_save(flags); - __dec_zone_page_state(page, NR_MLOCK); - __count_vm_event(UNEVICTABLE_MLOCKFREED); - local_irq_restore(flags); - } + __ClearPageMlocked(page); + __dec_zone_page_state(page, NR_MLOCK); + __count_vm_event(UNEVICTABLE_MLOCKFREED); } #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8f334d339b08..03a386d24ef2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -495,7 +495,6 @@ static inline void __free_one_page(struct page *page, static inline int free_pages_check(struct page *page) { - free_page_mlock(page); if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (page_count(page) != 0) | @@ -552,6 +551,7 @@ static void __free_pages_ok(struct page *page, unsigned int order) unsigned long flags; int i; int bad = 0; + int clearMlocked = PageMlocked(page); for (i = 0 ; i < (1 << order) ; ++i) bad += free_pages_check(page + i); @@ -567,6 +567,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) kernel_map_pages(page, 1 << order, 0); local_irq_save(flags); + if (unlikely(clearMlocked)) + free_page_mlock(page); __count_vm_events(PGFREE, 1 << order); free_one_page(page_zone(page), page, order, get_pageblock_migratetype(page)); @@ -1013,6 +1015,7 @@ static void free_hot_cold_page(struct page *page, int cold) struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; unsigned long flags; + int clearMlocked = PageMlocked(page); if (PageAnon(page)) page->mapping = NULL; @@ -1028,7 +1031,10 @@ static void free_hot_cold_page(struct page *page, int cold) pcp = &zone_pcp(zone, get_cpu())->pcp; local_irq_save(flags); + if (unlikely(clearMlocked)) + free_page_mlock(page); __count_vm_event(PGFREE); + if (cold) list_add_tail(&page->lru, &pcp->list); else -- cgit v1.2.3-59-g8ed1b From d395b73428d9748fb70b33477c9b2acae62f360a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:09 -0700 Subject: page allocator: do not setup zonelist cache when there is only one node There is a zonelist cache which is used to track zones that are not in the allowed cpuset or found to be recently full. This is to reduce cache footprint on large machines. On smaller machines, it just incurs cost for no gain. This patch only uses the zonelist cache when there are NUMA nodes. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 03a386d24ef2..fd8e3ca0cf3b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1464,8 +1464,11 @@ this_zone_full: if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); try_next_zone: - if (NUMA_BUILD && !did_zlc_setup) { - /* we do zlc_setup after the first zone is tried */ + if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) { + /* + * we do zlc_setup after the first zone is tried but only + * if there are multiple nodes make it worthwhile + */ allowednodes = zlc_setup(zonelist, alloc_flags); zlc_active = 1; did_zlc_setup = 1; -- cgit v1.2.3-59-g8ed1b From a3af9c389a7f3e675313f442fdd8c247c1cdb66b Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 16 Jun 2009 15:32:10 -0700 Subject: page allocator: do not check for compound pages during the page allocator sanity checks A number of sanity checks are made on each page allocation and free including that the page count is zero. page_count() checks for compound pages and checks the count of the head page if true. However, in these paths, we do not care if the page is compound or not as the count of each tail page should also be zero. This patch makes two changes to the use of page_count() in the free path. It converts one check of page_count() to a VM_BUG_ON() as the count should have been unconditionally checked earlier in the free path. It also avoids checking for compound pages. [mel@csn.ul.ie: Wrote changelog] Signed-off-by: Mel Gorman Signed-off-by: Nick Piggin Reviewed-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd8e3ca0cf3b..8485735fc690 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -421,7 +421,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, return 0; if (PageBuddy(buddy) && page_order(buddy) == order) { - BUG_ON(page_count(buddy) != 0); + VM_BUG_ON(page_count(buddy) != 0); return 1; } return 0; @@ -497,7 +497,7 @@ static inline int free_pages_check(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | - (page_count(page) != 0) | + (atomic_read(&page->_count) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) { bad_page(page); return 1; @@ -642,7 +642,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | - (page_count(page) != 0) | + (atomic_read(&page->_count) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_PREP))) { bad_page(page); return 1; -- cgit v1.2.3-59-g8ed1b From 418589663d6011de9006425b6c5721e1544fb47a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:12 -0700 Subject: page allocator: use allocation flags as an index to the zone watermark ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 11 +++++----- Documentation/vm/balance | 18 ++++++++-------- arch/m32r/mm/discontig.c | 6 +++--- include/linux/mmzone.h | 16 +++++++++++++- mm/page_alloc.c | 51 +++++++++++++++++++++++---------------------- mm/vmscan.c | 39 ++++++++++++++++++---------------- mm/vmstat.c | 6 +++--- 7 files changed, 83 insertions(+), 64 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 6fab2dcbb4d3..0ea5adbc5b16 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -233,8 +233,8 @@ These protections are added to score to judge whether this zone should be used for page allocation or should be reclaimed. In this example, if normal pages (index=2) are required to this DMA zone and -pages_high is used for watermark, the kernel judges this zone should not be -used because pages_free(1355) is smaller than watermark + protection[2] +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] (4 + 2004 = 2008). If this protection value is 0, this zone would be used for normal page requirement. If requirement is DMA zone(index=0), protection[0] (=0) is used. @@ -280,9 +280,10 @@ The default value is 65536. min_free_kbytes: This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a pages_min -value for each lowmem zone in the system. Each lowmem zone gets -a number of reserved free pages based proportionally on its size. +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will diff --git a/Documentation/vm/balance b/Documentation/vm/balance index bd3d31bc4915..c46e68cf9344 100644 --- a/Documentation/vm/balance +++ b/Documentation/vm/balance @@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would alleviate memory pressure on any zone in the page's node that has fallen below its watermark. -pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are -per-zone fields, used to determine when a zone needs to be balanced. When -the number of pages falls below pages_min, the hysteric field low_on_memory -gets set. This stays set till the number of free pages becomes pages_high. -When low_on_memory is set, page allocation requests will try to free some -pages in the zone (providing GFP_WAIT is set in the request). Orthogonal -to this, is the decision to poke kswapd to free some zone pages. That -decision is not hysteresis based, and is done when the number of free -pages is below pages_low; in which case zone_wake_kswapd is also set. +watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These +are per-zone fields, used to determine when a zone needs to be balanced. When +the number of pages falls below watermark[WMARK_MIN], the hysteric field +low_on_memory gets set. This stays set till the number of free pages becomes +watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will +try to free some pages in the zone (providing GFP_WAIT is set in the request). +Orthogonal to this, is the decision to poke kswapd to free some zone pages. +That decision is not hysteresis based, and is done when the number of free +pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set. (Good) Ideas that I have heard: diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c index 7daf897292cf..b7a78ad429b7 100644 --- a/arch/m32r/mm/discontig.c +++ b/arch/m32r/mm/discontig.c @@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void) * Use all area of internal RAM. * see __alloc_pages() */ - NODE_DATA(1)->node_zones->pages_min = 0; - NODE_DATA(1)->node_zones->pages_low = 0; - NODE_DATA(1)->node_zones->pages_high = 0; + NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0; + NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0; + NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0; return holes; } diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0aa4445b0b8a..dd8487f0442f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -163,6 +163,17 @@ static inline int is_unevictable_lru(enum lru_list l) #endif } +enum zone_watermarks { + WMARK_MIN, + WMARK_LOW, + WMARK_HIGH, + NR_WMARK +}; + +#define min_wmark_pages(z) (z->watermark[WMARK_MIN]) +#define low_wmark_pages(z) (z->watermark[WMARK_LOW]) +#define high_wmark_pages(z) (z->watermark[WMARK_HIGH]) + struct per_cpu_pages { int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ @@ -275,7 +286,10 @@ struct zone_reclaim_stat { struct zone { /* Fields commonly accessed by the page allocator */ - unsigned long pages_min, pages_low, pages_high; + + /* zone watermarks, access with *_wmark_pages(zone) macros */ + unsigned long watermark[NR_WMARK]; + /* * We don't know if the memory that we're going to allocate will be freeable * or/and it will be released eventually, so to avoid totally wasting several diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8485735fc690..abe26003124d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1150,10 +1150,15 @@ failed: return NULL; } -#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */ -#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */ -#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */ -#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ +/* The ALLOC_WMARK bits are used as an index to zone->watermark */ +#define ALLOC_WMARK_MIN WMARK_MIN +#define ALLOC_WMARK_LOW WMARK_LOW +#define ALLOC_WMARK_HIGH WMARK_HIGH +#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */ + +/* Mask to get the watermark bits */ +#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) + #define ALLOC_HARDER 0x10 /* try to alloc harder */ #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ @@ -1440,14 +1445,10 @@ zonelist_scan: !cpuset_zone_allowed_softwall(zone, gfp_mask)) goto try_next_zone; + BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { unsigned long mark; - if (alloc_flags & ALLOC_WMARK_MIN) - mark = zone->pages_min; - else if (alloc_flags & ALLOC_WMARK_LOW) - mark = zone->pages_low; - else - mark = zone->pages_high; + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) { if (!zone_reclaim_mode || @@ -1959,7 +1960,7 @@ static unsigned int nr_free_zone_pages(int offset) for_each_zone_zonelist(zone, z, zonelist, offset) { unsigned long size = zone->present_pages; - unsigned long high = zone->pages_high; + unsigned long high = high_wmark_pages(zone); if (size > high) sum += size - high; } @@ -2096,9 +2097,9 @@ void show_free_areas(void) "\n", zone->name, K(zone_page_state(zone, NR_FREE_PAGES)), - K(zone->pages_min), - K(zone->pages_low), - K(zone->pages_high), + K(min_wmark_pages(zone)), + K(low_wmark_pages(zone)), + K(high_wmark_pages(zone)), K(zone_page_state(zone, NR_ACTIVE_ANON)), K(zone_page_state(zone, NR_INACTIVE_ANON)), K(zone_page_state(zone, NR_ACTIVE_FILE)), @@ -2702,8 +2703,8 @@ static inline unsigned long wait_table_bits(unsigned long size) /* * Mark a number of pageblocks as MIGRATE_RESERVE. The number - * of blocks reserved is based on zone->pages_min. The memory within the - * reserve will tend to store contiguous free pages. Setting min_free_kbytes + * of blocks reserved is based on min_wmark_pages(zone). The memory within + * the reserve will tend to store contiguous free pages. Setting min_free_kbytes * higher will lead to a bigger reserve which will get freed as contiguous * blocks as reclaim kicks in */ @@ -2716,7 +2717,7 @@ static void setup_zone_migrate_reserve(struct zone *zone) /* Get the start pfn, end pfn and the number of blocks to reserve */ start_pfn = zone->zone_start_pfn; end_pfn = start_pfn + zone->spanned_pages; - reserve = roundup(zone->pages_min, pageblock_nr_pages) >> + reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >> pageblock_order; for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { @@ -4319,8 +4320,8 @@ static void calculate_totalreserve_pages(void) max = zone->lowmem_reserve[j]; } - /* we treat pages_high as reserved pages. */ - max += zone->pages_high; + /* we treat the high watermark as reserved pages. */ + max += high_wmark_pages(zone); if (max > zone->present_pages) max = zone->present_pages; @@ -4400,7 +4401,7 @@ void setup_per_zone_pages_min(void) * need highmem pages, so cap pages_min to a small * value here. * - * The (pages_high-pages_low) and (pages_low-pages_min) + * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN) * deltas controls asynch page reclaim, and so should * not be capped for highmem. */ @@ -4411,17 +4412,17 @@ void setup_per_zone_pages_min(void) min_pages = SWAP_CLUSTER_MAX; if (min_pages > 128) min_pages = 128; - zone->pages_min = min_pages; + zone->watermark[WMARK_MIN] = min_pages; } else { /* * If it's a lowmem zone, reserve a number of pages * proportionate to the zone's size. */ - zone->pages_min = tmp; + zone->watermark[WMARK_MIN] = tmp; } - zone->pages_low = zone->pages_min + (tmp >> 2); - zone->pages_high = zone->pages_min + (tmp >> 1); + zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2); + zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1); setup_zone_migrate_reserve(zone); spin_unlock_irqrestore(&zone->lock, flags); } @@ -4566,7 +4567,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, * whenever sysctl_lowmem_reserve_ratio changes. * * The reserve ratio obviously has absolutely no relation with the - * pages_min watermarks. The lowmem reserve ratio can only make sense + * minimum watermarks. The lowmem reserve ratio can only make sense * if in function of the boot time zone sizes. */ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write, diff --git a/mm/vmscan.c b/mm/vmscan.c index a6b7d14812e6..e5245d051647 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1401,7 +1401,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, free = zone_page_state(zone, NR_FREE_PAGES); /* If we have very few page cache pages, force-scan anon pages. */ - if (unlikely(file + free <= zone->pages_high)) { + if (unlikely(file + free <= high_wmark_pages(zone))) { percent[0] = 100; percent[1] = 0; return; @@ -1533,11 +1533,13 @@ static void shrink_zone(int priority, struct zone *zone, * try to reclaim pages from zones which will satisfy the caller's allocation * request. * - * We reclaim from a zone even if that zone is over pages_high. Because: + * We reclaim from a zone even if that zone is over high_wmark_pages(zone). + * Because: * a) The caller may be trying to free *extra* pages to satisfy a higher-order * allocation or - * b) The zones may be over pages_high but they must go *over* pages_high to - * satisfy the `incremental min' zone defense algorithm. + * b) The target zone may be at high_wmark_pages(zone) but the lower zones + * must go *over* high_wmark_pages(zone) to satisfy the `incremental min' + * zone defense algorithm. * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. @@ -1743,7 +1745,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, /* * For kswapd, balance_pgdat() will work across all this node's zones until - * they are all at pages_high. + * they are all at high_wmark_pages(zone). * * Returns the number of pages which were actually freed. * @@ -1756,11 +1758,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, * the zone for when the problem goes away. * * kswapd scans the zones in the highmem->normal->dma direction. It skips - * zones which have free_pages > pages_high, but once a zone is found to have - * free_pages <= pages_high, we scan that zone and the lower zones regardless - * of the number of free pages in the lower zones. This interoperates with - * the page allocator fallback scheme to ensure that aging of pages is balanced - * across the zones. + * zones which have free_pages > high_wmark_pages(zone), but once a zone is + * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the + * lower zones regardless of the number of free pages in the lower zones. This + * interoperates with the page allocator fallback scheme to ensure that aging + * of pages is balanced across the zones. */ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) { @@ -1781,7 +1783,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) }; /* * temp_priority is used to remember the scanning priority at which - * this zone was successfully refilled to free_pages == pages_high. + * this zone was successfully refilled to + * free_pages == high_wmark_pages(zone). */ int temp_priority[MAX_NR_ZONES]; @@ -1826,8 +1829,8 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); - if (!zone_watermark_ok(zone, order, zone->pages_high, - 0, 0)) { + if (!zone_watermark_ok(zone, order, + high_wmark_pages(zone), 0, 0)) { end_zone = i; break; } @@ -1861,8 +1864,8 @@ loop_again: priority != DEF_PRIORITY) continue; - if (!zone_watermark_ok(zone, order, zone->pages_high, - end_zone, 0)) + if (!zone_watermark_ok(zone, order, + high_wmark_pages(zone), end_zone, 0)) all_zones_ok = 0; temp_priority[i] = priority; sc.nr_scanned = 0; @@ -1871,8 +1874,8 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. */ - if (!zone_watermark_ok(zone, order, 8*zone->pages_high, - end_zone, 0)) + if (!zone_watermark_ok(zone, order, + 8*high_wmark_pages(zone), end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, @@ -2038,7 +2041,7 @@ void wakeup_kswapd(struct zone *zone, int order) return; pgdat = zone->zone_pgdat; - if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0)) + if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) return; if (pgdat->kswapd_max_order < order) pgdat->kswapd_max_order = order; diff --git a/mm/vmstat.c b/mm/vmstat.c index 74d66dba0cbe..415110772c73 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -714,9 +714,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n spanned %lu" "\n present %lu", zone_page_state(zone, NR_FREE_PAGES), - zone->pages_min, - zone->pages_low, - zone->pages_high, + min_wmark_pages(zone), + low_wmark_pages(zone), + high_wmark_pages(zone), zone->pages_scanned, zone->lru[LRU_ACTIVE_ANON].nr_scan, zone->lru[LRU_INACTIVE_ANON].nr_scan, -- cgit v1.2.3-59-g8ed1b From f2260e6b1f4eba0f5b5906795117791b5c660154 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:13 -0700 Subject: page allocator: update NR_FREE_PAGES only as necessary When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page freeing, it's updated once per page. This retouches cache lines more than necessary. Update the counters one per per-cpu bulk free. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Reviewed-by: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index abe26003124d..d56e377ad085 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -456,7 +456,6 @@ static inline void __free_one_page(struct page *page, int migratetype) { unsigned long page_idx; - int order_size = 1 << order; if (unlikely(PageCompound(page))) if (unlikely(destroy_compound_page(page, order))) @@ -466,10 +465,9 @@ static inline void __free_one_page(struct page *page, page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); - VM_BUG_ON(page_idx & (order_size - 1)); + VM_BUG_ON(page_idx & ((1 << order) - 1)); VM_BUG_ON(bad_range(zone, page)); - __mod_zone_page_state(zone, NR_FREE_PAGES, order_size); while (order < MAX_ORDER-1) { unsigned long combined_idx; struct page *buddy; @@ -524,6 +522,8 @@ static void free_pages_bulk(struct zone *zone, int count, spin_lock(&zone->lock); zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); zone->pages_scanned = 0; + + __mod_zone_page_state(zone, NR_FREE_PAGES, count << order); while (count--) { struct page *page; @@ -542,6 +542,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_lock(&zone->lock); zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); zone->pages_scanned = 0; + + __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order); __free_one_page(page, zone, order, migratetype); spin_unlock(&zone->lock); } @@ -686,7 +688,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, list_del(&page->lru); rmv_page_order(page); area->nr_free--; - __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order)); expand(zone, page, order, current_order, area, migratetype); return page; } @@ -826,8 +827,6 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) /* Remove the page from the freelists */ list_del(&page->lru); rmv_page_order(page); - __mod_zone_page_state(zone, NR_FREE_PAGES, - -(1UL << order)); if (current_order == pageblock_order) set_pageblock_migratetype(page, @@ -900,6 +899,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, set_page_private(page, migratetype); list = &page->lru; } + __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); spin_unlock(&zone->lock); return i; } @@ -1129,6 +1129,7 @@ again: } else { spin_lock_irqsave(&zone->lock, flags); page = __rmqueue(zone, order, migratetype); + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order)); spin_unlock(&zone->lock); if (!page) goto failed; -- cgit v1.2.3-59-g8ed1b From 974709bdb2a34db378fc84140220f363f558d0d6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:14 -0700 Subject: page allocator: get the pageblock migratetype without disabling interrupts Local interrupts are disabled when freeing pages to the PCP list. Part of that free checks what the migratetype of the pageblock the page is in but it checks this with interrupts disabled and interupts should never be disabled longer than necessary. This patch checks the pagetype with interrupts enabled with the impact that it is possible a page is freed to the wrong list when a pageblock changes type. As that block is now already considered mixed from an anti-fragmentation perspective, it's not of vital importance. Signed-off-by: Mel Gorman Cc: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d56e377ad085..e60e41474332 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1030,6 +1030,7 @@ static void free_hot_cold_page(struct page *page, int cold) kernel_map_pages(page, 1, 0); pcp = &zone_pcp(zone, get_cpu())->pcp; + set_page_private(page, get_pageblock_migratetype(page)); local_irq_save(flags); if (unlikely(clearMlocked)) free_page_mlock(page); @@ -1039,7 +1040,6 @@ static void free_hot_cold_page(struct page *page, int cold) list_add_tail(&page->lru, &pcp->list); else list_add(&page->lru, &pcp->list); - set_page_private(page, get_pageblock_migratetype(page)); pcp->count++; if (pcp->count >= pcp->high) { free_pages_bulk(zone, pcp->batch, &pcp->list, 0); -- cgit v1.2.3-59-g8ed1b From 62bc62a873116805774ffd37d7f86aa4faa832b1 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Tue, 16 Jun 2009 15:32:15 -0700 Subject: page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths num_online_nodes() is called in a number of places but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function and touches a number of cache lines. This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes(). [rientjes@google.com: do not override definition of node_set_online() with macro] Signed-off-by: Christoph Lameter Signed-off-by: Mel Gorman Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/nodemask.h | 19 ++++++++++++++++--- mm/hugetlb.c | 4 ++-- mm/page_alloc.c | 10 ++++++---- mm/slub.c | 2 +- net/sunrpc/svc.c | 2 +- 5 files changed, 26 insertions(+), 11 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 848025cd7087..829b94b156f2 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state) #define next_online_node(nid) next_node((nid), node_states[N_ONLINE]) extern int nr_node_ids; +extern int nr_online_nodes; + +static inline void node_set_online(int nid) +{ + node_set_state(nid, N_ONLINE); + nr_online_nodes = num_node_state(N_ONLINE); +} + +static inline void node_set_offline(int nid) +{ + node_clear_state(nid, N_ONLINE); + nr_online_nodes = num_node_state(N_ONLINE); +} #else static inline int node_state(int node, enum node_states state) @@ -434,7 +447,10 @@ static inline int num_node_state(enum node_states state) #define first_online_node 0 #define next_online_node(nid) (MAX_NUMNODES) #define nr_node_ids 1 +#define nr_online_nodes 1 +#define node_set_online(node) node_set_state((node), N_ONLINE) +#define node_set_offline(node) node_clear_state((node), N_ONLINE) #endif #define node_online_map node_states[N_ONLINE] @@ -454,9 +470,6 @@ static inline int num_node_state(enum node_states state) #define node_online(node) node_state((node), N_ONLINE) #define node_possible(node) node_state((node), N_POSSIBLE) -#define node_set_online(node) node_set_state((node), N_ONLINE) -#define node_set_offline(node) node_clear_state((node), N_ONLINE) - #define for_each_node(node) for_each_node_state(node, N_POSSIBLE) #define for_each_online_node(node) for_each_node_state(node, N_ONLINE) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2f8241f300f5..7b9b6015b2ec 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -875,7 +875,7 @@ static void return_unused_surplus_pages(struct hstate *h, * can no longer free unreserved surplus pages. This occurs when * the nodes with surplus pages have no free pages. */ - unsigned long remaining_iterations = num_online_nodes(); + unsigned long remaining_iterations = nr_online_nodes; /* Uncommit the reservation */ h->resv_huge_pages -= unused_resv_pages; @@ -904,7 +904,7 @@ static void return_unused_surplus_pages(struct hstate *h, h->surplus_huge_pages--; h->surplus_huge_pages_node[nid]--; nr_pages--; - remaining_iterations = num_online_nodes(); + remaining_iterations = nr_online_nodes; } } } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e60e41474332..0c9f406e3c44 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -161,7 +161,9 @@ static unsigned long __meminitdata dma_reserve; #if MAX_NUMNODES > 1 int nr_node_ids __read_mostly = MAX_NUMNODES; +int nr_online_nodes __read_mostly = 1; EXPORT_SYMBOL(nr_node_ids); +EXPORT_SYMBOL(nr_online_nodes); #endif int page_group_by_mobility_disabled __read_mostly; @@ -1466,7 +1468,7 @@ this_zone_full: if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); try_next_zone: - if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) { + if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) { /* * we do zlc_setup after the first zone is tried but only * if there are multiple nodes make it worthwhile @@ -2265,7 +2267,7 @@ int numa_zonelist_order_handler(ctl_table *table, int write, } -#define MAX_NODE_LOAD (num_online_nodes()) +#define MAX_NODE_LOAD (nr_online_nodes) static int node_load[MAX_NUMNODES]; /** @@ -2474,7 +2476,7 @@ static void build_zonelists(pg_data_t *pgdat) /* NUMA-aware ordering of nodes */ local_node = pgdat->node_id; - load = num_online_nodes(); + load = nr_online_nodes; prev_node = local_node; nodes_clear(used_mask); @@ -2625,7 +2627,7 @@ void build_all_zonelists(void) printk("Built %i zonelists in %s order, mobility grouping %s. " "Total pages: %ld\n", - num_online_nodes(), + nr_online_nodes, zonelist_order_name[current_zonelist_order], page_group_by_mobility_disabled ? "off" : "on", vm_total_pages); diff --git a/mm/slub.c b/mm/slub.c index 30354bfeb43d..1c950775d6be 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3737,7 +3737,7 @@ static int list_locations(struct kmem_cache *s, char *buf, to_cpumask(l->cpus)); } - if (num_online_nodes() > 1 && !nodes_empty(l->nodes) && + if (nr_online_nodes > 1 && !nodes_empty(l->nodes) && len < PAGE_SIZE - 60) { len += sprintf(buf + len, " nodes="); len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50, diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index 8847add6ca16..5ed8931dfe98 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -124,7 +124,7 @@ svc_pool_map_choose_mode(void) { unsigned int node; - if (num_online_nodes() > 1) { + if (nr_online_nodes > 1) { /* * Actually have multiple NUMA nodes, * so split pools on NUMA node boundaries -- cgit v1.2.3-59-g8ed1b From 092cead6175bb1b3d3078a34ba71c939d526c70b Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 16 Jun 2009 15:32:17 -0700 Subject: page allocator: move free_page_mlock() to page_alloc.c Currently, free_page_mlock() is only called from page_alloc.c. Thus, we can move it to page_alloc.c. Cc: Lee Schermerhorn Cc: Mel Gorman Cc: Christoph Lameter Cc: Pekka Enberg Cc: Dave Hansen Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 13 ------------- mm/page_alloc.c | 16 ++++++++++++++++ 2 files changed, 16 insertions(+), 13 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/internal.h b/mm/internal.h index 58ec1bc262c3..4b1672a8cf76 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -150,18 +150,6 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page) } } -/* - * free_page_mlock() -- clean up attempts to free and mlocked() page. - * Page should not be on lru, so no need to fix that up. - * free_pages_check() will verify... - */ -static inline void free_page_mlock(struct page *page) -{ - __ClearPageMlocked(page); - __dec_zone_page_state(page, NR_MLOCK); - __count_vm_event(UNEVICTABLE_MLOCKFREED); -} - #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p) { @@ -170,7 +158,6 @@ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p) static inline void clear_page_mlock(struct page *page) { } static inline void mlock_vma_page(struct page *page) { } static inline void mlock_migrate_page(struct page *new, struct page *old) { } -static inline void free_page_mlock(struct page *page) { } #endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0c9f406e3c44..5dac5d8cb148 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -493,6 +493,22 @@ static inline void __free_one_page(struct page *page, zone->free_area[order].nr_free++; } +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT +/* + * free_page_mlock() -- clean up attempts to free and mlocked() page. + * Page should not be on lru, so no need to fix that up. + * free_pages_check() will verify... + */ +static inline void free_page_mlock(struct page *page) +{ + __ClearPageMlocked(page); + __dec_zone_page_state(page, NR_MLOCK); + __count_vm_event(UNEVICTABLE_MLOCKFREED); +} +#else +static void free_page_mlock(struct page *page) { } +#endif + static inline int free_pages_check(struct page *page) { if (unlikely(page_mapcount(page) | -- cgit v1.2.3-59-g8ed1b From 72807a74c0172376bba6b5b27702c9f702b526e9 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:18 -0700 Subject: page allocator: sanity check order in the page allocator slow path Callers may speculatively call different allocators in order of preference trying to allocate a buffer of a given size. The order needed to allocate this may be larger than what the page allocator can normally handle. While the allocator mostly does the right thing, it should not direct reclaim or wakeup kswapd with a bogus order. This patch sanity checks the order in the slow path and returns NULL if it is too large. Signed-off-by: Mel Gorman Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5dac5d8cb148..85759cdd6973 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1446,9 +1446,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order, int zlc_active = 0; /* set if using zonelist_cache */ int did_zlc_setup = 0; /* just call zlc_setup() one time */ - if (WARN_ON_ONCE(order >= MAX_ORDER)) - return NULL; - classzone_idx = zone_idx(preferred_zone); zonelist_scan: /* @@ -1706,6 +1703,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned long did_some_progress; struct task_struct *p = current; + /* + * In the slowpath, we sanity check order to avoid ever trying to + * reclaim >= MAX_ORDER areas which will never succeed. Callers may + * be using allocators in order of preference for an area that is + * too large. + */ + if (WARN_ON_ONCE(order >= MAX_ORDER)) + return NULL; + /* * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and * __GFP_NOWARN set) should not cause reclaim since the subsystem -- cgit v1.2.3-59-g8ed1b From a1dd268cf6306565a31a48deff8bf4f6b4b105f7 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:19 -0700 Subject: mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated logic alloc_large_system_hash() has logic for freeing pages at the end of an excessively large power-of-two buffer that is a duplicate of what is in alloc_pages_exact(). This patch converts alloc_large_system_hash() to use alloc_pages_exact(). Signed-off-by: Mel Gorman Acked-by: Hugh Dickins Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 21 ++++----------------- 1 file changed, 4 insertions(+), 17 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 85759cdd6973..8ca06d87dc1f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4699,26 +4699,13 @@ void *__init alloc_large_system_hash(const char *tablename, else if (hashdist) table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); else { - unsigned long order = get_order(size); - - if (order < MAX_ORDER) - table = (void *)__get_free_pages(GFP_ATOMIC, - order); /* * If bucketsize is not a power-of-two, we may free - * some pages at the end of hash table. + * some pages at the end of hash table which + * alloc_pages_exact() automatically does */ - if (table) { - unsigned long alloc_end = (unsigned long)table + - (PAGE_SIZE << order); - unsigned long used = (unsigned long)table + - PAGE_ALIGN(size); - split_page(virt_to_page(table), order); - while (used < alloc_end) { - free_page(used); - used += PAGE_SIZE; - } - } + if (get_order(size) < MAX_ORDER) + table = alloc_pages_exact(size, GFP_ATOMIC); } } while (!table && size > PAGE_SIZE && --log2qty); -- cgit v1.2.3-59-g8ed1b From 20a0307c0396c2edb651401d2f2db193dda2f3c9 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 16 Jun 2009 15:32:22 -0700 Subject: mm: introduce PageHuge() for testing huge/gigantic pages A series of patches to enhance the /proc/pagemap interface and to add a userspace executable which can be used to present the pagemap data. Export 10 more flags to end users (and more for kernel developers): 11. KPF_MMAP (pseudo flag) memory mapped page 12. KPF_ANON (pseudo flag) memory mapped page (anonymous) 13. KPF_SWAPCACHE page is in swap cache 14. KPF_SWAPBACKED page is swap/RAM backed 15. KPF_COMPOUND_HEAD (*) 16. KPF_COMPOUND_TAIL (*) 17. KPF_HUGE hugeTLB pages 18. KPF_UNEVICTABLE page is in the unevictable LRU list 19. KPF_HWPOISON hardware detected corruption 20. KPF_NOPAGE (pseudo flag) no page frame at the address (*) For compound pages, exporting _both_ head/tail info enables users to tell where a compound page starts/ends, and its order. a simple demo of the page-types tool # ./page-types -h page-types [options] -r|--raw Raw mode, for kernel developers -a|--addr addr-spec Walk a range of pages -b|--bits bits-spec Walk pages with specified bits -l|--list Show page details in ranges -L|--list-each Show page details one by one -N|--no-summary Don't show summay info -h|--help Show this usage message addr-spec: N one page at offset N (unit: pages) N+M pages range from N to N+M-1 N,M pages range from N to M-1 N, pages range from N to end ,M pages range from 0 to M bits-spec: bit1,bit2 (flags & (bit1|bit2)) != 0 bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1 bit1,~bit2 (flags & (bit1|bit2)) == bit1 =bit1,bit2 flags == (bit1|bit2) bit-names: locked error referenced uptodate dirty lru active slab writeback reclaim buddy mmap anonymous swapcache swapbacked compound_head compound_tail huge unevictable hwpoison nopage reserved(r) mlocked(r) mappedtodisk(r) private(r) private_2(r) owner_private(r) arch(r) uncached(r) readahead(o) slob_free(o) slub_frozen(o) slub_debug(o) (r) raw mode bits (o) overloaded bits # ./page-types flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 487369 1903 _________________________________ 0x0000000000000014 5 0 __R_D____________________________ referenced,dirty 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000000000024 34 0 __R__l___________________________ referenced,lru 0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead 0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x0000000000000040 8344 32 ______A__________________________ active 0x0000000000000060 1 0 _____lA__________________________ lru,active 0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 503 1 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types -r flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 468002 1828 _________________________________ 0x0000000100000000 19102 74 _____________________r___________ reserved 0x0000000000008000 41 0 _______________H_________________ compound_head 0x0000000000010000 188 0 ________________T________________ compound_tail 0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head 0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail 0x0000000000000020 1 0 _____l___________________________ lru 0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private 0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru 0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead 0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk 0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead 0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru 0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead 0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk 0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private 0x0000000800000040 8124 31 ______A_________________P________ active,private 0x0000000000000040 219 0 ______A__________________________ active 0x0000000800000060 1 0 _____lA_________________P________ lru,active,private 0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active 0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead 0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk 0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private 0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active 0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead 0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk 0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private 0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private 0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private 0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked 0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked 0x0000000000000400 538 2 __________B______________________ buddy 0x0000000000000804 1 0 __R________M_____________________ referenced,mmap 0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap 0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead 0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap 0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead 0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap 0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead 0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap 0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead 0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked 0x0000000000001000 492 1 ____________a____________________ anonymous 0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked 0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked 0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked 0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 513968 2007 # ./page-types --raw --list --no-summary --bits reserved offset count flags 0 15 _____________________r___________ 31 4 _____________________r___________ 159 97 _____________________r___________ 4096 2067 _____________________r___________ 6752 2390 _____________________r___________ 9355 3 _____________________r___________ 9728 14526 _____________________r___________ This patch: Introduce PageHuge(), which identifies huge/gigantic pages by their dedicated compound destructor functions. Also move prep_compound_gigantic_page() to hugetlb.c and make __free_pages_ok() non-static. Signed-off-by: Wu Fengguang Cc: KOSAKI Motohiro Cc: Andi Kleen Cc: Matt Mackall Cc: Alexey Dobriyan Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/page.c | 1 + include/linux/hugetlb.h | 7 ++++ mm/hugetlb.c | 98 +++++++++++++++++++++++++++++++------------------ mm/internal.h | 5 +-- mm/page_alloc.c | 17 --------- 5 files changed, 73 insertions(+), 55 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/fs/proc/page.c b/fs/proc/page.c index e9983837d08d..38dd88b7ce84 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include "internal.h" diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 03be7f29ca01..a05a5ef33391 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -11,6 +11,8 @@ struct ctl_table; +int PageHuge(struct page *page); + static inline int is_vm_hugetlb_page(struct vm_area_struct *vma) { return vma->vm_flags & VM_HUGETLB; @@ -61,6 +63,11 @@ void hugetlb_change_protection(struct vm_area_struct *vma, #else /* !CONFIG_HUGETLB_PAGE */ +static inline int PageHuge(struct page *page) +{ + return 0; +} + static inline int is_vm_hugetlb_page(struct vm_area_struct *vma) { return 0; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7b9b6015b2ec..a56e6f3ce979 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -578,41 +578,6 @@ static void free_huge_page(struct page *page) hugetlb_put_quota(mapping, 1); } -/* - * Increment or decrement surplus_huge_pages. Keep node-specific counters - * balanced by operating on them in a round-robin fashion. - * Returns 1 if an adjustment was made. - */ -static int adjust_pool_surplus(struct hstate *h, int delta) -{ - static int prev_nid; - int nid = prev_nid; - int ret = 0; - - VM_BUG_ON(delta != -1 && delta != 1); - do { - nid = next_node(nid, node_online_map); - if (nid == MAX_NUMNODES) - nid = first_node(node_online_map); - - /* To shrink on this node, there must be a surplus page */ - if (delta < 0 && !h->surplus_huge_pages_node[nid]) - continue; - /* Surplus cannot exceed the total number of pages */ - if (delta > 0 && h->surplus_huge_pages_node[nid] >= - h->nr_huge_pages_node[nid]) - continue; - - h->surplus_huge_pages += delta; - h->surplus_huge_pages_node[nid] += delta; - ret = 1; - break; - } while (nid != prev_nid); - - prev_nid = nid; - return ret; -} - static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) { set_compound_page_dtor(page, free_huge_page); @@ -623,6 +588,34 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) put_page(page); /* free it into the hugepage allocator */ } +static void prep_compound_gigantic_page(struct page *page, unsigned long order) +{ + int i; + int nr_pages = 1 << order; + struct page *p = page + 1; + + /* we rely on prep_new_huge_page to set the destructor */ + set_compound_order(page, order); + __SetPageHead(page); + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { + __SetPageTail(p); + p->first_page = page; + } +} + +int PageHuge(struct page *page) +{ + compound_page_dtor *dtor; + + if (!PageCompound(page)) + return 0; + + page = compound_head(page); + dtor = get_compound_page_dtor(page); + + return dtor == free_huge_page; +} + static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid) { struct page *page; @@ -1140,6 +1133,41 @@ static inline void try_to_free_low(struct hstate *h, unsigned long count) } #endif +/* + * Increment or decrement surplus_huge_pages. Keep node-specific counters + * balanced by operating on them in a round-robin fashion. + * Returns 1 if an adjustment was made. + */ +static int adjust_pool_surplus(struct hstate *h, int delta) +{ + static int prev_nid; + int nid = prev_nid; + int ret = 0; + + VM_BUG_ON(delta != -1 && delta != 1); + do { + nid = next_node(nid, node_online_map); + if (nid == MAX_NUMNODES) + nid = first_node(node_online_map); + + /* To shrink on this node, there must be a surplus page */ + if (delta < 0 && !h->surplus_huge_pages_node[nid]) + continue; + /* Surplus cannot exceed the total number of pages */ + if (delta > 0 && h->surplus_huge_pages_node[nid] >= + h->nr_huge_pages_node[nid]) + continue; + + h->surplus_huge_pages += delta; + h->surplus_huge_pages_node[nid] += delta; + ret = 1; + break; + } while (nid != prev_nid); + + prev_nid = nid; + return ret; +} + #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages) static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count) { diff --git a/mm/internal.h b/mm/internal.h index 4b1672a8cf76..b4ac332e8072 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -16,9 +16,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); -extern void prep_compound_page(struct page *page, unsigned long order); -extern void prep_compound_gigantic_page(struct page *page, unsigned long order); - static inline void set_page_count(struct page *page, int v) { atomic_set(&page->_count, v); @@ -51,6 +48,8 @@ extern void putback_lru_page(struct page *page); */ extern unsigned long highest_memmap_pfn; extern void __free_pages_bootmem(struct page *page, unsigned int order); +extern void prep_compound_page(struct page *page, unsigned long order); + /* * function for dealing with page's order in buddy system. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8ca06d87dc1f..131655cdb6b2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -300,23 +300,6 @@ void prep_compound_page(struct page *page, unsigned long order) } } -#ifdef CONFIG_HUGETLBFS -void prep_compound_gigantic_page(struct page *page, unsigned long order) -{ - int i; - int nr_pages = 1 << order; - struct page *p = page + 1; - - set_compound_page_dtor(page, free_compound_page); - set_compound_order(page, order); - __SetPageHead(page); - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { - __SetPageTail(p); - p->first_page = page; - } -} -#endif - static int destroy_compound_page(struct page *page, unsigned long order) { int i; -- cgit v1.2.3-59-g8ed1b From 6e08a369ee10b361ac1cdcdf4fabd420fd08beb3 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 16 Jun 2009 15:32:29 -0700 Subject: vmscan: cleanup the scan batching code The vmscan batching logic is twisting. Move it into a standalone function nr_scan_try_batch() and document it. No behavior change. Signed-off-by: Wu Fengguang Acked-by: Rik van Riel Cc: Nick Piggin Cc: Christoph Lameter Acked-by: Peter Zijlstra Acked-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 4 ++-- mm/page_alloc.c | 2 +- mm/vmscan.c | 39 ++++++++++++++++++++++++++++----------- mm/vmstat.c | 8 ++++---- 4 files changed, 35 insertions(+), 18 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dd8487f0442f..db976b9f8791 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -334,9 +334,9 @@ struct zone { /* Fields commonly accessed by the page reclaim scanner */ spinlock_t lru_lock; - struct { + struct zone_lru { struct list_head list; - unsigned long nr_scan; + unsigned long nr_saved_scan; /* accumulated for batching */ } lru[NR_LRU_LISTS]; struct zone_reclaim_stat reclaim_stat; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 131655cdb6b2..e5b8f628d166 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3657,7 +3657,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone_pcp_init(zone); for_each_lru(l) { INIT_LIST_HEAD(&zone->lru[l].list); - zone->lru[l].nr_scan = 0; + zone->lru[l].nr_saved_scan = 0; } zone->reclaim_stat.recent_rotated[0] = 0; zone->reclaim_stat.recent_rotated[1] = 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 9673437a5457..d4da097533ce 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1492,6 +1492,26 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, percent[1] = 100 - percent[0]; } +/* + * Smallish @nr_to_scan's are deposited in @nr_saved_scan, + * until we collected @swap_cluster_max pages to scan. + */ +static unsigned long nr_scan_try_batch(unsigned long nr_to_scan, + unsigned long *nr_saved_scan, + unsigned long swap_cluster_max) +{ + unsigned long nr; + + *nr_saved_scan += nr_to_scan; + nr = *nr_saved_scan; + + if (nr >= swap_cluster_max) + *nr_saved_scan = 0; + else + nr = 0; + + return nr; +} /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. @@ -1517,14 +1537,11 @@ static void shrink_zone(int priority, struct zone *zone, scan >>= priority; scan = (scan * percent[file]) / 100; } - if (scanning_global_lru(sc)) { - zone->lru[l].nr_scan += scan; - nr[l] = zone->lru[l].nr_scan; - if (nr[l] >= swap_cluster_max) - zone->lru[l].nr_scan = 0; - else - nr[l] = 0; - } else + if (scanning_global_lru(sc)) + nr[l] = nr_scan_try_batch(scan, + &zone->lru[l].nr_saved_scan, + swap_cluster_max); + else nr[l] = scan; } @@ -2124,11 +2141,11 @@ static void shrink_all_zones(unsigned long nr_pages, int prio, l == LRU_ACTIVE_FILE)) continue; - zone->lru[l].nr_scan += (lru_pages >> prio) + 1; - if (zone->lru[l].nr_scan >= nr_pages || pass > 3) { + zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1; + if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) { unsigned long nr_to_scan; - zone->lru[l].nr_scan = 0; + zone->lru[l].nr_saved_scan = 0; nr_to_scan = min(nr_pages, lru_pages); nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc, prio); diff --git a/mm/vmstat.c b/mm/vmstat.c index 415110772c73..84c055556911 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -718,10 +718,10 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, low_wmark_pages(zone), high_wmark_pages(zone), zone->pages_scanned, - zone->lru[LRU_ACTIVE_ANON].nr_scan, - zone->lru[LRU_INACTIVE_ANON].nr_scan, - zone->lru[LRU_ACTIVE_FILE].nr_scan, - zone->lru[LRU_INACTIVE_FILE].nr_scan, + zone->lru[LRU_ACTIVE_ANON].nr_saved_scan, + zone->lru[LRU_INACTIVE_ANON].nr_saved_scan, + zone->lru[LRU_ACTIVE_FILE].nr_saved_scan, + zone->lru[LRU_INACTIVE_FILE].nr_saved_scan, zone->spanned_pages, zone->present_pages); -- cgit v1.2.3-59-g8ed1b From 5c87eada68fe5d29a5f67528f81b6e45124f579b Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Tue, 16 Jun 2009 15:32:32 -0700 Subject: mm: setup_per_zone_inactive_ratio - do not call for int_sqrt if not needed int_sqrt() returns 0 if its argument is zero so call it if only needed. Signed-off-by: Cyrill Gorcunov Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5b8f628d166..db8c46ffa9f5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4471,8 +4471,9 @@ static void setup_per_zone_inactive_ratio(void) /* Zone size in gigabytes */ gb = zone->present_pages >> (30 - PAGE_SHIFT); - ratio = int_sqrt(10 * gb); - if (!ratio) + if (gb) + ratio = int_sqrt(10 * gb); + else ratio = 1; zone->inactive_ratio = ratio; -- cgit v1.2.3-59-g8ed1b From e9bb35df6f813ca46f8e6273add657643c7df73f Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Tue, 16 Jun 2009 15:32:32 -0700 Subject: mm: setup_per_zone_inactive_ratio - fix comment and make it __init The caller of setup_per_zone_inactive_ratio is an __init function. There is no need to keep the callee after it completed as well. Also fix a comment. Acked-by: David Rientjes Signed-off-by: Cyrill Gorcunov Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index db8c46ffa9f5..076463cb21ba 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4440,8 +4440,6 @@ void setup_per_zone_pages_min(void) } /** - * setup_per_zone_inactive_ratio - called when min_free_kbytes changes. - * * The inactive anon list should be small enough that the VM never has to * do too much work, but large enough that each inactive page has a chance * to be referenced again before it is swapped out. @@ -4462,7 +4460,7 @@ void setup_per_zone_pages_min(void) * 1TB 101 10GB * 10TB 320 32GB */ -static void setup_per_zone_inactive_ratio(void) +static void __init setup_per_zone_inactive_ratio(void) { struct zone *zone; -- cgit v1.2.3-59-g8ed1b From dab48dab37d2770824420d1e01730a107fade1aa Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 16 Jun 2009 15:32:37 -0700 Subject: page-allocator: warn if __GFP_NOFAIL is used for a large allocation __GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either). Attempting to use __GFP_NOFAIL for a higher-order allocation is even worse, so add a once-off runtime check for this to slap people around for even thinking about trying it. Cc: David Rientjes Acked-by: Mel Gorman Acked-by: Peter Zijlstra Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 076463cb21ba..61290ea721c8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1128,6 +1128,19 @@ again: list_del(&page->lru); pcp->count--; } else { + if (unlikely(gfp_flags & __GFP_NOFAIL)) { + /* + * __GFP_NOFAIL is not to be used in new code. + * + * All __GFP_NOFAIL callers should be fixed so that they + * properly detect and handle allocation failures. + * + * We most definitely don't want callers attempting to + * allocate greater than single-page units with + * __GFP_NOFAIL. + */ + WARN_ON_ONCE(order > 0); + } spin_lock_irqsave(&zone->lock, flags); page = __rmqueue(zone, order, migratetype); __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order)); -- cgit v1.2.3-59-g8ed1b From 7f33d49a2ed546e01f7b1d0607661810f2421859 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 16 Jun 2009 15:32:41 -0700 Subject: mm, PM/Freezer: Disable OOM killer when tasks are frozen Currently, the following scenario appears to be possible in theory: * Tasks are frozen for hibernation or suspend. * Free pages are almost exhausted. * Certain piece of code in the suspend code path attempts to allocate some memory using GFP_KERNEL and allocation order less than or equal to PAGE_ALLOC_COSTLY_ORDER. * __alloc_pages_internal() cannot find a free page so it invokes the OOM killer. * The OOM killer attempts to kill a task, but the task is frozen, so it doesn't die immediately. * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries to find a free page and invokes the OOM killer. * No progress can be made. Although it is now hard to trigger during hibernation due to the memory shrinking carried out by the hibernation code, it is theoretically possible to trigger during suspend after the memory shrinking has been removed from that code path. Moreover, since memory allocations are going to be used for the hibernation memory shrinking, it will be even more likely to happen during hibernation. To prevent it from happening, introduce the oom_killer_disabled switch that will cause __alloc_pages_internal() to fail in the situations in which the OOM killer would have been called and make the freezer set this switch after tasks have been successfully frozen. [akpm@linux-foundation.org: be nicer to the namespace] Signed-off-by: Rafael J. Wysocki Cc: Fengguang Wu Cc: David Rientjes Acked-by: Pavel Machek Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 12 ++++++++++++ kernel/power/process.c | 5 +++++ mm/page_alloc.c | 4 ++++ 3 files changed, 21 insertions(+) (limited to 'mm/page_alloc.c') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 4efa33088a82..06b7e8cc80ac 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -243,4 +243,16 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(void); void drain_local_pages(void *dummy); +extern bool oom_killer_disabled; + +static inline void oom_killer_disable(void) +{ + oom_killer_disabled = true; +} + +static inline void oom_killer_enable(void) +{ + oom_killer_disabled = false; +} + #endif /* __LINUX_GFP_H */ diff --git a/kernel/power/process.c b/kernel/power/process.c index ca634019497a..da2072d73811 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -117,9 +117,12 @@ int freeze_processes(void) if (error) goto Exit; printk("done."); + + oom_killer_disable(); Exit: BUG_ON(in_atomic()); printk("\n"); + return error; } @@ -145,6 +148,8 @@ static void thaw_tasks(bool nosig_only) void thaw_processes(void) { + oom_killer_enable(); + printk("Restarting tasks ... "); thaw_tasks(true); thaw_tasks(false); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 61290ea721c8..5b09488d0f55 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -178,6 +178,8 @@ static void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } +bool oom_killer_disabled __read_mostly; + #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -1769,6 +1771,8 @@ rebalance: */ if (!did_some_progress) { if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) { + if (oom_killer_disabled) + goto nopage; page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, -- cgit v1.2.3-59-g8ed1b From bc75d33f0fc1d56e734db1f56d3cfc8097b8e0cf Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 16 Jun 2009 15:32:48 -0700 Subject: page-allocator: clean up functions related to pages_min Change the names of two functions. It doesn't affect behavior. Presently, setup_per_zone_pages_min() changes low, high of zone as well as min. So a better name is setup_per_zone_wmarks(). That's because Mel changed zone->pages_[hig/low/min] to zone->watermark array in "page allocator: replace the watermark-related union in struct zone with a watermark[] array". * setup_per_zone_pages_min => setup_per_zone_wmarks Of course, we have to change init_per_zone_pages_min, too. There are not pages_min any more. * init_per_zone_pages_min => init_per_zone_wmark_min [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim Acked-by: Mel Gorman Cc: KOSAKI Motohiro Cc: Rik van Riel Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- mm/memory_hotplug.c | 2 +- mm/page_alloc.c | 17 +++++++++-------- 3 files changed, 11 insertions(+), 10 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mm.h b/include/linux/mm.h index c80dbd4a62c6..6e11cda1ba02 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1052,7 +1052,7 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn); extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); -extern void setup_per_zone_pages_min(void); +extern void setup_per_zone_wmarks(void); extern void mem_init(void); extern void __init mmap_init(void); extern void show_mem(void); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c083cf5fd6df..037291e15b27 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -422,7 +422,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages) zone->present_pages += onlined_pages; zone->zone_pgdat->node_present_pages += onlined_pages; - setup_per_zone_pages_min(); + setup_per_zone_wmarks(); if (onlined_pages) { kswapd_run(zone_to_nid(zone)); node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b09488d0f55..8629c84fd9ba 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4396,12 +4396,13 @@ static void setup_per_zone_lowmem_reserve(void) } /** - * setup_per_zone_pages_min - called when min_free_kbytes changes. + * setup_per_zone_wmarks - called when min_free_kbytes changes + * or when memory is hot-added * - * Ensures that the pages_{min,low,high} values for each zone are set correctly - * with respect to min_free_kbytes. + * Ensures that the watermark[min,low,high] values for each zone are set + * correctly with respect to min_free_kbytes. */ -void setup_per_zone_pages_min(void) +void setup_per_zone_wmarks(void) { unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); unsigned long lowmem_pages = 0; @@ -4519,7 +4520,7 @@ static void __init setup_per_zone_inactive_ratio(void) * 8192MB: 11584k * 16384MB: 16384k */ -static int __init init_per_zone_pages_min(void) +static int __init init_per_zone_wmark_min(void) { unsigned long lowmem_kbytes; @@ -4530,12 +4531,12 @@ static int __init init_per_zone_pages_min(void) min_free_kbytes = 128; if (min_free_kbytes > 65536) min_free_kbytes = 65536; - setup_per_zone_pages_min(); + setup_per_zone_wmarks(); setup_per_zone_lowmem_reserve(); setup_per_zone_inactive_ratio(); return 0; } -module_init(init_per_zone_pages_min) +module_init(init_per_zone_wmark_min) /* * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so @@ -4547,7 +4548,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write, { proc_dointvec(table, write, file, buffer, length, ppos); if (write) - setup_per_zone_pages_min(); + setup_per_zone_wmarks(); return 0; } -- cgit v1.2.3-59-g8ed1b From 96cb4df5ddf5e6d5785b5acd4003e3689b87f896 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 16 Jun 2009 15:32:49 -0700 Subject: page-allocator: add inactive ratio calculation function of each zone Factor the per-zone arithemetic inside setup_per_zone_inactive_ratio()'s loop into a a separate function, calculate_zone_inactive_ratio(). This function will be used in a later patch [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim Reviewed-by: KOSAKI Motohiro Cc: Rik van Riel Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 + mm/page_alloc.c | 28 ++++++++++++++++------------ 2 files changed, 17 insertions(+), 12 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/include/linux/mm.h b/include/linux/mm.h index 6e11cda1ba02..f0473a88ca2e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1053,6 +1053,7 @@ extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); extern void setup_per_zone_wmarks(void); +extern void calculate_zone_inactive_ratio(struct zone *zone); extern void mem_init(void); extern void __init mmap_init(void); extern void show_mem(void); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8629c84fd9ba..303607f1d323 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4478,22 +4478,26 @@ void setup_per_zone_wmarks(void) * 1TB 101 10GB * 10TB 320 32GB */ -static void __init setup_per_zone_inactive_ratio(void) +void calculate_zone_inactive_ratio(struct zone *zone) { - struct zone *zone; + unsigned int gb, ratio; - for_each_zone(zone) { - unsigned int gb, ratio; + /* Zone size in gigabytes */ + gb = zone->present_pages >> (30 - PAGE_SHIFT); + if (gb) + ratio = int_sqrt(10 * gb); + else + ratio = 1; - /* Zone size in gigabytes */ - gb = zone->present_pages >> (30 - PAGE_SHIFT); - if (gb) - ratio = int_sqrt(10 * gb); - else - ratio = 1; + zone->inactive_ratio = ratio; +} - zone->inactive_ratio = ratio; - } +static void __init setup_per_zone_inactive_ratio(void) +{ + struct zone *zone; + + for_each_zone(zone) + calculate_zone_inactive_ratio(zone); } /* -- cgit v1.2.3-59-g8ed1b From bce7394a3ef82b8477952fbab838e4a6e8cb47d2 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 16 Jun 2009 15:32:50 -0700 Subject: page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens Solve two problems. Whenever memory hotplug sucessfully happens, zone->present_pages have to be changed. 1) Now memory hotplug calls setup_per_zone_wmark_min only when online_pages called, not offline_pages. It breaks balance. 2) If zone->present_pages is changed, we also have to change zone->inactive_ratio. That's because inactive_ratio depends on zone->present_pages. Signed-off-by: Minchan Kim Acked-by: Yasunori Goto Cc: Rik van Riel Cc: KOSAKI Motohiro Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 4 ++++ mm/page_alloc.c | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 037291e15b27..e4412a676c88 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -423,6 +423,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages) zone->zone_pgdat->node_present_pages += onlined_pages; setup_per_zone_wmarks(); + calculate_zone_inactive_ratio(zone); if (onlined_pages) { kswapd_run(zone_to_nid(zone)); node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); @@ -832,6 +833,9 @@ repeat: totalram_pages -= offlined_pages; num_physpages -= offlined_pages; + setup_per_zone_wmarks(); + calculate_zone_inactive_ratio(zone); + vm_total_pages = nr_free_pagecache_pages(); writeback_set_ratelimit(); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 303607f1d323..00e293734fc9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4397,7 +4397,7 @@ static void setup_per_zone_lowmem_reserve(void) /** * setup_per_zone_wmarks - called when min_free_kbytes changes - * or when memory is hot-added + * or when memory is hot-{added|removed} * * Ensures that the watermark[min,low,high] values for each zone are set * correctly with respect to min_free_kbytes. -- cgit v1.2.3-59-g8ed1b From 6837765963f1723e80ca97b1fae660f3a60d77df Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 16 Jun 2009 15:32:51 -0700 Subject: mm: remove CONFIG_UNEVICTABLE_LRU config option Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this configurability is unnecessary. Signed-off-by: KOSAKI Motohiro Cc: Johannes Weiner Cc: Andi Kleen Acked-by: Minchan Kim Cc: David Woodhouse Cc: Matt Mackall Cc: Rik van Riel Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 4 ---- fs/proc/meminfo.c | 4 ---- fs/proc/page.c | 2 -- include/linux/mmzone.h | 13 ------------- include/linux/page-flags.h | 16 +--------------- include/linux/pagemap.h | 12 ------------ include/linux/rmap.h | 7 ------- include/linux/swap.h | 19 ------------------- include/linux/vmstat.h | 2 -- kernel/sysctl.c | 2 -- mm/Kconfig | 14 +------------- mm/internal.h | 6 ------ mm/mlock.c | 22 ---------------------- mm/page_alloc.c | 9 --------- mm/rmap.c | 3 +-- mm/vmscan.c | 17 ----------------- mm/vmstat.c | 4 ---- 17 files changed, 3 insertions(+), 153 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/drivers/base/node.c b/drivers/base/node.c index 40b809742a1c..91d4087b4039 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -72,10 +72,8 @@ static ssize_t node_read_meminfo(struct sys_device * dev, "Node %d Inactive(anon): %8lu kB\n" "Node %d Active(file): %8lu kB\n" "Node %d Inactive(file): %8lu kB\n" -#ifdef CONFIG_UNEVICTABLE_LRU "Node %d Unevictable: %8lu kB\n" "Node %d Mlocked: %8lu kB\n" -#endif #ifdef CONFIG_HIGHMEM "Node %d HighTotal: %8lu kB\n" "Node %d HighFree: %8lu kB\n" @@ -105,10 +103,8 @@ static ssize_t node_read_meminfo(struct sys_device * dev, nid, K(node_page_state(nid, NR_INACTIVE_ANON)), nid, K(node_page_state(nid, NR_ACTIVE_FILE)), nid, K(node_page_state(nid, NR_INACTIVE_FILE)), -#ifdef CONFIG_UNEVICTABLE_LRU nid, K(node_page_state(nid, NR_UNEVICTABLE)), nid, K(node_page_state(nid, NR_MLOCK)), -#endif #ifdef CONFIG_HIGHMEM nid, K(i.totalhigh), nid, K(i.freehigh), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index c6b0302af4c4..d5c410d47fae 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -64,10 +64,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) "Inactive(anon): %8lu kB\n" "Active(file): %8lu kB\n" "Inactive(file): %8lu kB\n" -#ifdef CONFIG_UNEVICTABLE_LRU "Unevictable: %8lu kB\n" "Mlocked: %8lu kB\n" -#endif #ifdef CONFIG_HIGHMEM "HighTotal: %8lu kB\n" "HighFree: %8lu kB\n" @@ -109,10 +107,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) K(pages[LRU_INACTIVE_ANON]), K(pages[LRU_ACTIVE_FILE]), K(pages[LRU_INACTIVE_FILE]), -#ifdef CONFIG_UNEVICTABLE_LRU K(pages[LRU_UNEVICTABLE]), K(global_page_state(NR_MLOCK)), -#endif #ifdef CONFIG_HIGHMEM K(i.totalhigh), K(i.freehigh), diff --git a/fs/proc/page.c b/fs/proc/page.c index 9d926bd279a4..2707c6c7a20f 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -172,10 +172,8 @@ static u64 get_uflags(struct page *page) u |= kpf_copy_bit(k, KPF_SWAPCACHE, PG_swapcache); u |= kpf_copy_bit(k, KPF_SWAPBACKED, PG_swapbacked); -#ifdef CONFIG_UNEVICTABLE_LRU u |= kpf_copy_bit(k, KPF_UNEVICTABLE, PG_unevictable); u |= kpf_copy_bit(k, KPF_MLOCKED, PG_mlocked); -#endif #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR u |= kpf_copy_bit(k, KPF_UNCACHED, PG_uncached); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index db976b9f8791..889598537370 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -83,13 +83,8 @@ enum zone_stat_item { NR_ACTIVE_ANON, /* " " " " " */ NR_INACTIVE_FILE, /* " " " " " */ NR_ACTIVE_FILE, /* " " " " " */ -#ifdef CONFIG_UNEVICTABLE_LRU NR_UNEVICTABLE, /* " " " " " */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ -#else - NR_UNEVICTABLE = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */ - NR_MLOCK = NR_ACTIVE_FILE, -#endif NR_ANON_PAGES, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. only modified from process context */ @@ -132,11 +127,7 @@ enum lru_list { LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE, LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE, LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, -#ifdef CONFIG_UNEVICTABLE_LRU LRU_UNEVICTABLE, -#else - LRU_UNEVICTABLE = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */ -#endif NR_LRU_LISTS }; @@ -156,11 +147,7 @@ static inline int is_active_lru(enum lru_list l) static inline int is_unevictable_lru(enum lru_list l) { -#ifdef CONFIG_UNEVICTABLE_LRU return (l == LRU_UNEVICTABLE); -#else - return 0; -#endif } enum zone_watermarks { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 62214c7d2d93..d6792f88a176 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -95,9 +95,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ -#ifdef CONFIG_UNEVICTABLE_LRU PG_unevictable, /* Page is "unevictable" */ -#endif #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT PG_mlocked, /* Page is vma mlocked */ #endif @@ -248,14 +246,8 @@ PAGEFLAG_FALSE(SwapCache) SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) #endif -#ifdef CONFIG_UNEVICTABLE_LRU PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) TESTCLEARFLAG(Unevictable, unevictable) -#else -PAGEFLAG_FALSE(Unevictable) TESTCLEARFLAG_FALSE(Unevictable) - SETPAGEFLAG_NOOP(Unevictable) CLEARPAGEFLAG_NOOP(Unevictable) - __CLEARPAGEFLAG_NOOP(Unevictable) -#endif #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT #define MLOCK_PAGES 1 @@ -382,12 +374,6 @@ static inline void __ClearPageTail(struct page *page) #endif /* !PAGEFLAGS_EXTENDED */ -#ifdef CONFIG_UNEVICTABLE_LRU -#define __PG_UNEVICTABLE (1 << PG_unevictable) -#else -#define __PG_UNEVICTABLE 0 -#endif - #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT #define __PG_MLOCKED (1 << PG_mlocked) #else @@ -403,7 +389,7 @@ static inline void __ClearPageTail(struct page *page) 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - __PG_UNEVICTABLE | __PG_MLOCKED) + 1 << PG_unevictable | __PG_MLOCKED) /* * Flags checked when a page is prepped for return by the page allocator. diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 34da5230faab..aec3252afcf5 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -22,9 +22,7 @@ enum mapping_flags { AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */ AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ -#ifdef CONFIG_UNEVICTABLE_LRU AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ -#endif }; static inline void mapping_set_error(struct address_space *mapping, int error) @@ -37,8 +35,6 @@ static inline void mapping_set_error(struct address_space *mapping, int error) } } -#ifdef CONFIG_UNEVICTABLE_LRU - static inline void mapping_set_unevictable(struct address_space *mapping) { set_bit(AS_UNEVICTABLE, &mapping->flags); @@ -55,14 +51,6 @@ static inline int mapping_unevictable(struct address_space *mapping) return test_bit(AS_UNEVICTABLE, &mapping->flags); return !!mapping; } -#else -static inline void mapping_set_unevictable(struct address_space *mapping) { } -static inline void mapping_clear_unevictable(struct address_space *mapping) { } -static inline int mapping_unevictable(struct address_space *mapping) -{ - return 0; -} -#endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) { diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b35bc0e19cd9..619379a1dd98 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -105,18 +105,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int page_mkclean(struct page *); -#ifdef CONFIG_UNEVICTABLE_LRU /* * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ int try_to_munlock(struct page *); -#else -static inline int try_to_munlock(struct page *page) -{ - return 0; /* a.k.a. SWAP_SUCCESS */ -} -#endif #else /* !CONFIG_MMU */ diff --git a/include/linux/swap.h b/include/linux/swap.h index d476aad3ff57..f30c06908f09 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -235,7 +235,6 @@ static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) } #endif -#ifdef CONFIG_UNEVICTABLE_LRU extern int page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); @@ -244,24 +243,6 @@ extern int scan_unevictable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); extern int scan_unevictable_register_node(struct node *node); extern void scan_unevictable_unregister_node(struct node *node); -#else -static inline int page_evictable(struct page *page, - struct vm_area_struct *vma) -{ - return 1; -} - -static inline void scan_mapping_unevictable_pages(struct address_space *mapping) -{ -} - -static inline int scan_unevictable_register_node(struct node *node) -{ - return 0; -} - -static inline void scan_unevictable_unregister_node(struct node *node) { } -#endif extern int kswapd_run(int nid); diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 524cd1b28ecb..ff4696c6dce3 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -41,7 +41,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_HUGETLB_PAGE HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL, #endif -#ifdef CONFIG_UNEVICTABLE_LRU UNEVICTABLE_PGCULLED, /* culled to noreclaim list */ UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */ UNEVICTABLE_PGRESCUED, /* rescued from noreclaim list */ @@ -50,7 +49,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, UNEVICTABLE_PGCLEARED, /* on COW, page truncate */ UNEVICTABLE_PGSTRANDED, /* unable to isolate on unlock */ UNEVICTABLE_MLOCKFREED, -#endif NR_VM_EVENT_ITEMS }; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 0e51a35a4486..2ccee08f92f1 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1325,7 +1325,6 @@ static struct ctl_table vm_table[] = { .extra2 = &one, }, #endif -#ifdef CONFIG_UNEVICTABLE_LRU { .ctl_name = CTL_UNNUMBERED, .procname = "scan_unevictable_pages", @@ -1334,7 +1333,6 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = &scan_unevictable_handler, }, -#endif /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt diff --git a/mm/Kconfig b/mm/Kconfig index 71830ba7b986..97d2c88b745e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -203,25 +203,13 @@ config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS -config UNEVICTABLE_LRU - bool "Add LRU list to track non-evictable pages" - default y - help - Keeps unevictable pages off of the active and inactive pageout - lists, so kswapd will not waste CPU time or have its balancing - algorithms thrown off by scanning these pages. Selecting this - will use one page flag and increase the code size a little, - say Y unless you know what you are doing. - - See Documentation/vm/unevictable-lru.txt for more information. - config HAVE_MLOCK bool default y if MMU=y config HAVE_MLOCKED_PAGE_BIT bool - default y if HAVE_MLOCK=y && UNEVICTABLE_LRU=y + default y if HAVE_MLOCK=y config MMU_NOTIFIER bool diff --git a/mm/internal.h b/mm/internal.h index b4ac332e8072..f02c7508068d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -73,7 +73,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) } #endif -#ifdef CONFIG_UNEVICTABLE_LRU /* * unevictable_migrate_page() called only from migrate_page_copy() to * migrate unevictable flag to new page. @@ -85,11 +84,6 @@ static inline void unevictable_migrate_page(struct page *new, struct page *old) if (TestClearPageUnevictable(old)) SetPageUnevictable(new); } -#else -static inline void unevictable_migrate_page(struct page *new, struct page *old) -{ -} -#endif #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT /* diff --git a/mm/mlock.c b/mm/mlock.c index ac130433c7d3..45eb650b9654 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -31,7 +31,6 @@ int can_do_mlock(void) } EXPORT_SYMBOL(can_do_mlock); -#ifdef CONFIG_UNEVICTABLE_LRU /* * Mlocked pages are marked with PageMlocked() flag for efficient testing * in vmscan and, possibly, the fault path; and to support semi-accurate @@ -261,27 +260,6 @@ static int __mlock_posix_error_return(long retval) return retval; } -#else /* CONFIG_UNEVICTABLE_LRU */ - -/* - * Just make pages present if VM_LOCKED. No-op if unlocking. - */ -static long __mlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, - int mlock) -{ - if (mlock && (vma->vm_flags & VM_LOCKED)) - return make_pages_present(start, end); - return 0; -} - -static inline int __mlock_posix_error_return(long retval) -{ - return 0; -} - -#endif /* CONFIG_UNEVICTABLE_LRU */ - /** * mlock_vma_pages_range() - mlock pages in specified vma range. * @vma - the vma containing the specfied address range diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 00e293734fc9..c95a77cd581b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2077,19 +2077,14 @@ void show_free_areas(void) printk("Active_anon:%lu active_file:%lu inactive_anon:%lu\n" " inactive_file:%lu" -//TODO: check/adjust line lengths -#ifdef CONFIG_UNEVICTABLE_LRU " unevictable:%lu" -#endif " dirty:%lu writeback:%lu unstable:%lu\n" " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n", global_page_state(NR_ACTIVE_ANON), global_page_state(NR_ACTIVE_FILE), global_page_state(NR_INACTIVE_ANON), global_page_state(NR_INACTIVE_FILE), -#ifdef CONFIG_UNEVICTABLE_LRU global_page_state(NR_UNEVICTABLE), -#endif global_page_state(NR_FILE_DIRTY), global_page_state(NR_WRITEBACK), global_page_state(NR_UNSTABLE_NFS), @@ -2113,9 +2108,7 @@ void show_free_areas(void) " inactive_anon:%lukB" " active_file:%lukB" " inactive_file:%lukB" -#ifdef CONFIG_UNEVICTABLE_LRU " unevictable:%lukB" -#endif " present:%lukB" " pages_scanned:%lu" " all_unreclaimable? %s" @@ -2129,9 +2122,7 @@ void show_free_areas(void) K(zone_page_state(zone, NR_INACTIVE_ANON)), K(zone_page_state(zone, NR_ACTIVE_FILE)), K(zone_page_state(zone, NR_INACTIVE_FILE)), -#ifdef CONFIG_UNEVICTABLE_LRU K(zone_page_state(zone, NR_UNEVICTABLE)), -#endif K(zone->present_pages), zone->pages_scanned, (zone_is_all_unreclaimable(zone) ? "yes" : "no") diff --git a/mm/rmap.c b/mm/rmap.c index 23122af32611..316c9d6930ad 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1202,7 +1202,6 @@ int try_to_unmap(struct page *page, int migration) return ret; } -#ifdef CONFIG_UNEVICTABLE_LRU /** * try_to_munlock - try to munlock a page * @page: the page to be munlocked @@ -1226,4 +1225,4 @@ int try_to_munlock(struct page *page) else return try_to_unmap_file(page, 1, 0); } -#endif + diff --git a/mm/vmscan.c b/mm/vmscan.c index 879d034930c4..2c4b945b011f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -514,7 +514,6 @@ int remove_mapping(struct address_space *mapping, struct page *page) * * lru_lock must not be held, interrupts must be enabled. */ -#ifdef CONFIG_UNEVICTABLE_LRU void putback_lru_page(struct page *page) { int lru; @@ -568,20 +567,6 @@ redo: put_page(page); /* drop ref from isolate */ } -#else /* CONFIG_UNEVICTABLE_LRU */ - -void putback_lru_page(struct page *page) -{ - int lru; - VM_BUG_ON(PageLRU(page)); - - lru = !!TestClearPageActive(page) + page_is_file_cache(page); - lru_cache_add_lru(page, lru); - put_page(page); -} -#endif /* CONFIG_UNEVICTABLE_LRU */ - - /* * shrink_page_list() returns the number of reclaimed pages */ @@ -2470,7 +2455,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) } #endif -#ifdef CONFIG_UNEVICTABLE_LRU /* * page_evictable - test whether a page is evictable * @page: the page to test @@ -2717,4 +2701,3 @@ void scan_unevictable_unregister_node(struct node *node) sysdev_remove_file(&node->sysdev, &attr_scan_unevictable_pages); } -#endif diff --git a/mm/vmstat.c b/mm/vmstat.c index 1e151cf6bf86..1e3aa8139f22 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -629,10 +629,8 @@ static const char * const vmstat_text[] = { "nr_active_anon", "nr_inactive_file", "nr_active_file", -#ifdef CONFIG_UNEVICTABLE_LRU "nr_unevictable", "nr_mlock", -#endif "nr_anon_pages", "nr_mapped", "nr_file_pages", @@ -687,7 +685,6 @@ static const char * const vmstat_text[] = { "htlb_buddy_alloc_success", "htlb_buddy_alloc_fail", #endif -#ifdef CONFIG_UNEVICTABLE_LRU "unevictable_pgs_culled", "unevictable_pgs_scanned", "unevictable_pgs_rescued", @@ -697,7 +694,6 @@ static const char * const vmstat_text[] = { "unevictable_pgs_stranded", "unevictable_pgs_mlockfreed", #endif -#endif }; static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, -- cgit v1.2.3-59-g8ed1b From 82553a937f12352c26fe457510ebab3f512cd3fa Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 16 Jun 2009 15:32:58 -0700 Subject: oom: invoke oom killer for __GFP_NOFAIL The oom killer must be invoked regardless of the order if the allocation is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to free some memory. Cc: Nick Piggin Acked-by: Rik van Riel Acked-by: Mel Gorman Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Dave Hansen Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c95a77cd581b..c5fb017c9430 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1561,7 +1561,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; /* The OOM killer will not help higher order allocs */ - if (order > PAGE_ALLOC_COSTLY_ORDER) + if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL)) goto out; /* Exhausted what can be done so it's blamo time */ @@ -1781,11 +1781,13 @@ rebalance: goto got_pg; /* - * The OOM killer does not trigger for high-order allocations - * but if no progress is being made, there are no other - * options and retrying is unlikely to help + * The OOM killer does not trigger for high-order + * ~__GFP_NOFAIL allocations so if no progress is being + * made, there are no other options and retrying is + * unlikely to help. */ - if (order > PAGE_ALLOC_COSTLY_ORDER) + if (order > PAGE_ALLOC_COSTLY_ORDER && + !(gfp_mask & __GFP_NOFAIL)) goto nopage; goto restart; -- cgit v1.2.3-59-g8ed1b From 73d60b7f747176dbdff826c4127d22e1fd3f9f74 Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Tue, 16 Jun 2009 15:33:00 -0700 Subject: page-allocator: clear N_HIGH_MEMORY map before we set it again SRAT tables may contains nodes of very small size. The arch code may decide to not activate such a node. However, currently the early boot code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be active although these nodes have no present pages. For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too Signed-off-by: Yinghai Lu Tested-by: Jack Steiner Acked-by: Christoph Lameter Cc: Mel Gorman Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c5fb017c9430..6407cbfccd77 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4204,6 +4204,11 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) early_node_map[i].start_pfn, early_node_map[i].end_pfn); + /* + * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init + * that node_mask, clear it at first + */ + nodes_clear(node_states[N_HIGH_MEMORY]); /* Initialise every node */ mminit_verify_pageflags_layout(); setup_nr_node_ids(); -- cgit v1.2.3-59-g8ed1b From fa5e084e43eb14c14942027e1e2e894aeed96097 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:33:22 -0700 Subject: vmscan: do not unconditionally treat zones that fail zone_reclaim() as full On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. The problem is that zone_reclaim() failing at all means the zone gets marked full. This can cause situations where a zone is usable, but is being skipped because it has been considered full. Take a situation where a large tmpfs mount is occuping a large percentage of memory overall. The pages do not get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full and the zonelist cache considers them not worth trying in the future. This patch makes zone_reclaim() return more fine-grained information about what occured when zone_reclaim() failued. The zone only gets marked full if it really is unreclaimable. If it's a case that the scan did not occur or if enough pages were not reclaimed with the limited reclaim_mode, then the zone is simply skipped. There is a side-effect to this patch. Currently, if zone_reclaim() successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go ahead. With this patch applied, zone watermarks are rechecked after zone_reclaim() does some work. This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26 ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the zonelist_cache was introduced. It was not intended that zone_reclaim() aggressively consider the zone to be full when it failed as full direct reclaim can still be an option. Due to the age of the bug, it should be considered a -stable candidate. Signed-off-by: Mel Gorman Reviewed-by: Wu Fengguang Reviewed-by: Rik van Riel Reviewed-by: KOSAKI Motohiro Cc: Christoph Lameter Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 4 ++++ mm/page_alloc.c | 26 ++++++++++++++++++++++---- mm/vmscan.c | 11 ++++++----- 3 files changed, 32 insertions(+), 9 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/internal.h b/mm/internal.h index f02c7508068d..f290c4db528b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -259,4 +259,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, struct page **pages, struct vm_area_struct **vmas); +#define ZONE_RECLAIM_NOSCAN -2 +#define ZONE_RECLAIM_FULL -1 +#define ZONE_RECLAIM_SOME 0 +#define ZONE_RECLAIM_SUCCESS 1 #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6407cbfccd77..2f457a756d46 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1462,15 +1462,33 @@ zonelist_scan: BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { unsigned long mark; + int ret; + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; - if (!zone_watermark_ok(zone, order, mark, - classzone_idx, alloc_flags)) { - if (!zone_reclaim_mode || - !zone_reclaim(zone, gfp_mask, order)) + if (zone_watermark_ok(zone, order, mark, + classzone_idx, alloc_flags)) + goto try_this_zone; + + if (zone_reclaim_mode == 0) + goto this_zone_full; + + ret = zone_reclaim(zone, gfp_mask, order); + switch (ret) { + case ZONE_RECLAIM_NOSCAN: + /* did not scan */ + goto try_next_zone; + case ZONE_RECLAIM_FULL: + /* scanned but unreclaimable */ + goto this_zone_full; + default: + /* did we reclaim enough */ + if (!zone_watermark_ok(zone, order, mark, + classzone_idx, alloc_flags)) goto this_zone_full; } } +try_this_zone: page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask, migratetype); if (page) diff --git a/mm/vmscan.c b/mm/vmscan.c index 79a98d98ed33..16c82a868e2b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2492,16 +2492,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) */ if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages && zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages) - return 0; + return ZONE_RECLAIM_FULL; if (zone_is_all_unreclaimable(zone)) - return 0; + return ZONE_RECLAIM_FULL; /* * Do not scan if the allocation should not be delayed. */ if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC)) - return 0; + return ZONE_RECLAIM_NOSCAN; /* * Only run zone reclaim on the local zone or on zones that do not @@ -2511,10 +2511,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) */ node_id = zone_to_nid(zone); if (node_state(node_id, N_CPU) && node_id != numa_node_id()) - return 0; + return ZONE_RECLAIM_NOSCAN; if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) - return 0; + return ZONE_RECLAIM_NOSCAN; + ret = __zone_reclaim(zone, gfp_mask, order); zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); -- cgit v1.2.3-59-g8ed1b