From 88f5004430babb836cfce886d5d54c82166f8ba4 Mon Sep 17 00:00:00 2001
From: Yongseok Koh <yongseok.koh@samsung.com>
Date: Tue, 19 Jan 2010 17:33:49 +0900
Subject: vmalloc: remove BUG_ON due to racy counting of VM_LAZY_FREE

In free_unmap_area_noflush(), va->flags is marked as VM_LAZY_FREE first, and
then vmap_lazy_nr is increased atomically.

But, in __purge_vmap_area_lazy(), while traversing of vmap_are_list, nr
is counted by checking VM_LAZY_FREE is set to va->flags.  After counting
the variable nr, kernel reads vmap_lazy_nr atomically and checks a
BUG_ON condition whether nr is greater than vmap_lazy_nr to prevent
vmap_lazy_nr from being negative.

The problem is that, if interrupted right after marking VM_LAZY_FREE,
increment of vmap_lazy_nr can be delayed.  Consequently, BUG_ON
condition can be met because nr is counted more than vmap_lazy_nr.

It is highly probable when vmalloc/vfree are called frequently.  This
scenario have been verified by adding delay between marking VM_LAZY_FREE
and increasing vmap_lazy_nr in free_unmap_area_noflush().

Even the vmap_lazy_nr is for checking high watermark, it never be the
strict watermark.  Although the BUG_ON condition is to prevent
vmap_lazy_nr from being negative, vmap_lazy_nr is signed variable.  So,
it could go down to negative value temporarily.

Consequently, removing the BUG_ON condition is proper.

A possible BUG_ON message is like the below.

   kernel BUG at mm/vmalloc.c:517!
   invalid opcode: 0000 [#1] SMP
   EIP: 0060:[<c04824a4>] EFLAGS: 00010297 CPU: 3
   EIP is at __purge_vmap_area_lazy+0x144/0x150
   EAX: ee8a8818 EBX: c08e77d4 ECX: e7c7ae40 EDX: c08e77ec
   ESI: 000081fe EDI: e7c7ae60 EBP: e7c7ae64 ESP: e7c7ae3c
   DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
   Call Trace:
   [<c0482ad9>] free_unmap_vmap_area_noflush+0x69/0x70
   [<c0482b02>] remove_vm_area+0x22/0x70
   [<c0482c15>] __vunmap+0x45/0xe0
   [<c04831ec>] vmalloc+0x2c/0x30
   Code: 8d 59 e0 eb 04 66 90 89 cb 89 d0 e8 87 fe ff ff 8b 43 20 89 da 8d 48 e0 8d 43 20 3b 04 24 75 e7 fe 05 a8 a5 a3 c0 e9 78 ff ff ff <0f> 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c6 b8 ac a5 a3 c0 31
   EIP: [<c04824a4>] __purge_vmap_area_lazy+0x144/0x150 SS:ESP 0068:e7c7ae3c

[ See also http://marc.info/?l=linux-kernel&m=126335856228090&w=2 ]

Signed-off-by: Yongseok Koh <yongseok.koh@samsung.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/vmalloc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

(limited to 'mm')
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 37e69295f250..d55d905463eb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -555,10 +555,8 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 	}
 	rcu_read_unlock();
 
-	if (nr) {
-		BUG_ON(nr > atomic_read(&vmap_lazy_nr));
+	if (nr)
 		atomic_sub(nr, &vmap_lazy_nr);
-	}
 
 	if (nr || force_flush)
 		flush_tlb_kernel_range(*start, *end);
-- 
cgit v1.2.3-59-g8ed1b


From 0531b2aac59c2296570ac52bfc032ef2ace7d5e1 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 27 Jan 2010 09:20:03 -0800
Subject: mm: add new 'read_cache_page_gfp()' helper function

It's a simplified 'read_cache_page()' which takes a page allocation
flag, so that different paths can control how aggressive the memory
allocations are that populate a address space.

In particular, the intel GPU object mapping code wants to be able to do
a certain amount of own internal memory management by automatically
shrinking the address space when memory starts getting tight.  This
allows it to dynamically use different memory allocation policies on a
per-allocation basis, rather than depend on the (static) address space
gfp policy.

The actual new function is a one-liner, but re-organizing the helper
functions to the point where you can do this with a single line of code
is what most of the patch is all about.

Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/pagemap.h |   2 +
 mm/filemap.c            | 100 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 70 insertions(+), 32 deletions(-)

(limited to 'mm')

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ed5d7501e181..3c62ed408492 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -253,6 +253,8 @@ extern struct page * read_cache_page_async(struct address_space *mapping,
 extern struct page * read_cache_page(struct address_space *mapping,
 				pgoff_t index, filler_t *filler,
 				void *data);
+extern struct page * read_cache_page_gfp(struct address_space *mapping,
+				pgoff_t index, gfp_t gfp_mask);
 extern int read_cache_pages(struct address_space *mapping,
 		struct list_head *pages, filler_t *filler, void *data);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 96ac6b0eb6cb..e3736923220e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1634,14 +1634,15 @@ EXPORT_SYMBOL(generic_file_readonly_mmap);
 static struct page *__read_cache_page(struct address_space *mapping,
 				pgoff_t index,
 				int (*filler)(void *,struct page*),
-				void *data)
+				void *data,
+				gfp_t gfp)
 {
 	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
@@ -1661,31 +1662,18 @@ repeat:
 	return page;
 }
 
-/**
- * read_cache_page_async - read into page cache, fill it if needed
- * @mapping:	the page's address_space
- * @index:	the page index
- * @filler:	function to perform the read
- * @data:	destination for read data
- *
- * Same as read_cache_page, but don't wait for page to become unlocked
- * after submitting it to the filler.
- *
- * Read into the page cache. If a page already exists, and PageUptodate() is
- * not set, try to fill the page but don't wait for it to become unlocked.
- *
- * If the page does not get brought uptodate, return -EIO.
- */
-struct page *read_cache_page_async(struct address_space *mapping,
+static struct page *do_read_cache_page(struct address_space *mapping,
 				pgoff_t index,
 				int (*filler)(void *,struct page*),
-				void *data)
+				void *data,
+				gfp_t gfp)
+
 {
 	struct page *page;
 	int err;
 
 retry:
-	page = __read_cache_page(mapping, index, filler, data);
+	page = __read_cache_page(mapping, index, filler, data, gfp);
 	if (IS_ERR(page))
 		return page;
 	if (PageUptodate(page))
@@ -1710,8 +1698,67 @@ out:
 	mark_page_accessed(page);
 	return page;
 }
+
+/**
+ * read_cache_page_async - read into page cache, fill it if needed
+ * @mapping:	the page's address_space
+ * @index:	the page index
+ * @filler:	function to perform the read
+ * @data:	destination for read data
+ *
+ * Same as read_cache_page, but don't wait for page to become unlocked
+ * after submitting it to the filler.
+ *
+ * Read into the page cache. If a page already exists, and PageUptodate() is
+ * not set, try to fill the page but don't wait for it to become unlocked.
+ *
+ * If the page does not get brought uptodate, return -EIO.
+ */
+struct page *read_cache_page_async(struct address_space *mapping,
+				pgoff_t index,
+				int (*filler)(void *,struct page*),
+				void *data)
+{
+	return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping));
+}
 EXPORT_SYMBOL(read_cache_page_async);
 
+static struct page *wait_on_page_read(struct page *page)
+{
+	if (!IS_ERR(page)) {
+		wait_on_page_locked(page);
+		if (!PageUptodate(page)) {
+			page_cache_release(page);
+			page = ERR_PTR(-EIO);
+		}
+	}
+	return page;
+}
+
+/**
+ * read_cache_page_gfp - read into page cache, using specified page allocation flags.
+ * @mapping:	the page's address_space
+ * @index:	the page index
+ * @gfp:	the page allocator flags to use if allocating
+ *
+ * This is the same as "read_mapping_page(mapping, index, NULL)", but with
+ * any new page allocations done using the specified allocation flags. Note
+ * that the Radix tree operations will still use GFP_KERNEL, so you can't
+ * expect to do this atomically or anything like that - but you can pass in
+ * other page requirements.
+ *
+ * If the page does not get brought uptodate, return -EIO.
+ */
+struct page *read_cache_page_gfp(struct address_space *mapping,
+				pgoff_t index,
+				gfp_t gfp)
+{
+	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
+
+	return wait_on_page_read(do_read_cache_page(mapping, index, filler, NULL, gfp));
+}
+EXPORT_SYMBOL(read_cache_page_gfp);
+
 /**
  * read_cache_page - read into page cache, fill it if needed
  * @mapping:	the page's address_space
@@ -1729,18 +1776,7 @@ struct page *read_cache_page(struct address_space *mapping,
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page;
-
-	page = read_cache_page_async(mapping, index, filler, data);
-	if (IS_ERR(page))
-		goto out;
-	wait_on_page_locked(page);
-	if (!PageUptodate(page)) {
-		page_cache_release(page);
-		page = ERR_PTR(-EIO);
-	}
- out:
-	return page;
+	return wait_on_page_read(read_cache_page_async(mapping, index, filler, data));
 }
 EXPORT_SYMBOL(read_cache_page);
 
-- 
cgit v1.2.3-59-g8ed1b


From a7016235a61d520e6806f38129001d935c4b6661 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Date: Fri, 29 Jan 2010 17:46:34 +0000
Subject: mm: fix migratetype bug which slowed swapping

After memory pressure has forced it to dip into the reserves, 2.6.32's
5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

Fix that in the most straightforward way (which, considering the overheads
of alternative approaches, is Mel's preference): the right migratetype is
already in page_private(page), but free_pcppages_bulk() wasn't using it.

How did this bug show up?  As a 20% slowdown in my tmpfs loop kbuild
swapping tests, on PowerMac G5 with SLUB allocator.  Bisecting to that
commit was easy, but explaining the magnitude of the slowdown not easy.

The same effect appears, but much less markedly, with SLAB, and even
less markedly on other machines (the PowerMac divides into fewer zones
than x86, I think that may be a factor).  We guess that lumpy reclaim
of short-lived high-order pages is implicated in some way, and probably
this bug has been tickling a poor decision somewhere in page reclaim.

But instrumentation hasn't told me much, I've run out of time and
imagination to determine exactly what's going on, and shouldn't hold up
the fix any longer: it's valid, and might even fix other misbehaviours.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

(limited to 'mm')

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2a8889b4c58..8deb9d0fd5b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -556,8 +556,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
-			__free_one_page(page, zone, 0, migratetype);
-			trace_mm_page_pcpu_drain(page, 0, migratetype);
+			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
+			__free_one_page(page, zone, 0, page_private(page));
+			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
-- 
cgit v1.2.3-59-g8ed1b