Age | Commit message | Author | Files | Lines
2012-10-09 | rbtree: add __rb_change_child() helper function | Michel Lespinasse | 1 | -29/+17
Add __rb_change_child() as an inline helper function to replace code that would otherwise be duplicated 4 times in the source. No changes to binary size or speed. Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
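As a rough illustration of the kind of helper described above, the sketch below redirects whichever link pointed at the old node (a parent's left or right child link, or the tree root) to a replacement node. The struct layouts are simplified stand-ins and the name change_child_sketch is hypothetical; only the idea is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the rbtree types (assumption: no color field). */
struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

struct rb_root {
	struct rb_node *rb_node;
};

/*
 * Make the link that pointed at 'old' point at 'repl' instead.
 * With a parent, that is one of the parent's child links;
 * without a parent, 'old' was the root, so the root is updated.
 */
static inline void
change_child_sketch(struct rb_node *old, struct rb_node *repl,
		    struct rb_node *parent, struct rb_root *root)
{
	if (parent) {
		if (parent->rb_left == old)
			parent->rb_left = repl;
		else
			parent->rb_right = repl;
	} else {
		root->rb_node = repl;
	}
}
```

Centralizing this three-way choice in one inline function is what lets the duplicated open-coded copies go away without affecting the generated code.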
2012-10-09 | rbtree test: fix sparse warning about 64-bit constant | Michel Lespinasse | 1 | -1/+1
Just a small fix to make sparse happy. Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Fengguang Wu <wfg@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: optimize fetching of sibling node | Michel Lespinasse | 1 | -8/+13
When looking to fetch a node's sibling, we went through a sequence of: - check if node is the parent's left child - if it is, then fetch the parent's right child This can be replaced with: - fetch the parent's right child as an assumed sibling - check that node is NOT the fetched child This avoids fetching the parent's left child when node is actually that child. Saves a bit on code size, though it doesn't seem to make a large difference in speed. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
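The two sequences described above can be compared side by side in a small sketch. The struct and function names here are hypothetical, and the sketch assumes the sibling is never NULL, so that a NULL node cannot be confused with it.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal node type for illustration only. */
struct node {
	struct node *left;
	struct node *right;
};

/* Old sequence: test which side 'node' is on, then load the other side. */
static struct node *sibling_old(struct node *parent, struct node *node)
{
	if (parent->left == node)
		return parent->right;
	return parent->left;
}

/*
 * New sequence: load the right child as the assumed sibling and only
 * load the left child when 'node' turns out to be that right child.
 */
static struct node *sibling_new(struct node *parent, struct node *node)
{
	struct node *sibling = parent->right;

	if (node == sibling)
		sibling = parent->left;
	return sibling;
}
```

When node is the left child, the new sequence reads only parent->right, which is the load saving the commit message refers to.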
2012-10-09 | rbtree: coding style adjustments | Michel Lespinasse | 1 | -19/+23
Set comment and indentation style to be consistent with linux coding style and the rest of the file, as suggested by Peter Zijlstra Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: low level optimizations in __rb_erase_color() | Michel Lespinasse | 1 | -93/+115
In __rb_erase_color(), we often already have pointers to the nodes being rotated and/or know what their colors must be, so we can generate more efficient code than the generic __rb_rotate_left() and __rb_rotate_right() functions. Also when the current node is red or when flipping the sibling's color, the parent is already known so we can use the more efficient rb_set_parent_color() function to set the desired color. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: optimize case selection logic in __rb_erase_color() | Michel Lespinasse | 1 | -38/+30
In __rb_erase_color(), we have to select one of 3 cases depending on the color on the 'other' node children. If both children are black, we flip a few node colors and iterate. Otherwise, we do either one or two tree rotations, depending on the color of the 'other' child opposite to 'node', and then we are done. The corresponding logic had duplicate checks for the color of the 'other' child opposite to 'node'. It was checking it first to determine if both children are black, and then to determine how many tree rotations are required. Rearrange the logic to avoid that extra check. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: adjust node color in __rb_erase_color() only when necessary | Michel Lespinasse | 2 | -11/+18
In __rb_erase_color(), we were always setting a node to black after exiting the main loop. And in one case, after fixing up the tree to satisfy all rbtree invariants, we were setting the current node to root just to guarantee a loop exit, at which point the root would be set to black. However this is not necessary, as the root of an rbtree is already known to be black. The only case where the color flip is required is when we exit the loop due to the current node being red, and it's easiest to just do the flip at that point instead of doing it after the loop. [adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: low level optimizations in rb_insert_color() | Michel Lespinasse | 1 | -35/+131
- Use the newly introduced rb_set_parent_color() function to flip the color of nodes whose parent is already known. - Optimize rb_parent() when the node is known to be red - there is no need to mask out the color in that case. - Flipping gparent's color to red requires us to fetch its rb_parent_color field, so we can reuse it as the parent value for the next loop iteration. - Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree rotations: we already have pointers to all relevant nodes, and know their colors (either because we want to adjust it, or because we've tested it, or we can deduce it as black due to the node proximity to a known red node). So we can generate more efficient code by making use of the node pointers we already have, and setting both the parent and color attributes for nodes all at once. Also in Case 2, some node attributes don't have to be set because we know another tree rotation (Case 3) will always follow and override them. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
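To make the first two points more concrete, here is a minimal user-space sketch of a combined parent-and-color field. It assumes the color lives in the lowest bit of the field with red encoded as 0, so that a red node's field is exactly its parent pointer. The field name parent_color and the helper rb_red_parent_sketch are hypothetical; rb_parent and rb_set_parent_color are named after the functions mentioned above, but their bodies here are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define RB_RED   0	/* assumption: red is encoded as a clear low bit */
#define RB_BLACK 1

struct rb_node {
	uintptr_t parent_color;	/* parent pointer with the color in bit 0 */
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

/* Generic accessor: must mask out the low bits to recover the pointer. */
static inline struct rb_node *rb_parent(const struct rb_node *n)
{
	return (struct rb_node *)(n->parent_color & ~(uintptr_t)3);
}

static inline int rb_color(const struct rb_node *n)
{
	return (int)(n->parent_color & 1);
}

/* Set the parent and the color with a single store. */
static inline void rb_set_parent_color(struct rb_node *n,
				       struct rb_node *parent, int color)
{
	n->parent_color = (uintptr_t)parent | (uintptr_t)color;
}

/* Valid only when 'n' is known to be red: no masking is needed. */
static inline struct rb_node *rb_red_parent_sketch(const struct rb_node *n)
{
	return (struct rb_node *)n->parent_color;
}
```

The sketch relies on nodes being aligned to at least 4 bytes, which leaves the two low bits of any node address free for the color.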
2012-10-09 | rbtree: adjust root color in rb_insert_color() only when necessary | Michel Lespinasse | 1 | -4/+15
The root node of an rbtree must always be black. However, rb_insert_color() only needs to maintain this invariant when it has been broken - that is, when it exits the loop due to the current (red) node being the root. In all other cases (exiting after tree rotations, or exiting due to an existing black parent) the invariant is already satisfied, so there is no need to adjust the root node color. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: break out of rb_insert_color loop after tree rotation | Michel Lespinasse | 1 | -10/+4
It is a well known property of rbtrees that insertion never requires more than two tree rotations. In our implementation, after one loop iteration identified one or two necessary tree rotations, we would iterate and look for more. However at that point the node's parent would always be black, which would cause us to exit the loop. We can make the code flow more obvious by just adding a break statement after the tree rotations, where we know we are done. Additionally, in the cases where two tree rotations are necessary, we don't have to update the 'node' pointer as it wouldn't be used until the next loop iteration, which we now avoid due to this break statement. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: performance and correctness test | Michel Lespinasse | 3 | -0/+144
This small module helps measure the performance of rbtree insert and erase. Additionally, we run a few correctness tests to check that the rbtrees have all desired properties: - contains the right number of nodes in the order desired, - never two consecutive red nodes on any path, - all paths to leaf nodes have the same number of black nodes, - root node is black [akpm@linux-foundation.org: fix printk warning: sparc64 cycles_t is unsigned long] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
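Three of the properties listed above (black root, no two consecutive red nodes, equal black count on every path) can be checked with a short recursive walk. The sketch below is not the test module itself. It uses a hypothetical node type with an explicit color flag and omits the node-count and ordering checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type with an explicit color flag. */
struct tnode {
	int key;
	bool red;
	struct tnode *left;
	struct tnode *right;
};

/*
 * Return the black height of the subtree rooted at 'n', or -1 if the
 * subtree contains a red node with a red parent, or two paths with a
 * different number of black nodes. NULL leaves count as black.
 */
static int black_height(const struct tnode *n, bool parent_red)
{
	int lh, rh;

	if (!n)
		return 1;
	if (n->red && parent_red)
		return -1;

	lh = black_height(n->left, n->red);
	rh = black_height(n->right, n->red);
	if (lh < 0 || rh < 0 || lh != rh)
		return -1;

	return lh + (n->red ? 0 : 1);
}

/* An empty tree is valid; otherwise the root must be black. */
static bool rbtree_is_valid(const struct tnode *root)
{
	if (!root)
		return true;
	if (root->red)
		return false;
	return black_height(root, false) >= 0;
}
```

Running a check like this after every insert or erase is expensive, which is acceptable in a test module but not in production paths.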
2012-10-09 | rbtree: move some implementation details from rbtree.h to rbtree.c | Michel Lespinasse | 3 | -31/+36
rbtree users must use the documented APIs to manipulate the tree structure. Low-level helpers to manipulate node colors and parenthood are not part of that API, so move them to lib/rbtree.c [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.c | Michel Lespinasse | 1 | -0/+1
The recently added code to use rbtrees in sysctl did not follow the proper rbtree interface on insertion - it was calling rb_link_node() which inserts a new node into the binary tree, but missed the call to rb_insert_color() which properly balances the rbtree and establishes all expected rbtree invariants. I found out about this only because the faulty commit also used rb_init_node(), which I am removing within this patchset. But I think it's an easy mistake to make, and it makes me wonder if we should change the rbtree API so that insertions would be done with a single rb_insert() call (even if its implementation could still inline the rb_link_node() part and call a private __rb_insert_color function to do the rebalancing). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | rbtree: empty nodes have no color | Michel Lespinasse | 5 | -17/+9
Empty nodes have no color. We can make use of this property to simplify the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also, we can get rid of the rb_init_node function which had been introduced by commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack allocated rb nodes") to avoid some issue with the empty node's color not being initialized. I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are doing there, though. axboe introduced them in commit 10fd48f2376d ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I see it, the 'empty node' abstraction is only used by rbtree users to flag nodes that they haven't inserted in any rbtree, so asking the predecessor or successor of such nodes doesn't make any sense. One final rb_init_node() caller was recently added in sysctl code to implement faster sysctl name lookups. This code doesn't make use of RB_EMPTY_NODE at all, and from what I could see it only called rb_init_node() under the mistaken assumption that such initialization was required before node insertion. [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build] Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
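One way to picture why ignoring the color simplifies these macros: if an empty node is marked by making its combined parent-and-color field refer to the node itself, then both marking and testing become a single store or compare with no masking. The sketch below assumes that representation. The field name and the lowercase function names are hypothetical stand-ins for the RB_EMPTY_NODE and RB_CLEAR_NODE macros.

```c
#include <assert.h>
#include <stdint.h>

struct rb_node {
	uintptr_t parent_color;	/* parent pointer plus color bits */
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

/* Mark 'n' as not being in any tree: the field points back at 'n'. */
static inline void rb_clear_node_sketch(struct rb_node *n)
{
	n->parent_color = (uintptr_t)n;
}

/* A node is empty when the whole field, color bits included, equals 'n'. */
static inline int rb_empty_node_sketch(const struct rb_node *n)
{
	return n->parent_color == (uintptr_t)n;
}
```

Because a node that is linked into a tree can never be its own parent, the self-reference is unambiguous regardless of what the color bits happen to hold.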
2012-10-09 | rbtree: reference Documentation/rbtree.txt for usage instructions | Michel Lespinasse | 1 | -66/+1
I recently started looking at the rbtree code (with an eye towards improving the augmented rbtree support, but I haven't gotten there yet). I noticed a lot of possible speed improvements, which I am now proposing in this patch set. Patches 1-4 are preparatory: remove internal functions from rbtree.h so that users won't be tempted to use them instead of the documented APIs, clean up some incorrect usages I've noticed (in particular, with the recently added fs/proc/proc_sysctl.c rbtree usage), reference the documentation so that people have one less excuse to miss it, etc. Patch 5 is a small module I wrote to check the rbtree performance. It creates 100 nodes with random keys and repeatedly inserts and erases them from an rbtree. Additionally, it has code to check for rbtree invariants after each insert or erase operation. Patches 6-12 are where the rbtree optimizations are done, and they touch only that one file, lib/rbtree.c. I am getting good results out of these - in my small benchmark doing rbtree insertion (including search) and erase, I'm seeing a 30% runtime reduction on Sandybridge E5, which is more than I initially thought would be possible. (The results aren't as impressive on my two other test hosts though, AMD Barcelona and Intel Westmere, where I am seeing only a 14% runtime reduction.) The code size - both source (omitting comments) and compiled - is also shorter after these changes. However, I do admit that the updated code is more arduous to read - one big reason for that is the removal of the tree rotation helpers, which added some overhead but also made it easier to reason about things locally. Overall, I believe this is an acceptable compromise, given that this code doesn't get modified very often, and that I have good tests for it. Upon Peter's suggestion, I added comments showing the rbtree configuration before every rotation. I think they help; however it's still best to have a copy of the Cormen/Leiserson/Rivest book when digging into this code.
This patch: reference Documentation/rbtree.txt for usage instructions include/linux/rbtree.h included some basic usage instructions, while Documentation/rbtree.txt had some more complete and easier to follow instructions. Replacing the former with a reference to the latter. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | ipc/mqueue: remove unnecessary rb_init_node() calls | Michel Lespinasse | 1 | -3/+0
Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an rbtree of message priorities, and usage of rb_init_node() to initialize the corresponding nodes. As it turns out, rb_init_node() is unnecessary here, as the nodes are fully initialized on insertion by rb_link_node() and the code doesn't access nodes that aren't inserted on the rbtree. Removing the rb_init_node() calls as I removed that function during rbtree API cleanups (the only other use of it was in a place that similarly didn't require it). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp, s390: architecture backend for thp on s390 | Gerald Schaefer | 6 | -18/+220
This implements the architecture backend for transparent hugepages on s390. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp, s390: disable thp for kvm host on s390 | Gerald Schaefer | 1 | -0/+30
This patch is part of the architecture backend for thp on s390. It disables thp for kvm hosts, because there is no kvm host hugepage support so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT, and future thp mappings are prevented by setting VM_NOHUGEPAGE in mm->def_flags. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp, s390: thp pagetable pre-allocation for s390 | Gerald Schaefer | 2 | -0/+44
This patch is part of the architecture backend for thp on s390. It provides the pagetable pre-allocation functions pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike other archs, s390 has no struct page * as pgtable_t, but rather a pointer to the page table. So instead of saving the pagetable pre-allocation list info inside the struct page, it is being saved within the pagetable itself. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp, s390: thp splitting backend for s390 | Gerald Schaefer | 3 | -1/+41
This patch is part of the architecture backend for thp on s390. It provides the functions related to thp splitting, including serialization against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb flushing operation to serialize against gup on s390, because that wouldn't be stopped by the disabled IRQs. So instead, smp_call_function() is called with an empty function, which will have the expected effect. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: make MADV_HUGEPAGE check for mm->def_flags | Gerald Schaefer | 1 | -0/+4
This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if VM_NOHUGEPAGE is set in mm->def_flags. On s390, the VM_NOHUGEPAGE flag will be set in mm->def_flags for kvm processes, to prevent any future thp mappings. In order to also prevent MADV_HUGEPAGE on such an mm, hugepage_madvise() should check mm->def_flags. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
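The shape of such a check can be sketched in user space as follows. The flag values, the mm_sketch type, the function name and the choice of -EINVAL as the refusal code are all assumptions made for illustration; only the rule that MADV_HUGEPAGE is refused when VM_NOHUGEPAGE is set in the mm's default flags comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define VM_HUGEPAGE   0x1UL
#define VM_NOHUGEPAGE 0x2UL

/* Stand-in for the part of the mm that carries default VMA flags. */
struct mm_sketch {
	unsigned long def_flags;
};

/*
 * Handle an MADV_HUGEPAGE-style request for one mapping. The request
 * is refused outright when the whole mm has opted out of huge pages.
 */
static int madvise_hugepage_sketch(const struct mm_sketch *mm,
				   unsigned long *vm_flags)
{
	if (mm->def_flags & VM_NOHUGEPAGE)
		return -EINVAL;	/* assumed error code */

	*vm_flags &= ~VM_NOHUGEPAGE;
	*vm_flags |= VM_HUGEPAGE;
	return 0;
}
```

Checking the mm-wide flags first means a process that was marked as a kvm host cannot re-enable thp on individual mappings through madvise.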
2012-10-09 | thp: introduce pmdp_invalidate() | Gerald Schaefer | 3 | -2/+17
On s390, a valid page table entry must not be changed while it is attached to any CPU. So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE operation would be necessary there. This patch introduces the pmdp_invalidate() function, to allow architecture-specific implementations. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: remove assumptions on pgtable_t type | Gerald Schaefer | 4 | -43/+55
The thp page table pre-allocation code currently assumes that pgtable_t is of type "struct page *". This may not be true for all architectures, so this patch removes that assumption by replacing the functions prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that can be defined architecture-specific. It also removes two VM_BUG_ON checks for page_count() and page_mapcount() operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be no functional change introduced by this patch. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp, x86: introduce HAVE_ARCH_TRANSPARENT_HUGEPAGE | Gerald Schaefer | 3 | -1/+5
Cleanup patch in preparation for transparent hugepage support on s390. Adding new architectures to the TRANSPARENT_HUGEPAGE config option can make the "depends" line rather ugly, like "depends on (X86 || (S390 && 64BIT)) && MMU". This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has MMU "def_bool y", so the MMU check is superfluous there and HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | mm: fix potential anon_vma locking issue in mprotect() | Michel Lespinasse | 1 | -1/+5
Fix an anon_vma locking issue in the following situation: - vma has no anon_vma - next has an anon_vma - vma is being shrunk / next is being expanded, due to an mprotect call We need to take next's anon_vma lock to avoid races with rmap users (such as page migration) while next is being expanded. Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: remove unnecessary set_recommended_min_free_kbytes | Xiao Guangrong | 1 | -5/+0
Since it is called in start_khugepaged Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: use khugepaged_enabled to remove duplicate code | Xiao Guangrong | 1 | -9/+2
Use khugepaged_enabled to see whether thp is enabled Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: remove khugepaged_loop | Xiao Guangrong | 1 | -10/+4
Merge khugepaged_loop into khugepaged Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: introduce khugepaged_prealloc_page and khugepaged_alloc_page | Xiao Guangrong | 1 | -68/+98
They are used to abstract the difference between NUMA enabled and NUMA disabled to make the code more readable Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: release page in page pre-alloc path | Xiao Guangrong | 1 | -12/+7
If NUMA is enabled, we can release the page in the page pre-alloc operation, then the CONFIG_NUMA dependent code can be reduced Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: merge page pre-alloc in khugepaged_loop into khugepaged_do_scan | Xiao Guangrong | 1 | -52/+45
There are two pre-alloc operations in these two functions; the difference is: - khugepaged_loop allows sleeping if the page allocation fails - khugepaged_do_scan exits immediately if the page allocation fails. Actually, in khugepaged_do_scan, we can allow the pre-alloc to sleep on the first failure, then the operation in khugepaged_loop can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: remove some code depend on CONFIG_NUMA | Xiao Guangrong | 1 | -7/+3
If NUMA is disabled, hpage is used as page pre-alloc, so there are two cases for hpage: - it is !NULL, which means the page has not been consumed - otherwise, the page has been consumed. If NUMA is enabled, hpage is just used as an alloc-fail indicator which is not a real page; NULL means no failure was triggered. So, we can release the page only if !IS_ERR_OR_NULL Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
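The single release condition can be illustrated with a user-space imitation of the error-pointer convention, in which small negative error codes are encoded as addresses at the very top of the address space. All names below are hypothetical stand-ins (including the counter used in place of a real page release), and the 4095 limit is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_ERRNO_SKETCH 4095	/* assumed size of the error-code range */

/* Encode a negative error code as a pointer value. */
static inline void *err_ptr_sketch(long err)
{
	return (void *)(intptr_t)err;
}

/* True for NULL and for any pointer inside the error-code range. */
static inline int is_err_or_null_sketch(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

static int pages_released;	/* counts releases, for demonstration */

static void put_page_sketch(void *page)
{
	(void)page;
	pages_released++;
}

/*
 * One release path for both configurations: a real, unconsumed page is
 * released; NULL (consumed or no failure) and an error marker are not.
 */
static void release_hpage_sketch(void *hpage)
{
	if (!is_err_or_null_sketch(hpage))
		put_page_sketch(hpage);
}
```

With this convention the caller no longer needs to know whether hpage holds a pre-allocated page or a failure indicator, which is what removes the configuration-dependent branch.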
2012-10-09 | thp: remove wake_up_interruptible in the exit path | Xiao Guangrong | 1 | -14/+21
Add the check of kthread_should_stop() to the conditions which are used to wakeup on khugepaged_wait, then kthread_stop is enough to let the thread exit Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: remove unnecessary khugepaged_thread check | Xiao Guangrong | 1 | -4/+1
Now, khugepaged creation and cancel are completely serial under the protection of khugepaged_mutex, it is impossible that many khugepaged entities are running Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 | thp: move khugepaged_mutex out of khugepaged | Xiao Guangrong | 1 | -23/+13
Currently, khugepaged_mutex is used in a complex way that is hard to understand. Actually, it is just used to serialize start_khugepaged and khugepaged for these reasons: - khugepaged_thread is shared between them - the thp disable path (echo never > transparent_hugepage/enabled) is nonblocking, so we need to protect khugepaged_thread to get a stable running state These can be avoided by: - using the lock to serialize thread creation and cancellation - not letting the thp disable path finish until the thread exits Then khugepaged_thread is fully controlled by start_khugepaged, and khugepaged will be happy without the lock Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09thp: remove unnecessary check in start_khugepagedXiao Guangrong1-5/+2
The check is unnecessary: if initialization of mm_slot_cache or mm_slots_hash fails, no sysfs interface is created. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09thp: fix the count of THP_COLLAPSE_ALLOCXiao Guangrong1-1/+1
THP_COLLAPSE_ALLOC is double counted if NUMA is disabled, since it has already been counted in khugepaged_alloc_hugepage. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: adjust final #endif position in mm/internal.hMichel Lespinasse1-1/+2
Make sure the #endif that terminates the standard #ifndef / #define / #endif construct gets labeled, and gets positioned at the end of the file as is normally the case. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
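For illustration, the include-guard layout this message refers to looks roughly like the sketch below. The guard name `EXAMPLE_INTERNAL_H` and the helper `example_max()` are hypothetical and are not taken from mm/internal.h.

```c
#ifndef EXAMPLE_INTERNAL_H
#define EXAMPLE_INTERNAL_H

/* Everything the header provides sits between the guard and the final #endif. */
static inline int example_max(int a, int b)
{
	return a > b ? a : b;
}

/* The closing #endif is the last line of the file and names its guard. */
#endif	/* EXAMPLE_INTERNAL_H */
```

Labeling the final #endif makes it obvious which conditional it closes, and keeping it at the very end ensures later additions stay inside the guard.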
2012-10-09mm: hugetlb: add arch hook for clearing page flags before entering poolWill Deacon9-0/+32
The core page allocator ensures that page flags are zeroed when freeing pages via free_pages_check. A number of architectures (ARM, PPC, MIPS) rely on this property to treat new pages as dirty with respect to the data cache and perform the appropriate flushing before mapping the pages into userspace. This can lead to cache synchronisation problems when using hugepages, since the allocator keeps its own pool of pages above the usual page allocator and does not reset the page flags when freeing a page into the pool. This patch adds a new architecture hook, arch_clear_hugepage_flags, so that architectures which rely on the page flags being in a particular state for fresh allocations can adjust the flags accordingly when a page is freed into the pool. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Michal Hocko <mhocko@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
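The hook can be pictured with a small user-space model. This is a sketch under assumptions: `struct page`, `struct hugepage_pool`, `pool_free_page()` and the flag `PG_DCACHE_CLEAN` are simplified, hypothetical stand-ins; only the hook name arch_clear_hugepage_flags comes from the message above.

```c
#include <assert.h>
#include <stddef.h>

#define PG_DCACHE_CLEAN (1UL << 0)  /* hypothetical "D-cache already clean" flag */
#define POOL_CAPACITY   8

struct page {
	unsigned long flags;
};

struct hugepage_pool {
	struct page *free[POOL_CAPACITY];
	size_t nr_free;
};

/*
 * Architecture hook (modelled): put the flags back into the state a fresh
 * allocation would have, so the page is treated as dirty with respect to
 * the data cache the next time it is mapped.
 */
static void arch_clear_hugepage_flags(struct page *page)
{
	page->flags &= ~PG_DCACHE_CLEAN;
}

/* Free a page into the pool; returns 0 on success, -1 if the pool is full. */
static int pool_free_page(struct hugepage_pool *pool, struct page *page)
{
	assert(pool != NULL && page != NULL);
	if (pool->nr_free == POOL_CAPACITY)
		return -1;
	arch_clear_hugepage_flags(page);  /* the pool bypasses the core allocator's reset */
	pool->free[pool->nr_free++] = page;
	return 0;
}
```

An architecture that does not depend on the flag state would supply an empty hook, so the call costs nothing there.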
2012-10-09oom: remove deprecated oom_adjDavidlohr Bueso7-171/+7
The deprecated /proc/<pid>/oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm/vmscan: fix error number for failed kthreadGavin Shan1-2/+2
Fix the return value when creating the kswapd kernel thread fails. Also, print the error message at KERN_ERR priority. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
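The idea of returning the real error number can be sketched in user space as follows. The `model_*` helpers imitate the kernel's error-pointer convention and are assumptions made for illustration; they are not the kernel's ERR_PTR/IS_ERR/PTR_ERR macros, and `model_thread_create()` is a hypothetical stand-in for thread creation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value (model of the kernel idiom). */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static int dummy_thread;  /* stands in for a successfully created thread */

/* Hypothetical thread creation: fails with the given errno when nonzero. */
static void *model_thread_create(int fail_with)
{
	if (fail_with)
		return model_err_ptr(-fail_with);
	return &dummy_thread;
}

/* Propagate the actual error number instead of a hard-coded value. */
static int model_kswapd_run(int fail_with)
{
	void *t = model_thread_create(fail_with);

	assert(fail_with >= 0 && fail_with <= MODEL_MAX_ERRNO);
	if (model_is_err(t))
		return (int)model_ptr_err(t);
	return 0;
}
```

Callers can then tell an out-of-memory failure from an interrupted one rather than seeing the same generic value for both.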
2012-10-09mm/mmu_notifier: init notifier if necessaryGavin Shan1-11/+11
While registering an MMU notifier, a new mmu_notifier_mm instance is allocated and later freed if the current mm_struct's mmu_notifier_mm has already been initialized. That causes some overhead, which this patch tries to eliminate. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
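A rough user-space model of "allocate only if necessary". It is a sketch of the idea in the message, not of the actual patch: `struct notifier_mm`, `struct mm`, `register_notifier()` and the `nr_allocs` counter are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct notifier_mm {
	int nr_registered;
};

struct mm {
	struct notifier_mm *notifier_mm;
};

static int nr_allocs;  /* counts allocations, for demonstration only */

/* Register a notifier; returns 0 on success, -1 on allocation failure. */
static int register_notifier(struct mm *mm)
{
	assert(mm != NULL);

	/* Allocate the per-mm state only when it does not exist yet. */
	if (mm->notifier_mm == NULL) {
		struct notifier_mm *n = calloc(1, sizeof(*n));

		if (n == NULL)
			return -1;
		nr_allocs++;
		mm->notifier_mm = n;
	}
	mm->notifier_mm->nr_registered++;
	return 0;
}
```

In this model a second registration on the same mm reuses the existing state, so the allocate-then-free round trip never happens.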
2012-10-09mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely scheduleSagi Grimberg2-25/+49
With an RCU based mmu_notifier implementation, any callout to mmu_notifier_invalidate_range_{start,end}() or mmu_notifier_invalidate_page() would not be allowed to call schedule() as that could potentially allow a modification to the mmu_notifier structure while it is currently being used. Since srcu allocs 4 machine words per instance per cpu, we may end up with memory exhaustion if we use srcu per mm. So all mms share a global srcu. Note that during large mmu_notifier activity exit & unregister paths might hang for longer periods, but it is tolerable for current mmu_notifier clients. Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Haggai Eran <haggaie@mellanox.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: mmu_notifier: fix inconsistent memory between secondary MMU and hostXiao Guangrong1-1/+11
There is a bug in set_pte_at_notify(): it always sets the pte to the new page before releasing the old page in the secondary MMU. At that point the process accesses the new page while the secondary MMU still accesses the old page, so memory is inconsistent between them.

The scenario below shows the bug more clearly:

at the beginning: *p = 0, and p is write-protected by KSM or shared with the parent process

CPU 0                                       CPU 1
write 1 to p to trigger COW,
set_pte_at_notify will be called:
  *pte = new_page + W; /* The W bit of pte is set */

                                            *p = 1; /* pte is valid, so no #PF */

                                            return back to secondary MMU, then
                                            the secondary MMU reads p, but gets:
                                            *p == 0;

                                            /*
                                             * !!!!!!
                                             * the host has already set p to 1, but the secondary
                                             * MMU still gets the old value 0
                                             */

call mmu_notifier_change_pte to release
old page in secondary MMU

We can fix it by releasing the old page first and then setting the pte to the new page.

Note that the new page will first be used in the secondary MMU before it is mapped into the page table of the process. This is safe because the update is protected by the page table lock, so there is no race to change the pte.

[akpm@linux-foundation.org: add comment from Andrea]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
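The ordering problem in this message can be reproduced with a single-threaded model. Everything here is a simplification made for illustration: `struct mapping` tracks which page each side currently uses, and `update_mapping()` is a hypothetical helper that reports whether the intermediate state has the host on the new page while the secondary MMU is still on the old one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which page the host page table and the secondary MMU currently point at. */
struct mapping {
	int host_page;
	int secondary_page;
};

static bool host_ahead_of_secondary(const struct mapping *m, int new_page)
{
	return m->host_page == new_page && m->secondary_page != new_page;
}

/*
 * Switch both sides to new_page in one of two orders and return true if,
 * between the two steps, the host already maps new_page while the secondary
 * MMU still maps the old one (the inconsistent window described above).
 */
static bool update_mapping(struct mapping *m, int new_page, bool notify_first)
{
	bool window;

	assert(m != NULL);
	if (notify_first) {
		m->secondary_page = new_page;   /* models mmu_notifier_change_pte() */
		window = host_ahead_of_secondary(m, new_page);
		m->host_page = new_page;        /* models setting the pte */
	} else {
		m->host_page = new_page;        /* old order: pte first */
		window = host_ahead_of_secondary(m, new_page);
		m->secondary_page = new_page;
	}
	return window;
}
```

In the notify-first order the intermediate state has the secondary MMU ahead of the host instead, which the message argues is safe because the page table lock is held.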
2012-10-09mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()Mel Gorman1-1/+11
Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") introduced a potential memory corruption. shmem_alloc_page() uses a pseudo vma and it has one significant unique combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED. get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL and mpol_cond_put() DOES decrease a policy ref when a policy has MPOL_F_SHARED. Therefore, when a cpuset update race occurs, alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the reference count and frees the policy prematurely. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
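The imbalance can be modelled in a few lines of user-space C. This is a sketch with assumed names: `struct policy`, `struct vma`, `MODEL_F_SHARED`, `model_get_policy()` and `model_cond_put()` only imitate the get_vma_policy() / mpol_cond_put() pairing described above, and the `with_fix` switch illustrates one way to balance it (take a reference for shared policies on the lookup side).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_F_SHARED 0x1u  /* models MPOL_F_SHARED */

struct policy {
	int refcnt;
	unsigned int flags;
};

struct vma {
	bool has_policy_ops;   /* false models vma->vm_ops == NULL */
	struct policy *policy;
};

/* Lookup side: takes a reference only in the cases it knows about. */
static struct policy *model_get_policy(struct vma *vma, bool with_fix)
{
	struct policy *pol;

	assert(vma != NULL && vma->policy != NULL);
	pol = vma->policy;
	if (vma->has_policy_ops)
		pol->refcnt++;
	else if (with_fix && (pol->flags & MODEL_F_SHARED))
		pol->refcnt++;  /* pseudo vma carrying a shared policy */
	return pol;
}

/* Release side: drops a reference whenever the policy is marked shared. */
static void model_cond_put(struct policy *pol)
{
	if (pol->flags & MODEL_F_SHARED)
		pol->refcnt--;
}
```

For a pseudo vma (no ops) holding a shared policy, one get/put pair without the extra reference leaves the count one lower than it started, which is the premature-free condition.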
2012-10-09mempolicy: fix refcount leak in mpol_set_shared_policy()KOSAKI Motohiro1-6/+9
When shared_policy_replace() fails to allocate, new->policy is not freed correctly by mpol_set_shared_policy(). The problem is that the shared mempolicy code directly calls kmem_cache_free() in multiple places where it is easy to make a mistake. This patch creates an sp_free wrapper function and uses it. The bug was introduced pre-git age (IOW, before 2.6.12-rc2). [mgorman@suse.de: Edited changelog] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
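The benefit of a single free wrapper can be shown with a user-space sketch. The types and the `model_*` functions are hypothetical; they use malloc/free where the message talks about the slab cache, and only the wrapper's role (drop the policy reference and free the node in one place) is taken from the text.

```c
#include <assert.h>
#include <stdlib.h>

struct policy {
	int refcnt;
};

struct sp_node {
	struct policy *policy;
};

/* Drop one reference; free the policy when the last one goes away. */
static void model_policy_put(struct policy *pol)
{
	if (--pol->refcnt == 0)
		free(pol);
}

/* Allocate a node that holds its own reference on the policy. */
static struct sp_node *model_sp_alloc(struct policy *pol)
{
	struct sp_node *n = malloc(sizeof(*n));

	if (n == NULL)
		return NULL;
	pol->refcnt++;
	n->policy = pol;
	return n;
}

/*
 * Single teardown path: every caller releases the policy reference and the
 * node together, so no error path can free the node and forget the policy.
 */
static void model_sp_free(struct sp_node *n)
{
	assert(n != NULL);
	model_policy_put(n->policy);
	free(n);
}
```

Error paths then call the wrapper instead of freeing the node directly, which removes the opportunity to leak the policy reference.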
2012-10-09mempolicy: fix a race in shared_policy_replace()Mel Gorman2-22/+17
shared_policy_replace()'s use of sp_alloc() is unsafe: 1) sp_node cannot be dereferenced if sp->lock is not held, and 2) another thread can modify sp_node between the spin_unlock for allocating a new sp node and the next spin_lock. The bug was introduced before 2.6.12-rc2. Kosaki's original patch for this problem was to allocate an sp node and policy within shared_policy_replace and initialise it when the lock is reacquired. I was not keen on this approach because it partially duplicates sp_alloc(). As the paths where sp->lock is taken are not that performance critical, this patch converts sp->lock to sp->mutex so it can sleep when calling sp_alloc(). [kosaki.motohiro@jp.fujitsu.com: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mempolicy: remove mempolicy sharingKOSAKI Motohiro1-14/+38
Dave Jones' system call fuzz testing tool "trinity" triggered the following bug error with slab debugging enabled:

  =============================================================================
  BUG numa_policy (Not tainted): Poison overwritten
  -----------------------------------------------------------------------------

  INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
  INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
   __slab_alloc+0x3d3/0x445
   kmem_cache_alloc+0x29d/0x2b0
   mpol_new+0xa3/0x140
   sys_mbind+0x142/0x620
   system_call_fastpath+0x16/0x1b
  INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
   __slab_free+0x2e/0x1de
   kmem_cache_free+0x25a/0x260
   __mpol_put+0x27/0x30
   remove_vma+0x68/0x90
   exit_mmap+0x118/0x140
   mmput+0x73/0x110
   exit_mm+0x108/0x130
   do_exit+0x162/0xb90
   do_group_exit+0x4f/0xc0
   sys_exit_group+0x17/0x20
   system_call_fastpath+0x16/0x1b
  INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
  INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

The problem is that the structure is being prematurely freed due to a reference count imbalance. In the following case mbind(addr, len) should replace the memory policies of both vma1 and vma2, so they will come to share the same mempolicy, and the new mempolicy will have the MPOL_F_SHARED flag.

  +-------------------+-------------------+
  |       vma1        |    vma2(shmem)    |
  +-------------------+-------------------+
  |                                       |
 addr                                 addr+len

alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair to maintain the mempolicy reference count. The current rule is that get_vma_policy() only increments the refcount for a shmem VMA and mpol_cond_put() only decrements the refcount if the policy has MPOL_F_SHARED.

In the above case, vma1 is not a shmem vma and vma->policy has MPOL_F_SHARED! The reference count will be decreased even though it was not increased whenever alloc_page_vma() is called.

This has been broken since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting] in 2008.

There is another serious bug with the sharing of memory policies. Currently, the mempolicy rebind logic (it is called from cpuset rebinding) ignores the refcount of a mempolicy and overrides it forcibly. Thus, any mempolicy sharing may cause mempolicy corruption. The bug was introduced by commit [68860ec1: cpusets: automatic numa mempolicy rebinding].

Ideally, the shared policy handling would be rewritten to either properly handle COW of the policy structures or at least reference count MPOL_F_SHARED based exclusively on information within the policy. However, this patch takes the easier approach of disabling any policy sharing between VMAs. Each new range allocated with sp_alloc will allocate a new policy, set the reference count to 1 and drop the reference count of the old policy. This increases the memory footprint but is not expected to be a major problem as mbind() is unlikely to be used for fine-grained ranges. It is also inefficient because it means we allocate a new policy even in cases where mbind_range() could use the new_policy passed to it. However, it is more straightforward and the change should be invisible to the user.

[mgorman@suse.de: Edited changelog]
Reported-by: Dave Jones <davej@redhat.com> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09revert "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages"KOSAKI Motohiro1-17/+24
Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages") removed the code that updates vma->vm_policy, but performing that update is the purpose of mbind_range(). Now mbind_range() is virtually a no-op, and while that does not allow memory corruption, it is not the right fix. This patch is a revert. [mgorman@suse.de: Edited changelog] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: capture a suitable high-order page immediately when it is made availableMel Gorman5-29/+130
While compaction is migrating pages to free up large contiguous blocks for allocation it races with other allocation requests that may steal these blocks or break them up. This patch alters direct compaction to capture a suitable free page as soon as it becomes available to reduce this race. It uses similar logic to split_free_page() to ensure that watermarks are still obeyed. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>