aboutsummaryrefslogtreecommitdiffstats
path: root/fs (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2009-09-22ksm: fix deadlock with munlock in exit_mmapAndrea Arcangeli5-20/+8
Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Justin M. Forbes <jforbes@redhat.com> Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix oom deadlockHugh Dickins5-53/+137
There's a now-obvious deadlock in KSM's out-of-memory handling: imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex, trying to allocate a page to break KSM in an mm which becomes the OOM victim (quite likely in the unmerge case): it's killed and goes to exit, and hangs there waiting to acquire ksm_thread_mutex. Clearly we must not require ksm_thread_mutex in __ksm_exit, simple though that made everything else: perhaps use mmap_sem somehow? And part of the answer lies in the comments on unmerge_ksm_pages: __ksm_exit should also leave all the rmap_item removal to ksmd. But there's a fundamental problem, that KSM relies upon mmap_sem to guarantee the consistency of the mm it's dealing with, yet exit_mmap tears down an mm without taking mmap_sem. And bumping mm_users won't help at all, that just ensures that the pages the OOM killer assumes are on their way to being freed will not be freed. The best answer seems to be, to move the ksm_exit callout from just before exit_mmap, to the middle of exit_mmap: after the mm's pages have been freed (if the mmu_gather is flushed), but before its page tables and vma structures have been freed; and down_write,up_write mmap_sem there to serialize with KSM's own reliance on mmap_sem. But KSM then needs to be careful, whenever it downs mmap_sem, to check that the mm is not already exiting: there's a danger of using find_vma on a layout that's being torn apart, or writing into page tables which have been freed for reuse; and even do_anonymous_page and __do_fault need to check they're not being called by break_ksm to reinstate a pte after zap_pte_range has zapped that page table. Though it might be clearer to add an exiting flag, set while holding mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating a zapped pte. All we need is to check whether mm_users is 0 - but must remember that ksmd may detect that before __ksm_exit is reached. So, ksm_test_exit(mm) added to comment such checks on mm->mm_users. __ksm_exit now has to leave clearing up the rmap_items to ksmd, that needs ksm_thread_mutex; but shift the exiting mm just after the ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise mm_count to hold the mm_struct, ksmd's exit processing (exactly like its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it, similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd). But also give __ksm_exit a fast path: when there's no complication (no rmap_items attached to mm and it's not at the ksm_scan cursor), it can safely do all the exiting work itself. This is not just an optimization: when ksmd is not running, the raised mm_count would otherwise leak mm_structs. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: distribute remove_mm_from_listsHugh Dickins1-55/+42
Do some housekeeping in ksm.c, to help make the next patch easier to understand: remove the function remove_mm_from_lists, distributing its code to its callsites scan_get_next_rmap_item and __ksm_exit. That turns out to be a win in scan_get_next_rmap_item: move its remove_trailing_rmap_items and cursor advancement up, and it becomes simpler than before. __ksm_exit becomes messier, but will change again; and moving its remove_trailing_rmap_items up lets us strengthen the unstable tree item's age condition in remove_rmap_item_from_tree. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: fix endless loop on oomHugh Dickins1-23/+85
break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should only be a problem for ksmd when a memory control group imposes limits (normally the OOM killer will kill others with an mm until it succeeds); but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we do need to route the error (or kill) back to the caller (or sighandling). Test signal_pending in unmerge_ksm_pages, which could be a lengthy procedure if it has to spill into swap: returning -ERESTARTSYS so that trivial signals will restart but fatals will terminate (is that right? we do different things in different places in mm, none exactly this). unmerge_and_remove_all_rmap_items was forgetting to lock when going down the mm_list: fix that. Whether it's successful or not, reset ksm_scan cursor to head; but only if it's successful, reset seqnr (shown in full_scans) - page counts will have gone down to zero. This patch leaves a significant OOM deadlock, but it's a good step on the way, and that deadlock is fixed in a subsequent patch. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: five little cleanupsHugh Dickins1-68/+44
1. We don't use __break_cow entry point now: merge it into break_cow. 2. remove_all_slot_rmap_items is just a special case of remove_trailing_rmap_items: use the latter instead. 3. Extend comment on unmerge_ksm_pages and rmap_items. 4. try_to_merge_two_pages should use try_to_merge_with_ksm_page instead of duplicating its code; and so swap them around. 5. Comment on cmp_and_merge_page described last year's: update it. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: keep quiet while list emptyHugh Dickins1-6/+22
ksm_scan_thread already sleeps in wait_event_interruptible until setting ksm_run activates it; but if there's nothing on its list to look at, i.e. nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking up system time and full_scans: ksmd_should_run added to check that too. And move the mutex_lock out around it: the new counts showed that when ksm_run is stopped, a little work often got done afterwards, because it had been read before taking the mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: break cow once unsharedHugh Dickins1-0/+8
We kept agreeing not to bother about the unswappable shared KSM pages which later become unshared by others: observation suggests they're not a significant proportion. But they are disadvantageous, and it is easier to break COW to replace them by swappable pages, than offer statistics to show that they don't matter; then we can stop worrying about them. Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on this pass: give them a good chance of getting into the unstable tree on the next pass, or back into the stable, by computing checksum now. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: pages_unshared and pages_volatileHugh Dickins1-1/+51
The pages_shared and pages_sharing counts give a good picture of how successful KSM is at sharing; but no clue to how much wasted work it's doing to get there. Add pages_unshared (count of unique pages waiting in the unstable tree, hoping to find a mate) and pages_volatile. pages_volatile is harder to define. It includes those pages changing too fast to get into the unstable tree, but also whatever other edge conditions prevent a page getting into the trees: a high value may deserve investigation. Don't try to calculate it from the various conditions: it's the total of rmap_items less those accounted for. Also show full_scans: the number of completed scans of everything registered in the mm list. The locking for all these counts is simply ksm_thread_mutex. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: move pages_sharing updatesHugh Dickins1-15/+9
The pages_shared count is incremented and decremented when adding a node to and removing a node from the stable tree: easy to understand. But the pages_sharing count was hard to follow, being adjusted in various places: increment and decrement it when adding to and removing from the stable tree. And the pages_sharing variable used to include the pages_shared, then those were subtracted when shown in the pages_sharing sysfs file: now keep it as an exclusive count of leaves hanging off the stable tree nodes, throughout. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: rename kernel_pages_allocatedHugh Dickins1-29/+28
We're not implementing swapping of KSM pages in its first release; but when that follows, "kernel_pages_allocated" will be a very poor name for the sysfs file showing number of nodes in the stable tree: rename that to "pages_shared" throughout. But we already have a "pages_shared", counting those page slots sharing the shared pages: first rename that to... "pages_sharing". What will become of "max_kernel_pages" when the pages shared can be swapped? I guess it will just be removed, so keep that name. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: change ksm nice level to be 5Izik Eidus1-1/+1
ksm should try not to disturb other tasks as much as possible. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: change copyright messageIzik Eidus1-1/+2
Adding Hugh Dickins into the authors list. Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: prevent mremap move poisoningHugh Dickins1-0/+12
KSM's scan allows for user pages to be COWed or unmapped at any time, without requiring any notification. But its stable tree does assume that when it finds a KSM page where it placed a KSM page, then it is the same KSM page that it placed there. mremap move could break that assumption: if an area containing a KSM page was unmapped, then an area containing a different KSM page was moved with mremap into the place of the original, before KSM's scan came around to notice. That could then poison a node of the stable tree, so that memcmps would "lie" and upset the ordering of the tree. Probably noone will ever need mremap move on a VM_MERGEABLE area; except that prohibiting it would make trouble for schemes in which we try making everything VM_MERGEABLE e.g. for testing: an mremap which normally works would then fail mysteriously. There's no need to go to any trouble, such as re-sorting KSM's list of rmap_items to match the new layout: simply unmerge the area to COW all its KSM pages before moving, but leave VM_MERGEABLE on so that they're remerged later. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: Kernel SamePage MergingIzik Eidus1-5/+1484
Ksm is code that allows merging of identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only, then COWed when any application tries to change them. Whereas fork() allows sharing anonymous pages between parent and child, ksm can share anonymous pages between unrelated processes. Ksm works by walking over the memory pages of the applications it scans, in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to locate identical pages in an effective way. When ksm finds two identical pages, it marks them as readonly and merges them into a single page. After the pages have been marked as readonly and merged into one, Linux treats them as normal copy-on-write pages, copying to a fresh anonymous page if write access is required later. Ksm scans and merges anonymous pages only in those memory areas that have been registered with it by madvise(addr, length, MADV_MERGEABLE). The ksm scanner is controlled by sysfs files in /sys/kernel/mm/ksm/: max_kernel_pages - the maximum number of unswappable kernel pages which may be allocated by ksm (0 for unlimited). kernel_pages_allocated - how many ksm pages are currently allocated, sharing identical content between different processes (pages unswappable in this release). pages_shared - how many pages have been saved by sharing with ksm pages (kernel_pages_allocated being excluded from this count). pages_to_scan - how many pages ksm should scan before sleeping. sleep_millisecs - how many milliseconds ksm should sleep between scans. run - write 0 to disable ksm, read 0 while ksm is disabled (default), write 1 to run ksm, read 1 while ksm is running, write 2 to disable ksm and unmerge all its pages. Includes contributions by Andrea Arcangeli Chris Wright and Hugh Dickins. [hugh.dickins@tiscali.co.uk: fix rare page leak] Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: identify PageKsm pagesHugh Dickins3-1/+36
KSM will need to identify its kernel merged pages unambiguously, and /proc/kpageflags will probably like to do so too. Since KSM will only be substituting anonymous pages, statistics are best preserved by making a PageKsm page a special PageAnon page: one with no anon_vma. But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: no debug in page_dup_rmap()Hugh Dickins3-27/+2
page_dup_rmap(), used on each mapped page when forking, was originally just an inline atomic_inc of mapcount. 2.6.22 added CONFIG_DEBUG_VM out-of-line checks to it, which would need to be ever-so-slightly complicated to allow for the PageKsm() we're about to define. But I think these checks never caught anything. And if it's coding errors we're worried about, such checks should be in page_remove_rmap() too, not just when forking; whereas if it's pagetable corruption we're worried about, then they shouldn't be limited to CONFIG_DEBUG_VM. Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: the mm interface to ksmHugh Dickins8-1/+147
This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: define MADV_MERGEABLE and MADV_UNMERGEABLEHugh Dickins5-0/+15
The out-of-tree KSM used ioctls on fds cloned from /dev/ksm to register a memory area for merging: we prefer now to use an madvise(2) interface. This patch just defines MADV_MERGEABLE (to tell KSM it may merge pages in this area found identical to pages in other mergeable areas) and MADV_UNMERGEABLE (to undo that). Most architectures use asm-generic, but alpha, mips, parisc, xtensa need their own definitions: included here for mmotm convenience, but we'll probably want to split this and feed pieces to arch maintainers. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: first tidy up madvise_vma()Hugh Dickins1-26/+13
madvise.c has several levels of switch statements, what to do in which? Move MADV_DOFORK code down from madvise_vma() to madvise_behavior(), so madvise_vma() can be a simple router, to madvise_behavior() by default. vma->vm_flags is an unsigned long so use the same type for new_flags. Add missing comment lines to describe MADV_DONTFORK and MADV_DOFORK. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22ksm: add mmu_notifier set_pte_at_notify()Izik Eidus3-2/+61
KSM is a linux driver that allows dynamicly sharing identical memory pages between one or more processes. Unlike tradtional page sharing that is made at the allocation of the memory, ksm do it dynamicly after the memory was created. Memory is periodically scanned; identical pages are identified and merged. The sharing is made in a transparent way to the processes that use it. Ksm is highly important for hypervisors (kvm), where in production enviorments there might be many copys of the same data data among the host memory. This kind of data can be: similar kernels, librarys, cache, and so on. Even that ksm was wrote for kvm, any userspace application that want to use it to share its data can try it. Ksm may be useful for any application that might have similar (page aligment) data strctures among the memory, ksm will find this data merge it to one copy, and even if it will be changed and thereforew copy on writed, ksm will merge it again as soon as it will be identical again. Another reason to consider using ksm is the fact that it might simplify alot the userspace code of application that want to use shared private data, instead that the application will mange shared area, ksm will do this for the application, and even write to this data will be allowed without any synchinization acts from the application. Ksm was designed to be a loadable module that doesn't change the VM code of linux. This patch: The set_pte_at_notify() macro allows setting a pte in the shadow page table directly, instead of flushing the shadow page table entry and then getting vmexit to set it. It uses a new change_pte() callback to do so. set_pte_at_notify() is an optimization for kvm, and other users of mmu_notifiers, for COW pages. It is useful for kvm when ksm is used, because it allows kvm not to have to receive vmexit and only then map the ksm page into the shadow page table, but instead map it directly at the same time as Linux maps the page into the host page table. Users of mmu_notifiers who don't implement new mmu_notifier_change_pte() callback will just receive the mmu_notifier_invalidate_page() callback. Signed-off-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: perform non-atomic test-clear of PG_mlocked on freeJohannes Weiner2-5/+11
By the time PG_mlocked is cleared in the page freeing path, nobody else is looking at our page->flags anymore. It is thus safe to make the test-and-clear non-atomic and thereby removing an unnecessary and expensive operation from a hotpath. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmalloc.c: fix double error checkingFigo.zhang1-3/+1
There is no need for double error checking. Signed-off-by: Figo.zhang <figo1802@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: add gfp mask checking for __get_free_pages()Akinobu Mita1-15/+9
__get_free_pages() with __GFP_HIGHMEM is not safe because the return address cannot represent a highmem page. get_zeroed_page() already has such a debug checking. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmscan: kill unnecessary prefetchKOSAKI Motohiro1-1/+0
The pages in the list passed move_active_pages_to_lru() are already touched by shrink_active_list(). IOW the prefetch in move_active_pages_to_lru() don't populate any cache. it's pointless. This patch remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmscan: kill unnecessary page flag testKOSAKI Motohiro1-2/+2
The page_lru() already evaluate PageActive() and PageSwapBacked(). We don't need to re-evaluate it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmscan: move ClearPageActive from move_active_pages() to shrink_active_list()KOSAKI Motohiro1-4/+1
The move_active_pages_to_lru() function is called under irq disabled and ClearPageActive() doesn't need irq disabling. Then, this patch move it into shrink_active_list(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmscan: don't attempt to reclaim anon page in lumpy reclaim when no swap space is availableMinchan Kim1-0/+10
The VM already avoids attempting to reclaim anon pages in various places, But it doesn't avoid it for lumpy reclaim. It shuffles lru list unnecessary so that it is pointless. [akpm@linux-foundation.org: cleanup] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: count only reclaimable lru pagesWu Fengguang3-22/+44
global_lru_pages() / zone_lru_pages() can be used in two ways: - to estimate max reclaimable pages in determine_dirtyable_memory() - to calculate the slab scan ratio When swap is full or not present, the anon lru lists are not reclaimable and also won't be scanned. So the anon pages shall not be counted in both usage scenarios. Also rename to _reclaimable_pages: now they are counting the possibly reclaimable lru pages. It can greatly (and correctly) increase the slab scan rate under high memory pressure (when most file pages have been reclaimed and swap is full/absent), thus reduce false OOM kills. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: David Howells <dhowells@redhat.com> Cc: "Li, Ming Chun" <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vm: document that setting vfs_cache_pressure to 0 isn't a good ideaJan Kara1-1/+3
Reported-by: Christian Thaeter <ct@pipapo.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: remove __{add,sub}_zone_page_state()KOSAKI Motohiro1-5/+0
__add_zone_page_state() and __sub_zone_page_state() are unused. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmscan: throttle direct reclaim when too many pages are isolated alreadyRik van Riel1-0/+33
When way too many processes go into direct reclaim, it is possible for all of the pages to be taken off the LRU. One result of this is that the next process in the page reclaim code thinks there are no reclaimable pages left and triggers an out of memory kill. One solution to this problem is to never let so many processes into the page reclaim path that the entire LRU is emptied. Limiting the system to only having half of each inactive list isolated for reclaim should be safe. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: vmstat: add isolate pagesKOSAKI Motohiro5-4/+35
If the system is running a heavy load of processes then concurrent reclaim can isolate a large number of pages from the LRU. /proc/vmstat and the output generated for an OOM do not show how many pages were isolated. This has been observed during process fork bomb testing (mstctl11 in LTP). This patch shows the information about isolated pages. Reproduced via: ----------------------- % ./hackbench 140 process 1000 => OOM occur active_anon:146 inactive_anon:0 isolated_anon:49245 active_file:79 inactive_file:18 isolated_file:113 unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39 free:370 slab_reclaimable:309 slab_unreclaimable:5492 mapped:53 shmem:15 pagetables:28140 bounce:0 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: shrink_inactive_list() nr_scan accounting fix fixKOSAKI Motohiro1-12/+18
If sc->isolate_pages() return 0, we don't need to call shrink_page_list(). In past days, shrink_inactive_list() handled it properly. But commit fb8d14e1 (three years ago commit!) breaked it. current shrink_inactive_list() always call shrink_page_list() although isolate_pages() return 0. This patch restore proper return value check. Requirements: o "nr_taken == 0" condition should stay before calling shrink_page_list(). o "nr_taken == 0" condition should stay after nr_scan related statistics modification. Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: rename pgmoved variable in shrink_active_list()KOSAKI Motohiro1-8/+8
Currently the pgmoved variable has two meanings. It causes harder reviewing. This patch separates it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: update alloc_flags after oom killer has been calledDavid Rientjes1-1/+1
It is possible for the oom killer to select current as the task to kill. When this happens, alloc_flags needs to be updated accordingly to set ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory reserves as the result of its thread having TIF_MEMDIE set if the allocation is not __GFP_NOMEMALLOC. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: oom analysis: add shmem vmstatKOSAKI Motohiro7-3/+18
Recently we encountered OOM problems due to memory use of the GEM cache. Generally a large amuont of Shmem/Tmpfs pages tend to create a memory shortage problem. We often use the following calculation to determine the amount of shmem pages: shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES however the expression does not consider isolated and mlocked pages. This patch adds explicit accounting for pages used by shmem and tmpfs. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log outputKOSAKI Motohiro6-1/+22
The amount of memory allocated to kernel stacks can become significant and cause OOM conditions. However, we do not display the amount of memory consumed by stacks. Add code to display the amount of memory used for stacks in /proc/meminfo. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: oom analysis: add buffer cache information to show_free_areas()KOSAKI Motohiro1-1/+2
It is often useful to know the statistics for all pages that are handled like page cache pages when looking at OOM log output. Therefore show_free_areas() should also display buffer cache statistics. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: oom analysis: add per-zone statistics to show_free_areas()KOSAKI Motohiro1-0/+20
show_free_areas() displays only a limited amount of zone counters. This patch includes additional counters in the display to allow easier debugging. This may be especially useful if an OOM is due to running out of DMA memory. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22Documentation/memory.txt: remove some very outdated recommendationsAndi Kleen1-29/+2
Remove some very outdated recommendations in Documentation/memory.txt Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: show_free_areas(): display slab pages in two separate fieldsKOSAKI Motohiro1-3/+4
If an OOM happens, we really want to know the number of remaining reclaimable pages. So the reclaimable slab and unreclaimable slab fields should not be combined for display. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: clean up page_remove_rmap()KOSAKI Motohiro1-27/+30
page_remove_rmap() has multiple PageAnon() tests and it has deep nesting. Clean this up. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22hugetlb: restore interleaving of bootmem huge pagesLee Schermerhorn1-1/+1
I noticed that alloc_bootmem_huge_page() will only advance to the next node on failure to allocate a huge page, potentially filling nodes with huge-pages. I asked about this on linux-mm and linux-numa, cc'ing the usual huge page suspects. Mel Gorman responded: I strongly suspect that the same node being used until allocation failure instead of round-robin is an oversight and not deliberate at all. It appears to be a side-effect of a fix made way back in commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix hugepage allocation with memoryless nodes"]. Prior to that patch it looked like allocations would always round-robin even when allocation was successful. This patch--factored out of my "hugetlb mempolicy" series--moves the advance of the hstate next node from which to allocate up before the test for success of the attempted allocation. Note that alloc_bootmem_huge_page() is only used for order > MAX_ORDER huge pages. I'll post a separate patch for mainline/stable, as the above mentioned "balance freeing" series renamed the next node to alloc function. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andy Whitcroft <apw@canonical.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22hugetlb: clean up and update huge pages documentationLee Schermerhorn1-46/+87
Attempt to clarify huge page administration and usage, and updates the doucmentation to mention the balancing of huge pages across nodes when allocating and freeing. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22hugetlb: use free_pool_huge_page() to return unused surplus pagesLee Schermerhorn1-33/+24
Use the [modified] free_pool_huge_page() function to return unused surplus pages. This will help keep huge pages balanced across nodes between freeing of unused surplus pages and freeing of persistent huge pages [from set_max_huge_pages] by using the same node id "cursor". It also eliminates some code duplication. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22hugetlb: balance freeing of huge pages across nodesLee Schermerhorn2-47/+88
Free huges pages from nodes in round robin fashion in an attempt to keep [persistent a.k.a static] hugepages balanced across nodes New function free_pool_huge_page() is modeled on and performs roughly the inverse of alloc_fresh_huge_page(). Replaces dequeue_huge_page() which now has no callers, so this patch removes it. Helper function hstate_next_node_to_free() uses new hstate member next_to_free_nid to distribute "frees" across all nodes with huge pages. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22page_alloc: fix kernel-doc warningRandy Dunlap1-1/+1
Ummark function as having kernel-doc notation, fixing the kernel-doc warning. Warning(mm/page_alloc.c:4519): No description found for parameter 'zone' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22memory hotplug: migrate swap cache pageShaohua Li1-2/+4
In test, some pages in swap-cache can't be migrated, as they aren't rmap. unmap_and_move() ignores swap-cache page which is just read in and hasn't rmap (see the comments in the code), but swap_aops provides .migratepage. Better to migrate such pages instead of ignore them. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22memory hotplug: alloc page from other node in memory onlineShaohua Li3-7/+22
To initialize hotadded node, some pages are allocated. At that time, the node hasn't memory, this makes the allocation always fail. In such case, let's allocate pages from other nodes. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Yakui Zhao <yakui.zhao@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22memory hotplug: make pages from movable zone always isolatableShaohua Li1-1/+4
Pages on movable zone have two types, MIGRATE_MOVABLE and MIGRATE_RESERVE, both them can be movable, because only movable memory allocation can get pages from movable zone. This makes pages in movable zone always be able to migrate. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>