aboutsummaryrefslogtreecommitdiffstats
path: root/mm (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2009-04-02nommu: fix a number of issues with the per-MM VMA patchDavid Howells2-30/+25
Fix a number of issues with the per-MM VMA patch: (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on a NOMMU system with more than 2G pages. Makes no difference on a 32-bit system. (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value, lest it overflow. (3) Move the allocation of the vm_area_struct slab back for fork.c. (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs. (5) Use BUG_ON() rather than if () BUG(). (6) Make the default validate_nommu_regions() a static inline rather than a #define. (7) Make free_page_series()'s objection to pages with a refcount != 1 more informative. (8) Adjust the __put_nommu_region() banner comment to indicate that the semaphore must be held for writing. (9) Limit the number of warnings about munmaps of non-mmapped regions. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02generic debug pagealloc: build fixAkinobu Mita1-0/+9
This fixes a build failure with generic debug pagealloc: mm/debug-pagealloc.c: In function 'set_page_poison': mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags' mm/debug-pagealloc.c: In function 'clear_page_poison': mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags' mm/debug-pagealloc.c: In function 'page_poison': mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags' mm/debug-pagealloc.c: At top level: mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages' include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here mm/debug-pagealloc.c: In function 'kernel_map_pages': mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function) by fixing - debug_flags should be in struct page - define DEBUG_PAGEALLOC config option for all architectures Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01shmem: writepage directly to swapHugh Dickins1-2/+1
Synopsis: if shmem_writepage calls swap_writepage directly, most shmem swap loads benefit, and a catastrophic interaction between SLUB and some flash storage is avoided. shmem_writepage() has always been peculiar in making no attempt to write: it has just transferred a shmem page from file cache to swap cache, then let that page make its way around the LRU again before being written and freed. The idea was that people use tmpfs because they want those pages to stay in RAM; so although we give it an overflow to swap, we should resist writing too soon, giving those pages a second chance before they can be reclaimed. That was always questionable, and I've toyed with this patch for years; but never had a clear justification to depart from the original design. It became more questionable in 2.6.28, when the split LRU patches classed shmem and tmpfs pages as SwapBacked rather than as file_cache: that in itself gives them more resistance to reclaim than normal file pages. I prepared this patch for 2.6.29, but the merge window arrived before I'd completed gathering statistics to justify sending it in. Then while comparing SLQB against SLUB, running SLUB on a laptop I'd habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping tests five times slower than SLAB or SLQB - other machines slower too, but nowhere near so bad. Simpler "cp -a" swapping tests showed the same. slub_max_order=0 brings sanity to all, but heavy swapping is too far from normal to justify such a tuning. The crucial factor on that laptop turns out to be that I'm using an SD card for swap. What happens is this: By default, SLUB uses order-2 pages for shmem_inode_cache (and many other fs inodes), so creating tmpfs files under memory pressure brings lumpy reclaim into play. One subpage of the order is chosen from the bottom of the LRU as usual, then the other three picked out from their random positions on the LRUs. In a tmpfs load, many of these pages will be ones which already passed through shmem_writepage, so already have swap allocated. And though their offsets on swap were probably allocated sequentially, now that the pages are picked off at random, their swap offsets are scattered. But the flash storage on the SD card is very sensitive to having its writes merged: once swap is written at scattered offsets, performance falls apart. Rotating disk seeks increase too, but less disastrously. So: stop giving shmem/tmpfs pages a second pass around the LRU, write them out to swap as soon as their swap has been allocated. It's surely possible to devise an artificial load which runs faster the old way, one whose sizing is such that the tmpfs pages on their second pass are the ones that are wanted again, and other pages not. But I've not yet found such a load: on all machines, under the loads I've tried, immediate swap_writepage speeds up shmem swapping: especially when using the SLUB allocator (and more effectively than slub_max_order=0), but also with the others; and it also reduces the variance between runs. How much faster varies widely: a factor of five is rare, 5% is common. One load which might have suffered: imagine a swapping shmem load in a limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the swapcache was not charged, and such a load would have run quickest with the shmem swapcache never written to swap. But now swapcache is charged, so even this load benefits from shmem_writepage directly to swap. Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h: it's silly because that will never get called; but refactoring shmem.c sensibly according to CONFIG_SWAP will be a separate task. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmscan: fix it to take care of nodemaskKAMEZAWA Hiroyuki2-3/+13
try_to_free_pages() is used for the direct reclaim of up to SWAP_CLUSTER_MAX pages when watermarks are low. The caller to alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to be used but this is not passed to try_to_free_pages(). This can lead to unnecessary reclaim of pages that are unusable by the caller and int the worst case lead to allocation failure as progress was not been make where it is needed. This patch passes the nodemask used for alloc_pages_nodemask() to try_to_free_pages(). Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmscan: print shrink_slab symbol name on negative shrinker objectsDavid Rientjes1-2/+3
When a shrinker has a negative number of objects to delete, the symbol name of the shrinker should be printed, not shrink_slab. This also makes the error message slightly more informative. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01nommu: make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=nDavid Howells1-1/+0
Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical reason it shouldn't be available, and it can be used for ramfs. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Enrik Berkhan <Enrik.Berkhan@ge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01nommu: there is no mlock() for NOMMU, so don't provide the bitsDavid Howells2-3/+13
The mlock() facility does not exist for NOMMU since all mappings are effectively locked anyway, so we don't make the bits available when they're not useful. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Enrik Berkhan <Enrik.Berkhan@ge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: introduce debug_kmap_atomicAkinobu Mita1-0/+45
x86 has debug_kmap_atomic_prot() which is error checking function for kmap_atomic. It is usefull for the other architectures, although it needs CONFIG_TRACE_IRQFLAGS_SUPPORT. This patch exposes it to the other architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: page_mkwrite change prototype to match faultNick Piggin1-4/+22
Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: fix proc_dointvec_userhz_jiffies "breakage"Alexey Dobriyan1-9/+11
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838 On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange way from the user's point of view: # echo 500 >/proc/sys/vm/dirty_writeback_centisecs # cat /proc/sys/vm/dirty_writeback_centisecs 499 So, we have 5000 jiffies converted to only 499 clock ticks and reported back. TICK_NSEC = 999848 ACTHZ = 256039 Keeping in-kernel variable in units passed from userspace will fix issue of course, but this probably won't be right for every sysctl. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01generic debug pageallocAkinobu Mita3-0/+147
CONFIG_DEBUG_PAGEALLOC is now supported by x86, powerpc, sparc64, and s390. This patch implements it for the rest of the architectures by filling the pages with poison byte patterns after free_pages() and verifying the poison patterns before alloc_pages(). This generic one cannot detect invalid page accesses immediately but invalid read access may cause invalid dereference by poisoned memory and invalid write access can be detected after a long delay. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01memdup_user(): introduceLi Zefan1-0/+30
I notice there are many places doing copy_from_user() which follows kmalloc(): dst = kmalloc(len, GFP_KERNEL); if (!dst) return -ENOMEM; if (copy_from_user(dst, src, len)) { kfree(dst); return -EFAULT } memdup_user() is a wrapper of the above code. With this new function, we don't have to write 'len' twice, which can lead to typos/mistakes. It also produces smaller code and kernel text. A quick grep shows 250+ places where memdup_user() *may* be used. I'll prepare a patchset to do this conversion. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01hugetlb: chg cannot become less than 0Roel Kluin1-3/+3
chg is unsigned, so it cannot be less than 0. Also, since region_chg returns long, let vma_needs_reservation() forward this to alloc_huge_page(). Store it as long as well. all callers cast it to long anyway. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: remove pagevec_swap_free()KOSAKI Motohiro1-23/+0
pagevec_swap_free() is now unused. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: don't free swap slots on page deactivationJohannes Weiner1-3/+0
The pagevec_swap_free() at the end of shrink_active_list() was introduced in 68a22394 "vmscan: free swap space on swap-in/activation" when shrink_active_list() was still rotating referenced active pages. In 7e9cd48 "vmscan: fix pagecache reclaim referenced bit check" this was changed, the rotating removed but the pagevec_swap_free() after the rotation loop was forgotten, applying now to the pagevec of the deactivation loop instead. Now swap space is freed for deactivated pages. And only for those that happen to be on the pagevec after the deactivation loop. Complete 7e9cd48 and remove the rest of the swap freeing. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: move pagevec stripping to save unlock-relockJohannes Weiner1-5/+2
In shrink_active_list() after the deactivation loop, we strip buffer heads from the potentially remaining pages in the pagevec. Currently, this drops the zone's lru lock for stripping, only to reacquire it again afterwards to update statistics. It is not necessary to strip the pages before updating the stats, so move the whole thing out of the protected region and save the extra locking. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vfs: add/use account_page_dirtied()Edward Shishkin1-7/+15
Add a helper function account_page_dirtied(). Use that from two callsites. reiser4 adds a function which adds a third callsite. Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmscan: respect higher order in zone_reclaim()Johannes Weiner1-0/+1
During page allocation, there are two stages of direct reclaim that are applied to each zone in the preferred list. The first stage using zone_reclaim() reclaims unmapped file backed pages and slab pages if over defined limits as these are cheaper to reclaim. The caller specifies the order of the target allocation but the scan control is not being correctly initialised. The impact is that the correct number of pages are being reclaimed but that lumpy reclaim is not being applied. This increases the chances of a full direct reclaim via try_to_free_pages() is required. This patch initialises the order field of the scan control as requested by the caller. [mel@csn.ul.ie: rewrote changelog] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: add comment why mark_page_accessed() would be better than pte_mkyoung() in follow_page()KOSAKI Motohiro1-0/+5
At first look, mark_page_accessed() in follow_page() seems a bit strange. It seems pte_mkyoung() would be better consistent with other kernel code. However, it is intentional. The commit log said: ------------------------------------------------ commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8 Author: akpm <akpm> Date: Fri Aug 15 07:24:59 2003 +0000 [PATCH] Use mark_page_accessed() in follow_page() Touching a page via follow_page() counts as a reference so we should be either setting the referenced bit in the pte or running mark_page_accessed(). Altering the pte is tricky because we haven't implemented an atomic pte_mkyoung(). And mark_page_accessed() is better anyway because it has more aging state: it can move the page onto the active list. BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw ------------------------------------------------ The atomic issue is still true nowadays. adding comment help to understand code intention and it would be better. [akpm@linux-foundation.org: clarify text] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmscan: clip swap_cluster_max in shrink_all_memory()Johannes Weiner1-1/+1
shrink_inactive_list() scans in sc->swap_cluster_max chunks until it hits the scan limit it was passed. shrink_inactive_list() { do { isolate_pages(swap_cluster_max) shrink_page_list() } while (nr_scanned < max_scan); } This assumes that swap_cluster_max is not bigger than the scan limit because the latter is checked only after at least one iteration. In shrink_all_memory() sc->swap_cluster_max is initialized to the overall reclaim goal in the beginning but not decreased while reclaim is making progress which leads to subsequent calls to shrink_inactive_list() reclaiming way too much in the one iteration that is done unconditionally. Set sc->swap_cluster_max always to the proper goal before doing shrink_all_zones() shrink_list() shrink_inactive_list(). While the current shrink_all_memory() happily reclaims more than actually requested, this patch fixes it to never exceed the goal: unpatched wanted=10000 reclaimed=13356 wanted=10000 reclaimed=19711 wanted=10000 reclaimed=10289 wanted=10000 reclaimed=17306 wanted=10000 reclaimed=10700 wanted=10000 reclaimed=10004 wanted=10000 reclaimed=13301 wanted=10000 reclaimed=10976 wanted=10000 reclaimed=10605 wanted=10000 reclaimed=10088 wanted=10000 reclaimed=15000 patched wanted=10000 reclaimed=10000 wanted=10000 reclaimed=9599 wanted=10000 reclaimed=8476 wanted=10000 reclaimed=8326 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=9919 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=9624 wanted=10000 reclaimed=10000 wanted=10000 reclaimed=10000 wanted=8500 reclaimed=8092 wanted=316 reclaimed=316 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: MinChan Kim <minchan.kim@gmail.com> Acked-by: Nigel Cunningham <ncunningham@crca.org.au> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: shrink_all_memory(): use sc.nr_reclaimedMinChan Kim1-22/+24
Commit a79311c14eae4bb946a97af25f3e1b17d625985d "vmscan: bail out of direct reclaim after swap_cluster_max pages" moved the nr_reclaimed counter into the scan control to accumulate the number of all reclaimed pages in a reclaim invocation. shrink_all_memory() can use the same mechanism. it increase code consistency and redability. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: don't call mark_page_accessed() in do_swap_page()KOSAKI Motohiro1-2/+0
commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't mark_page_accessed in fault path) only remove the mark_page_accessed() in filemap_fault(). Therefore, swap-backed pages and file-backed pages have inconsistent behavior. mark_page_accessed() should be removed from do_swap_page(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: introduce for_each_populated_zone() macroKOSAKI Motohiro3-33/+8
Impact: cleanup In almost cases, for_each_zone() is used with populated_zone(). It's because almost function doesn't need memoryless node information. Therefore, for_each_populated_zone() can help to make code simplify. This patch has no functional change. [akpm@linux-foundation.org: small cleanup] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmscan: rename sc.may_swap to may_unmapJohannes Weiner1-10/+10
sc.may_swap does not only influence reclaiming of anon pages but pages mapped into pagetables in general, which also includes mapped file pages. In shrink_page_list(): if (!sc->may_swap && page_mapped(page)) goto keep_locked; For anon pages, this makes sense as they are always mapped and reclaiming them always requires swapping. But mapped file pages are skipped here as well and it has nothing to do with swapping. The real effect of the knob is whether mapped pages are unmapped and reclaimed or not. Rename it to `may_unmap' to have its name match its actual meaning more precisely. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: MinChan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01oom_kill: don't call for int_sqrt(0)Cyrill Gorcunov1-7/+5
There is no need to call for int_sqrt if argument is 0. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmap: remove needless lock and list in vmapMinChan Kim1-16/+3
vmap's dirty_list is unused. It's for optimizing flushing. but Nick didn't write the code yet. so, we don't need it until time as it is needed. This patch removes vmap_block's dirty_list and codes related to it. Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01mm: mminit_validate_memmodel_limits(): remove redundant testCyrill Gorcunov1-3/+1
In case if start_pfn overlap the upper bound no need to test end_pfn again since we have it already trimmed. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-31tracing, Text Edit Lock: cleanupIngo Molnar1-2/+0
Remove incorrectly introduced headers. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-30lockdep: annotate reclaim context (__GFP_NOFS), fix SLOBIngo Molnar1-1/+1
Impact: build fix fix typo in mm/slob.c: mm/slob.c:469: error: ‘flags’ undeclared (first use in this function) mm/slob.c:469: error: (Each undeclared identifier is reported only once mm/slob.c:469: error: for each function it appears in.) Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090128135457.350751756@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-30trivial: Fix dubious bitwise 'or' usage spotted by sparse.Alexey Zaytsev1-1/+1
It doesn't change the semantics, but it looks like the logical 'or' was meant to be used here. Signed-off-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-03-30cpumask: use new cpumask_ functions in core code.Rusty Russell2-2/+2
Impact: cleanup Time to clean up remaining laggards using the old cpu_ functions. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Trond.Myklebust@netapp.com
2009-03-30cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALLRusty Russell1-1/+1
Impact: cleanup (Thanks to Al Viro for reminding me of this, via Ingo) CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so: #define CPU_MASK_ALL (cpumask_t) { { ... } } Taking the address of such a temporary is questionable at best, unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added CPU_MASK_ALL_PTR: #define CPU_MASK_ALL_PTR (&CPU_MASK_ALL) Which formalizes this practice. One day gcc could bite us over this usage (though we seem to have gotten away with it so far). So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR with the modern "cpu_all_mask" (a real const struct cpumask *). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mike Travis <travis@sgi.com>
2009-03-26writeback: double the dirty thresholdsWu Fengguang1-2/+2
Enlarge default dirty ratios from 5/10 to 10/20. This fixes [Bug #12809] iozone regression with 2.6.29-rc6. The iozone benchmarks are performed on a 1200M file, with 8GB ram. iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls The performance regression is triggered by commit 1cf6e7d83bf3(mm: task dirty accounting fix), which makes more correct/thorough dirty accounting. The default 5/10 dirty ratios were picked (a) with the old dirty logic and (b) largely at random and (c) designed to be aggressive. In particular, that (a) means that having fixed some of the dirty accounting, maybe the real bug is now that it was always too aggressive, just hidden by an accounting issue. The enlarged 10/20 dirty ratios are just about enough to fix the regression. [ We will have to look at how this affects the old fsync() latency issue, but that probably will need independent work. - Linus ] Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: "Lin, Ming M" <ming.m.lin@intel.com> Tested-by: "Lin, Ming M" <ming.m.lin@intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26Move the default_backing_dev_info out of readahead.c and into backing-dev.cJens Axboe2-26/+25
It really makes no sense to have it in readahead.c, so move it where it belongs. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-23slob: fix lockup in slob_free()Nick Piggin1-1/+2
Don't hold SLOB lock when freeing the page. Reduces lock hold width. See the following thread for discussion of the bug: http://marc.info/?l=linux-kernel&m=123709983214143&w=2 Reported-by: Ingo Molnar <mingo@elte.hu> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-03-23slub: use get_track()Akinobu Mita1-7/+1
Use get_track() in set_track() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-03-20tracing, Text Edit Lock - kprobes architecture independent support, nommu fixIngo Molnar1-8/+0
Impact: build fix on SH !CONFIG_MMU Stephen Rothwell reported this linux-next build failure on the SH architecture: kernel/built-in.o: In function `disable_all_kprobes': kernel/kprobes.c:1382: undefined reference to `text_mutex' [...] And observed: | Introduced by commit 4460fdad85becd569f11501ad5b91814814335ff ("tracing, | Text Edit Lock - kprobes architecture independent support") from the | tracing tree. text_mutex is defined in mm/memory.c which is only built | if CONFIG_MMU is defined, which is not true for sh allmodconfig. Move this lock to kernel/extable.c (which is already home to various kernel text related routines), which file is always built-in. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> LKML-Reference: <20090320110602.86351a91.sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-15highmem: atomic highmem kmap page pinningNicolas Pitre1-8/+57
Most ARM machines have a non IO coherent cache, meaning that the dma_map_*() set of functions must clean and/or invalidate the affected memory manually before DMA occurs. And because the majority of those machines have a VIVT cache, the cache maintenance operations must be performed using virtual addresses. When a highmem page is kunmap'd, its mapping (and cache) remains in place in case it is kmap'd again. However if dma_map_page() is then called with such a page, some cache maintenance on the remaining mapping must be performed. In that case, page_address(page) is non null and we can use that to synchronize the cache. It is unlikely but still possible for kmap() to race and recycle the virtual address obtained above, and use it for another page before some on-going cache invalidation loop in dma_map_page() is done. In that case, the new mapping could end up with dirty cache lines for another page, and the unsuspecting cache invalidation loop in dma_map_page() might simply discard those dirty cache lines resulting in data loss. For example, let's consider this sequence of events: - dma_map_page(..., DMA_FROM_DEVICE) is called on a highmem page. --> - vaddr = page_address(page) is non null. In this case it is likely that the page has valid cache lines associated with vaddr. Remember that the cache is VIVT. --> for (i = vaddr; i < vaddr + PAGE_SIZE; i += 32) invalidate_cache_line(i); *** preemption occurs in the middle of the loop above *** - kmap_high() is called for a different page. --> - last_pkmap_nr wraps to zero and flush_all_zero_pkmaps() is called. The pkmap_count value for the page passed to dma_map_page() above happens to be 1, so the page is unmapped. But prior to that, flush_cache_kmaps() cleared the cache for it. So far so good. - A fresh pkmap entry is assigned for this kmap request. The Murphy law says this pkmap entry will eventually happen to use the same vaddr as the one which used to belong to the other page being processed by dma_map_page() in the preempted thread above. - The kmap_high() caller start dirtying the cache using the just assigned virtual mapping for its page. *** the first thread is rescheduled *** - The for(...) loop is resumed, but now cached data belonging to a different physical page is being discarded ! And this is not only a preemption issue as ARM can be SMP as well, making the above scenario just as likely. Hence the need for some kind of pkmap page pinning which can be used in any context, primarily for the benefit of dma_map_page() on ARM. This provides the necessary interface to cope with the above issue if ARCH_NEEDS_KMAP_HIGH_GET is defined, otherwise the resulting code is unchanged. Signed-off-by: Nicolas Pitre <nico@marvell.com> Reviewed-by: MinChan Kim <minchan.kim@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2009-03-14vmscan: pgmoved should be cleared after updating recent_rotatedDaisuke Nishimura1-1/+1
pgmoved should be cleared after updating recent_rotated. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-14VM, x86, PAT: add a new vm flag to track full pfnmap at mmapPallipadi, Venkatesh1-2/+2
Impact: cleanup Add a new vm flag VM_PFN_AT_MMAP to identify a PFNMAP that is fully mapped with remap_pfn_range. Patch removes the overloading of VM_INSERTPAGE from the earlier patch. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Nick Piggin <npiggin@suse.de> LKML-Reference: <20090313233543.GA19909@linux-os.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13cpumask: replace node_to_cpumask with cpumask_of_node.Rusty Russell4-7/+9
Impact: cleanup node_to_cpumask (and the blecherous node_to_cpumask_ptr which contained a declaration) are replaced now everyone implements cpumask_of_node. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-13VM, x86, PAT: Change is_linear_pfn_mapping to not use vm_pgoffPallipadi, Venkatesh1-2/+4
Impact: fix false positive PAT warnings - also fix VirtalBox hang Use of vma->vm_pgoff to identify the pfnmaps that are fully mapped at mmap time is broken. vm_pgoff is set by generic mmap code even for cases where drivers are setting up the mappings at the fault time. The problem was originally reported here: http://marc.info/?l=linux-kernel&m=123383810628583&w=2 Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE flag along with VM_PFNMAP to mean full PFNMAP setup at mmap time. Problem also tracked at: http://bugzilla.kernel.org/show_bug.cgi?id=12800 Reported-by: Thomas Hellstrom <thellstrom@vmware.com> Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha>@intel.com> Cc: Nick Piggin <npiggin@suse.de> Cc: "ebiederm@xmission.com" <ebiederm@xmission.com> Cc: <stable@kernel.org> # only for 2.6.29.1, not .28 LKML-Reference: <20090313004527.GA7176@linux-os.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12memcg: use correct scan number at reclaimKOSAKI Motohiro1-1/+1
Even when page reclaim is under mem_cgroup, # of scan page is determined by status of global LRU. Fix that. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-11percpu: fix spurious alignment WARN in legacy SMP percpu allocatorTejun Heo1-1/+1
Impact: remove spurious WARN on legacy SMP percpu allocator Commit f2a8205c4ef1af917d175c36a4097ae5587791c8 incorrectly added too tight WARN_ON_ONCE() on alignments for UP and legacy SMP percpu allocator. Commit e317603694bfd17b28a40de9d65e1a4ec12f816e fixed it for UP but legacy SMP allocator was forgotten. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sachin P. Sant <sachinp@in.ibm.com>
2009-03-10percpu: generalize embedding first chunk setup helperTejun Heo1-0/+86
Impact: code reorganization Separate out embedding first chunk setup helper from x86 embedding first chunk allocator and put it in mm/percpu.c. This will be used by the default percpu first chunk allocator and possibly by other archs. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()Tejun Heo1-14/+14
Impact: cleanup, more flexibility for first chunk init Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto. This restriction stemmed from implementation detail and made things a bit less intuitive. This patch allows @dyn_size to be specified regardless of @unit_size and swaps the positions of @dyn_size and @unit_size so that the parameter order makes more sense (static, reserved and dyn sizes followed by enclosing unit_size). While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10percpu: make x86 addr <-> pcpu ptr conversion macros genericTejun Heo1-1/+15
Impact: generic addr <-> pcpu ptr conversion macros There's nothing arch specific about x86 __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start defined, they'll do the right thing regardless of actual layout. Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c and allow archs to override it as necessary. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07percpu: finer grained locking to break deadlock and allow atomic freeTejun Heo1-37/+124
Impact: fix deadlock and allow atomic free Percpu allocation always uses GFP_KERNEL and whole alloc/free paths were protected by single mutex. All percpu allocations have been from GFP_KERNEL-safe context and the original allocator had this assumption too. However, by protecting both alloc and free paths with the same mutex, the new allocator creates free -> alloc -> GFP_KERNEL dependency which the original allocator didn't have. This can lead to deadlock if free is called from FS or IO paths. Also, in general, allocators are expected to allow free to be called from atomic context. This patch implements finer grained locking to break the deadlock and allow atomic free. For details, please read the "Synchronization rules" comment. While at it, also add CONTEXT: to function comments to describe which context they expect to be called from and what they do to it. This problem was reported by Thomas Gleixner and Peter Zijlstra. http://thread.gmane.org/gmane.linux.kernel/802384 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Peter Zijlstra <peterz@infradead.org>
2009-03-06tracing, Text Edit Lock - Architecture Independent CodeMathieu Desnoyers1-0/+10
This is an architecture independant synchronization around kernel text modifications through use of a global mutex. A mutex has been chosen so that kprobes, the main user of this, can sleep during memory allocation between the memory read of the instructions it must replace and the memory write of the breakpoint. Other user of this interface: immediate values. Paravirt and alternatives are always done when SMP is inactive, so there is no need to use locks. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> LKML-Reference: <49B142D8.7020601@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-07percpu: move fully free chunk reclamation into a workTejun Heo1-10/+38
Impact: code reorganization for later changes Do fully free chunk reclamation using a work. This change is to prepare for locking changes. Signed-off-by: Tejun Heo <tj@kernel.org>