aboutsummaryrefslogtreecommitdiffstats
path: root/mm/huge_memory.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2015-11-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-9/+6
Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...
2015-11-07Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds1-1/+1
Pull trivial updates from Jiri Kosina: "Trivial stuff from trivial tree that can be trivially summed up as: - treewide drop of spurious unlikely() before IS_ERR() from Viresh Kumar - cosmetic fixes (that don't really affect basic functionality of the driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek - various comment / printk fixes and updates all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: bcache: Really show state of work pending bit hwmon: applesmc: fix comment typos Kconfig: remove comment about scsi_wait_scan module class_find_device: fix reference to argument "match" debugfs: document that debugfs_remove*() accepts NULL and error values net: Drop unlikely before IS_ERR(_OR_NULL) mm: Drop unlikely before IS_ERR(_OR_NULL) fs: Drop unlikely before IS_ERR(_OR_NULL) drivers: net: Drop unlikely before IS_ERR(_OR_NULL) drivers: misc: Drop unlikely before IS_ERR(_OR_NULL) UBI: Update comments to reflect UBI_METAONLY flag pktcdvd: drop null test before destroy functions
2015-11-06mm: make compound_head() robustKirill A. Shutemov1-2/+1
Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06thp: remove unused vma parameter from khugepaged_alloc_pageAaron Tomlin1-5/+3
The "vma" parameter to khugepaged_alloc_page() is unused. It has to remain unused or the drop read lock 'map_sem' optimisation introduce by commit 8b1645685acf ("mm, THP: don't hold mmap_sem in khugepaged when allocating THP") wouldn't be safe. So let's remove it. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: remove MIGRATE_RESERVEMel Gorman1-1/+1
MIGRATE_RESERVE preserves an old property of the buddy allocator that existed prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to remain contiguous until the only alternative was to fail the allocation. At the time it was discovered that high-order atomic allocations relied on this property so MIGRATE_RESERVE was introduced. A later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE and supporting code so it'll be easier to review. Note that this patch in isolation may look like a false regression if someone was bisecting high-order atomic allocation failures. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIMMel Gorman1-1/+1
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: introduce VM_LOCKONFAULTEric B Munson1-1/+1
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-03Merge tag 'arc-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arcLinus Torvalds1-1/+1
Pull ARC updates from Vineet Gupta: - Support for new MM features in ARCv2 cores (THP, PAE40) Some generic THP bits are touched - all ACKed by Kirill - Platform framework updates to prepare for EZChip arrival (still in works) - ARC Public Mailing list setup finally (linux-snps-arc@lists.infraded.org) * tag 'arc-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (42 commits) ARC: mm: PAE40 support ARC: mm: PAE40: tlbex.S: Explicitify the size of pte_t ARC: mm: PAE40: switch to using phys_addr_t for physical addresses ARC: mm: HIGHMEM: populate high memory from DT ARC: mm: HIGHMEM: kmap API implementation ARC: mm: preps ahead of HIGHMEM support #2 ARC: mm: preps ahead of HIGHMEM support ARC: mm: use generic macros _BITUL()/_AC() ARC: mm: Improve Duplicate PD Fault handler MAINTAINERS: Add public mailing list for ARC ARC: Ensure DT mem base is same as what kernel is built with ARC: boot: Non Master cpus only need to call EARLY_CPU_SETUP once ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_smp() ARC: smp: Introduce smp hook @init_irq_cpu called for all cores ARC: smp: Rename platform hook @init_smp -> @init_cpu_smp ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_early_smp() ARC: smp: Introduce smp hook @init_early_smp for Master core ARC: remove @init_time, @init_irq platform callbacks ARC: smp: irqchip: handle IPI as percpu irq like timer ARC: boot: Support Halt-on-reset and Run-on-reset SMP booting modes ...
2015-10-23thp: use is_zero_pfn() only after pte_present() checkMinchan Kim1-1/+2
Use is_zero_pfn() on pteval only after pte_present() check on pteval (It might be better idea to introduce is_zero_pte() which checks pte_present() first). Otherwise when working on a swap or migration entry and if pte_pfn's result is equal to zero_pfn by chance, we lose user's data in __collapse_huge_page_copy(). So if you're unlucky, the application segfaults and finally you could see below message on exit: BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3 Fixes: ca0984caa823 ("mm: incorporate zero pages into transparent huge pages") Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-17mm,thp: introduce flush_pmd_tlb_rangeVineet Gupta1-1/+1
ARCHes with special requirements for evicting THP backing TLB entries can implement this. Otherwise also, it can help optimize TLB flush in THP regime. stock flush_tlb_range() typically has optimization to nuke the entire TLB if flush span is greater than a certain threshhold, which will likely be true for a single huge page. Thus a single thp flush will invalidate the entrire TLB which is not desirable. e.g. see arch/arc: flush_pmd_tlb_range Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20151009100816.GC7873@node Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-09-29mm: Drop unlikely before IS_ERR(_OR_NULL)Viresh Kumar1-1/+1
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there is no need to do that again from its callers. Drop it. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-09-10mm: introduce idle page trackingVladimir Davydov1-2/+10
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: - it does not count unmapped file pages - it affects the reclaimer logic To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus by setting the Idle flag for pages of a particular workload, which can be found e.g. by reading /proc/PID/pagemap, waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the amount of pages that are not used by the workload. The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared. Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: kpageidle requires an MMU] [akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/khugepaged: allow interruption of allocation sleep againPetr Mladek1-2/+6
Commit 1dfb059b9438 ("thp: reduce khugepaged freezing latency") fixed khugepaged to do not block a system suspend. But the result is that it could not get interrupted before the given timeout because the condition for the wait event is "false". This patch puts back the original approach but it uses freezable_schedule_timeout_interruptible() instead of schedule_timeout_interruptible(). It does the right thing. I am pretty sure that the freezable variant was not used in the original fix only because it was not available at that time. The regression has been there for ages. It was not critical. It just did the allocation throttling a little bit more aggressively. I found this problem when converting the kthread to kthread worker API and trying to understand the code. This bug is thought to have minimal userspace-visible impact. Somebody could set a high alloc_sleep value by mistake, and then try to fix it back, but khugepaged would keep sleeping until the high value expires. Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: rename alloc_pages_exact_node() to __alloc_pages_node()Vlastimil Babka1-1/+1
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: make set_recommended_min_free_kbytes() return voidNicholas Krause1-2/+1
This makes set_recommended_min_free_kbytes() have a return type of void as it cannot fail. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: don't use set_huge_zero_page()Kirill A. Shutemov1-1/+1
This is another place where DAX assumed that pgtable_t was a pointer. Open code the important parts of set_huge_zero_page() in DAX and make set_huge_zero_page() static again. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: fix zap_huge_pmd() for DAXKirill A. Shutemov1-40/+31
The original DAX code assumed that pgtable_t was a pointer, which isn't true on all architectures. Restructure the code to not rely on that assumption. [willy@linux.intel.com: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: decrement refcount on huge zero page if it is splitKirill A. Shutemov1-1/+3
The DAX code neglected to put the refcount on the huge zero page. Also we must notify on splits. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: change insert_pfn's return type to voidMatthew Wilcox1-3/+3
It would make more sense to have all the return values from vmf_insert_pfn_pmd() encoded in one place instead of having to follow the convention into insert_pfn(). Suggested by Jeff Moyer. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: add vmf_insert_pfn_pmd()Matthew Wilcox1-0/+43
Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_' prefix instead of 'vm_' prefix is intended to indicate that it returns a VMF_ value rather than an errno (which would only have to be converted into a VMF_ value anyway). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: export various functions for the benefit of DAXMatthew Wilcox1-6/+1
To use the huge zero page in DAX, we need these functions exported. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: prepare for DAX huge pagesMatthew Wilcox1-19/+29
Add a vma_is_dax() helper macro to test whether the VMA is DAX, and use it in zap_huge_pmd() and __split_huge_page_pmd(). [akpm@linux-foundation.org: fix build] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: revert userfaultfd changeAndrew Morton1-1/+4
Undo the change which "userfaultfd: call handle_userfault() for userfaultfd_missing() faults" made to set_huge_zero_page(). DAX will need that return value. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: vma_adjust_trans_huge(): adjust file-backed VMA tooKirill A. Shutemov1-1/+1
This series of patches adds support for using PMD page table entries to map DAX files. We expect NV-DIMMs to start showing up that are many gigabytes in size and the memory consumption of 4kB PTEs will be astronomical. The patch series leverages much of the Transparant Huge Pages infrastructure, going so far as to borrow one of Kirill's patches from his THP page cache series. This patch (of 10): Since we're going to have huge pages in page cache, we need to call adjust file-backed VMA, which potentially can contain huge pages. For now we call it for all VMAs. Probably later we will need to introduce a flag to indicate that the VMA has huge pages. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: propagate the full address in THP faultsAndrea Arcangeli1-4/+6
The THP faults were not propagating the original fault address. The latest version of the API with uffd.arg.pagefault.address is supposed to propagate the full address through THP faults. This was not a kernel crashing bug and it wouldn't risk to corrupt user memory, but it would cause a SIGBUS failure because the wrong page was being copied. For various reasons this wasn't easily reproducible in the qemu workload, but the strestest exposed the problem immediately. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: prevent khugepaged to merge if userfaultfd is armedAndrea Arcangeli1-2/+4
If userfaultfd is armed on a certain vma we can't "fill" the holes with zeroes or we'll break the userland on demand paging. The holes if the userfault is armed, are really missing information (not zeroes) that the userland has to load from network or elsewhere. The same issue happens for wrprotected ptes that we can't just convert into a single writable pmd_trans_huge. We could however in theory still merge across zeropages if only VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be slightly improved but it'd be much more complex code for a tiny corner case. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: call handle_userfault() for userfaultfd_missing() faultsAndrea Arcangeli1-22/+47
This is where the page faults must be modified to call handle_userfault() if userfaultfd_missing() is true (so if the vma->vm_flags had VM_UFFD_MISSING set). handle_userfault() then takes care of blocking the page fault and delivering it to userland. The fault flags must also be passed as parameter so the "read|write" kind of fault can be passed to userland. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*Naoya Horiguchi1-6/+1
The race condition addressed in commit add05cecef80 ("mm: soft-offline: don't free target page in successful page migration") was not closed completely, because that can happen not only for soft-offline, but also for hard-offline. Consider that a slab page is about to be freed into buddy pool, and then an uncorrected memory error hits the page just after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not necessary because the data on the affected page is not consumed. To solve it, this patch drops __PG_HWPOISON from page flag checks at allocation/free time. I think it's justified because __PG_HWPOISON flags is defined to prevent the page from being reused, and setting it outside the page's alloc-free cycle is a designed behavior (not a bug.) For recent months, I was annoyed about BUG_ON when soft-offlined page remains on lru cache list for a while, which is avoided by calling put_page() instead of putback_lru_page() in page migration's success path. This means that this patch reverts a major change from commit add05cecef80 about the new refcounting rule of soft-offlined pages, so "reuse window" revives. This will be closed by a subsequent patch. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24mm: clarify that the function operates on hugepage pteAneesh Kumar K.V1-8/+8
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add _huge_ to pmdp_clear functions so that we are clear that they operate on hugepage pte. We don't bother about other functions like pmdp_set_wrprotect, pmdp_clear_flush_young, because they operate on PTE bits and hence indicate they are operating on hugepage ptes Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24mm/thp: split out pmd collapse flush into separate functionsAneesh Kumar K.V1-1/+1
Architectures like ppc64 [1] need to do special things while clearing pmd before a collapse. For them this operation is largely different from a normal hugepage pte clear. Hence add a separate function to clear pmd before collapse. After this patch pmdp_* functions operate only on hugepage pte, and not on regular pmd_t values pointing to page table. [1] ppc64 needs to invalidate all the normal page pte mappings we already have inserted in the hardware hash page table. But before doing that we need to make sure there are no parallel hash page table insert going on. So we need to do a kick_all_cpus_sync() before flushing the older hash table entries. By moving this to a separate function we capture these details and mention how it is different from a hugepage pte clear. This patch is a cleanup and only does code movement for clarity. There should not be any change in functionality. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24thp: cleanup how khugepaged enters freezerJiri Kosina1-3/+1
khugepaged_do_scan() checks in every iteration whether freezing(current) is true, and in such case breaks out of the loop, which causes try_to_freeze() to be called immediately afterwards in khugepaged_wait_work(). If nothing else, this causes unnecessary freezing(current) test, and also makes the way khugepaged enters freezer a bit less obvious than necessary. Let's just try to freeze directly, instead of splitting it into two (directly adjacent) phases. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15thp: cleanup khugepaged startupKirill A. Shutemov1-10/+8
Few trivial cleanups: - no need to call set_recommended_min_free_kbytes() from late_initcall() -- start_khugepaged() calls it; - no need to call set_recommended_min_free_kbytes() from start_khugepaged() if khugepaged is not started; - there isn't much point in running start_khugepaged() if we've just set transparent_hugepage_flags to zero; - start_khugepaged() is misnamed -- it also used to stop the thread; Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15thp: do not adjust zone water marks if khugepaged is not startedKirill A. Shutemov1-1/+2
set_recommended_min_free_kbytes() adjusts zone water marks to be suitable for khugepaged. We avoid doing this if khugepaged is disabled, but don't catch the case when khugepaged is failed to start. Let's address this by checking khugepaged_thread instead of khugepaged_enabled() in set_recommended_min_free_kbytes(). It's NULL if the kernel thread is stopped or failed to start. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15thp: handle errors in hugepage_init() properlyKirill A. Shutemov1-5/+20
We miss error-handling in few cases hugepage_init(). Let's fix that. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm: remove rest of ACCESS_ONCE() usagesJason Low1-2/+2
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and use the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm, memcg: sync allocation and memcg charge gfp flags for THPMichal Hocko1-22/+20
memcg currently uses hardcoded GFP_TRANSHUGE gfp flags for all THP charges. THP allocations, however, might be using different flags depending on /sys/kernel/mm/transparent_hugepage/{,khugepaged/}defrag and the current allocation context. The primary difference is that defrag configured to "madvise" value will clear __GFP_WAIT flag from the core gfp mask to make the allocation lighter for all mappings which are not backed by VM_HUGEPAGE vmas. If memcg charge path ignores this fact we will get light allocation but the a potential memcg reclaim would kill the whole point of the configuration. Fix the mismatch by providing the same gfp mask used for the allocation to the charge functions. This is quite easy for all paths except for hugepaged kernel thread with !CONFIG_NUMA which is doing a pre-allocation long before the allocated page is used in collapse_huge_page via khugepaged_alloc_page. To prevent from cluttering the whole code path from khugepaged_do_scan we simply return the current flags as per khugepaged_defrag() value which might have changed since the preallocation. If somebody changed the value of the knob we would charge differently but this shouldn't happen often and it is definitely not critical because it would only lead to a reduced success rate of one-off THP promotion. [akpm@linux-foundation.org: fix weird code layout while we're there] [rientjes@google.com: clean up around alloc_hugepage_gfpmask()] Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, thp: really limit transparent hugepage allocation to local nodeDavid Rientjes1-2/+7
Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") restructured alloc_hugepage_vma() with the intent of only allocating transparent hugepages locally when there was not an effective interleave mempolicy. alloc_pages_exact_node() does not limit the allocation to the single node, however, but rather prefers it. This is because __GFP_THISNODE is not set which would cause the node-local nodemask to be passed. Without it, only a nodemask that prefers the local node is passed. Fix this by passing __GFP_THISNODE and falling back to small pages when the allocation fails. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") suffers from a similar problem for khugepaged, which is also fixed. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: incorporate zero pages into transparent huge pagesEbru Akagunduz1-8/+20
This patch improves THP collapse rates, by allowing zero pages. Currently THP can collapse 4kB pages into a THP when there are up to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This patch counts pte none and mapped zero pages with the same variable. The patch was tested with a program that allocates 800MB of memory, and performs interleaved reads and writes, in a pattern that causes some 2MB areas to first see read accesses, resulting in the zero pfn being mapped there. To simulate memory fragmentation at allocation time, I modified do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults. Without the patch, only %50 of the program was collapsed into THP and the percentage did not increase over time. With this patch after 10 minutes of waiting khugepaged had collapsed %99 of the program's memory. [aarcange@redhat.com: fix bogus BUG()] Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename FOLL_MLOCK to FOLL_POPULATEKirill A. Shutemov1-1/+1
After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in get_user_pages only for mlock") FOLL_MLOCK has lost its original meaning: we don't necessarily mlock the page if the flags is set -- we also take VM_LOCKED into consideration. Since we use the same codepath for __mm_populate(), let's rename FOLL_MLOCK to FOLL_POPULATE. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25mm: numa: mark huge PTEs young when clearing NUMA hinting faultsMel Gorman1-0/+1
Base PTEs are marked young when the NUMA hinting information is cleared but the same does not happen for huge pages which this patch addresses. Note that migrated pages are not marked young as the base page migration code does not assume that migrated pages have been referenced. This could be addressed but beyond the scope of this series which is aimed at Dave Chinners shrink workload that is unlikely to be affected by this issue. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25mm: numa: slow PTE scan rate if migration failures occurMel Gorman1-1/+2
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 Across the board the 4.0-rc1 numbers are much slower, and the degradation is far worse when using the large memory footprint configs. Perf points straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config: - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys - default_send_IPI_mask_sequence_phys - 99.99% physflat_send_IPI_mask - 99.37% native_send_call_func_ipi smp_call_function_many - native_flush_tlb_others - 99.85% flush_tlb_page ptep_clear_flush try_to_unmap_one rmap_walk try_to_unmap migrate_pages migrate_misplaced_page - handle_mm_fault - 99.73% __do_page_fault trace_do_page_fault do_async_page_fault + async_page_fault 0.63% native_send_call_func_single_ipi generic_exec_single smp_call_function_single This is showing excessive migration activity even though excessive migrations are meant to get throttled. Normally, the scan rate is tuned on a per-task basis depending on the locality of faults. However, if migrations fail for any reason then the PTE scanner may scan faster if the faults continue to be remote. This means there is higher system CPU overhead and fault trapping at exactly the time we know that migrations cannot happen. This patch tracks when migration failures occur and slows the PTE scanner. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25mm: numa: preserve PTE write permissions across a NUMA hinting faultMel Gorman1-1/+8
Protecting a PTE to trap a NUMA hinting fault clears the writable bit and further faults are needed after trapping a NUMA hinting fault to set the writable bit again. This patch preserves the writable bit when trapping NUMA hinting faults. The impact is obvious from the number of minor faults trapped during the basis balancing benchmark and the system CPU usage; autonumabench 4.0.0-rc4 4.0.0-rc4 baseline preserve Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%) Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%) Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%) Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%) Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%) Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%) Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%) Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%) 4.0.0-rc4 4.0.0-rc4 baseline preserve User 44383.95 43971.89 System 252.61 201.24 Elapsed 998.68 1000.94 Minor Faults 2597249 1981230 Major Faults 365 364 There is a similar drop in system CPU usage using Dave Chinner's xfsrepair workload 4.0.0-rc4 4.0.0-rc4 baseline preserve Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%) Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%) The patch looks hacky but the alternatives looked worse. The tidest was to rewalk the page tables after a hinting fault but it was more complex than this approach and the performance was worse. It's not generally safe to just mark the page writable during the fault if it's a write fault as it may have been read-only for COW so that approach was discarded. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25mm: numa: group related processes based on VMA flags instead of page table flagsMel Gorman1-11/+2
These are three follow-on patches based on the xfsrepair workload Dave Chinner reported was problematic in 4.0-rc1 due to changes in page table management -- https://lkml.org/lkml/2015/3/1/226. Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp: Return the correct value for change_huge_pmd"). It was known that the performance in 3.19 was still better even if is far less safe. This series aims to restore the performance without compromising on safety. For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the three patches applied on top autonumabench 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8 Time System-NUMA01 124.00 ( 0.00%) 161.86 (-30.53%) 107.13 ( 13.60%) 103.13 ( 16.83%) 145.01 (-16.94%) Time System-NUMA01_THEADLOCAL 115.54 ( 0.00%) 107.64 ( 6.84%) 131.87 (-14.13%) 83.30 ( 27.90%) 92.35 ( 20.07%) Time System-NUMA02 9.35 ( 0.00%) 10.44 (-11.66%) 8.95 ( 4.28%) 10.72 (-14.65%) 8.16 ( 12.73%) Time System-NUMA02_SMT 3.87 ( 0.00%) 4.63 (-19.64%) 4.57 (-18.09%) 3.99 ( -3.10%) 3.36 ( 13.18%) Time Elapsed-NUMA01 570.06 ( 0.00%) 567.82 ( 0.39%) 515.78 ( 9.52%) 517.26 ( 9.26%) 543.80 ( 4.61%) Time Elapsed-NUMA01_THEADLOCAL 393.69 ( 0.00%) 384.83 ( 2.25%) 384.10 ( 2.44%) 384.31 ( 2.38%) 380.73 ( 3.29%) Time Elapsed-NUMA02 49.09 ( 0.00%) 49.33 ( -0.49%) 48.86 ( 0.47%) 48.78 ( 0.63%) 50.94 ( -3.77%) Time Elapsed-NUMA02_SMT 47.51 ( 0.00%) 47.15 ( 0.76%) 47.98 ( -0.99%) 48.12 ( -1.28%) 49.56 ( -4.31%) 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 User 46334.60 46391.94 44383.95 43971.89 44372.12 System 252.84 284.66 252.61 201.24 249.00 Elapsed 1062.14 1050.96 998.68 1000.94 1026.78 Overall the system CPU usage is comparable and the test is naturally a bit variable. The slowing of the scanner hurts numa01 but on this machine it is an adverse workload and patches that dramatically help it often hurt absolutely everything else. Due to patch 2, the fault activity is interesting 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 Minor Faults 2097811 2656646 2597249 1981230 1636841 Major Faults 362 450 365 364 365 Note the impact preserving the write bit across protection updates and fault reduces faults. NUMA alloc hit 1229008 1217015 1191660 1178322 1199681 NUMA alloc miss 0 0 0 0 0 NUMA interleave hit 0 0 0 0 0 NUMA alloc local 1228514 1216317 1190871 1177448 1199021 NUMA base PTE updates 245706197 240041607 238195516 244704842 115012800 NUMA huge PMD updates 479530 468448 464868 477573 224487 NUMA page range updates 491225557 479886983 476207932 489222218 229950144 NUMA hint faults 659753 656503 641678 656926 294842 NUMA hint local faults 381604 373963 360478 337585 186249 NUMA hint local percent 57 56 56 51 63 NUMA pages migrated 5412140 6374899 6266530 5277468 5755096 AutoNUMA cost 5121% 5083% 4994% 5097% 2388% Here the impact of slowing the PTE scanner on migratrion failures is obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are massively reduced even though the headline performance is very similar. As xfsrepair was the reported workload here is the impact of the series on it. xfsrepair 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8 Min real-fsmark 1183.29 ( 0.00%) 1165.73 ( 1.48%) 1152.78 ( 2.58%) 1153.64 ( 2.51%) 1177.62 ( 0.48%) Min syst-fsmark 4107.85 ( 0.00%) 4027.75 ( 1.95%) 3986.74 ( 2.95%) 3979.16 ( 3.13%) 4048.76 ( 1.44%) Min real-xfsrepair 441.51 ( 0.00%) 463.96 ( -5.08%) 449.50 ( -1.81%) 440.08 ( 0.32%) 439.87 ( 0.37%) Min syst-xfsrepair 195.76 ( 0.00%) 278.47 (-42.25%) 262.34 (-34.01%) 203.70 ( -4.06%) 143.64 ( 26.62%) Amean real-fsmark 1188.30 ( 0.00%) 1177.34 ( 0.92%) 1157.97 ( 2.55%) 1158.21 ( 2.53%) 1182.22 ( 0.51%) Amean syst-fsmark 4111.37 ( 0.00%) 4055.70 ( 1.35%) 3987.19 ( 3.02%) 3998.72 ( 2.74%) 4061.69 ( 1.21%) Amean real-xfsrepair 450.88 ( 0.00%) 468.32 ( -3.87%) 454.14 ( -0.72%) 442.36 ( 1.89%) 440.59 ( 2.28%) Amean syst-xfsrepair 199.66 ( 0.00%) 290.60 (-45.55%) 277.20 (-38.84%) 204.68 ( -2.51%) 150.55 ( 24.60%) Stddev real-fsmark 4.12 ( 0.00%) 10.82 (-162.29%) 4.14 ( -0.28%) 5.98 (-45.05%) 4.60 (-11.53%) Stddev syst-fsmark 2.63 ( 0.00%) 20.32 (-671.82%) 0.37 ( 85.89%) 16.47 (-525.59%) 15.05 (-471.79%) Stddev real-xfsrepair 6.87 ( 0.00%) 4.55 ( 33.75%) 3.46 ( 49.58%) 1.78 ( 74.12%) 0.52 ( 92.50%) Stddev syst-xfsrepair 3.02 ( 0.00%) 10.30 (-241.37%) 13.17 (-336.37%) 0.71 ( 76.63%) 5.00 (-65.61%) CoeffVar real-fsmark 0.35 ( 0.00%) 0.92 (-164.73%) 0.36 ( -2.91%) 0.52 (-48.82%) 0.39 (-12.10%) CoeffVar syst-fsmark 0.06 ( 0.00%) 0.50 (-682.41%) 0.01 ( 85.45%) 0.41 (-543.22%) 0.37 (-478.78%) CoeffVar real-xfsrepair 1.52 ( 0.00%) 0.97 ( 36.21%) 0.76 ( 49.94%) 0.40 ( 73.62%) 0.12 ( 92.33%) CoeffVar syst-xfsrepair 1.51 ( 0.00%) 3.54 (-134.54%) 4.75 (-214.31%) 0.34 ( 77.20%) 3.32 (-119.63%) Max real-fsmark 1193.39 ( 0.00%) 1191.77 ( 0.14%) 1162.90 ( 2.55%) 1166.66 ( 2.24%) 1188.50 ( 0.41%) Max syst-fsmark 4114.18 ( 0.00%) 4075.45 ( 0.94%) 3987.65 ( 3.08%) 4019.45 ( 2.30%) 4082.80 ( 0.76%) Max real-xfsrepair 457.80 ( 0.00%) 474.60 ( -3.67%) 457.82 ( -0.00%) 444.42 ( 2.92%) 441.03 ( 3.66%) Max syst-xfsrepair 203.11 ( 0.00%) 303.65 (-49.50%) 294.35 (-44.92%) 205.33 ( -1.09%) 155.28 ( 23.55%) The really relevant lines as syst-xfsrepair which is the system CPU usage when running xfsrepair. Note that on my machine the overhead was 45% higher on 4.0-rc4 which may be part of what Dave is seeing. Once we preserve the write bit across faults, it's only 2.51% higher on average. With the full series applied, system CPU usage is 24.6% lower on average. Again, the impact of preserving the write bit on minor faults is obvious and the impact of slowing scanning after migration failures is obvious on the PTE updates. Note also that the number of pages migrated is much reduced even though the headline performance is comparable. 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 Minor Faults 153466827 254507978 249163829 153501373 105737890 Major Faults 610 702 690 649 724 NUMA base PTE updates 217735049 210756527 217729596 216937111 144344993 NUMA huge PMD updates 129294 85044 106921 127246 79887 NUMA pages migrated 21938995 29705270 28594162 22687324 16258075 3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 vanilla vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8 Mean sdb-avgqusz 13.47 2.54 2.55 2.47 2.49 Mean sdb-avgrqsz 202.32 140.22 139.50 139.02 138.12 Mean sdb-await 25.92 5.09 5.33 5.02 5.22 Mean sdb-r_await 4.71 0.19 0.83 0.51 0.11 Mean sdb-w_await 104.13 5.21 5.38 5.05 5.32 Mean sdb-svctm 0.59 0.13 0.14 0.13 0.14 Mean sdb-rrqm 0.16 0.00 0.00 0.00 0.00 Mean sdb-wrqm 3.59 1799.43 1826.84 1812.21 1785.67 Max sdb-avgqusz 111.06 12.13 14.05 11.66 15.60 Max sdb-avgrqsz 255.60 190.34 190.01 187.33 191.78 Max sdb-await 168.24 39.28 49.22 44.64 65.62 Max sdb-r_await 660.00 52.00 280.00 76.00 12.00 Max sdb-w_await 7804.00 39.28 49.22 44.64 65.62 Max sdb-svctm 4.00 2.82 2.86 1.98 2.84 Max sdb-rrqm 8.30 0.00 0.00 0.00 0.00 Max sdb-wrqm 34.20 5372.80 5278.60 5386.60 5546.15 FWIW, I also checked SPECjbb in different configurations but it's similar observations -- minor faults lower, PTE update activity lower and performance is roughly comparable against 3.19. This patch (of 3): Threads that share writable data within pages are grouped together as related tasks. This decision is based on whether the PTE is marked dirty which is subject to timing races between the PTE scanner update and when the application writes the page. If the page is file-backed, then background flushes and sync also affect placement. This is unpredictable behaviour which is impossible to reason about so this patch makes grouping decisions based on the VMA flags. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12mm: thp: Return the correct value for change_huge_pmdMel Gorman1-2/+2
The wrong value is being returned by change_huge_pmd since commit 10c1045f28e8 ("mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries") which allows a fallthrough that tries to adjust non-existent PTEs. This patch corrects it. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12mm: fix up numa read-only thread grouping logicLinus Torvalds1-1/+6
Dave Chinner reported that commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations") slowed down his xfsrepair test enormously. In particular, it was using more system time due to extra TLB flushing. The ultimate reason turns out to be how the change to use the regular page table accessor functions broke the NUMA grouping logic. The old special mknuma/mknonnuma code accessed the page table present bit and the magic NUMA bit directly, while the new code just changes the page protections using PROT_NONE and the regular vma protections. That sounds equivalent, and from a fault standpoint it really is, but a subtle side effect is that the *other* protection bits of the page table entries also change. And the code to decide how to group the NUMA entries together used the writable bit to decide whether a particular page was likely to be shared read-only or not. And with the change to make the NUMA handling use the regular permission setting functions, that writable bit was basically always cleared for private mappings due to COW. So even if the page actually ends up being written to in the end, the NUMA balancing would act as if it was always shared RO. This code is a heuristic anyway, so the fix - at least for now - is to instead check whether the page is dirty rather than writable. The bit doesn't change with protection changes. NOTE! This also adds a FIXME comment to revisit this issue, Not only should we probably re-visit the whole "is this a shared read-only page" heuristic (we might want to take the vma permissions into account and base this more on those than the per-page ones, and also look at whether the particular access that triggers it is a write or not), but the whole COW issue shows that we should think about the NUMA fault handling some more. For example, maybe we should do the early-COW thing that a regular fault does. Or maybe we should accept that while using the same bits as PROTNONE was a good thing (and got rid of the specual NUMA bit), we might still want to just preseve the other protection bits across NUMA faulting. Those are bigger questions, left for later. This just fixes up the heuristic so that it at least approximates working again. More analysis and work needed. Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Mel Gorman <mgorman@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org>, Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entriesMel Gorman1-6/+8
If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the avoidhead where possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: numa: add paranoid check around pte_protnone_numaMel Gorman1-0/+3
pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: numa: do not trap faults on the huge zero pageMel Gorman1-1/+12
Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: convert p[te|md]_mknonnuma and remaining page table manipulationsMel Gorman1-26/+7
With PROT_NONE, the traditional page table manipulation functions are sufficient. [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()] [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: convert p[te|md]_numa users to p[te|md]_protnone_numaMel Gorman1-8/+8
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>