aboutsummaryrefslogtreecommitdiffstats
path: root/mm/memblock.c (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-02-22mm/memblock.c: trivial code refine in memblock_is_region_memory()Wei Yang1-2/+1
memblock_is_region_memory() invoke memblock_search() to see whether the base address is in the memory region. If it fails, idx would be -1. Then, it returns 0. If the memblock_search() returns a valid index, it means the base address is guaranteed to be in the range memblock.memory.regions[idx]. Because of this, it is not necessary to check the base again. This patch removes the check on "base". Link: http://lkml.kernel.org/r/1482363033-24754-2-git-send-email-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: fix some typos in mm/zsmalloc.cXishi Qiu1-3/+3
Delete extra semicolon, and fix some typos. Link: http://lkml.kernel.org/r/586F1823.4050107@huawei.com Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/bootmem.c: cosmetic improvement of code readabilityAdygzhy Ondar1-1/+1
The obvious number of bits in a byte is replaced by BITS_PER_BYTE macro in bootmap_bytes() Link: http://lkml.kernel.org/r/1483781600-5136-1-git-send-email-ondar07@gmail.com Signed-off-by: Adygzhy Ondar <ondar07@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm,compaction: serialize waitqueue_active() checksDavidlohr Bueso1-0/+7
Without a memory barrier, the following race can occur with a high-order allocation: wakeup_kcompactd(order == 1) kcompactd() [L] waitqueue_active(kcompactd_wait) [S] prepare_to_wait_event(kcompactd_wait) [L] (kcompactd_max_order == 0) [S] kcompactd_max_order = order; schedule() Where the waitqueue_active() check is speculatively re-ordered to before setting the actual condition (max_order), not seeing the threads that's going to block; making us miss a wakeup. There are a couple of options to fix this, including calling wq_has_sleepers() which adds a full barrier, or unconditionally doing the wake_up_interruptible() and serialize on the q->lock. However, to make use of the control dependency, we just need to add L->L guarantees. While this bug is theoretical, there have been other offenders of the lockless waitqueue_active() in the past -- this is also documented in the call itself. Link: http://lkml.kernel.org/r/1483975528-24342-1-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: page_alloc: skip over regions of invalid pfns where possiblePaul Burton3-1/+36
When using a sparse memory model memmap_init_zone() when invoked with the MEMMAP_EARLY context will skip over pages which aren't valid - ie. which aren't in a populated region of the sparse memory map. However if the memory map is extremely sparse then it can spend a long time linearly checking each PFN in a large non-populated region of the memory map & skipping it in turn. When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient information to quickly discover the next valid PFN given an invalid one by searching through the list of memory regions & skipping forwards to the first PFN covered by the memory region to the right of the non-populated region. Implement this in order to speed up memmap_init_zone() for systems with extremely sparse memory maps. James said "I have tested this patch on a virtual model of a Samurai CPU with a sparse memory map. The kernel boot time drops from 109 to 62 seconds. " Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com Signed-off-by: Paul Burton <paul.burton@imgtec.com> Tested-by: James Hartley <james.hartley@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, compaction: add vmstats for kcompactd workDavid Rientjes4-3/+24
A "compact_daemon_wake" vmstat exists that represents the number of times kcompactd has woken up. This doesn't represent how much work it actually did, though. It's useful to understand how much compaction work is being done by kcompactd versus other methods such as direct compaction and explicitly triggered per-node (or system) compaction. This adds two new vmstats: "compact_daemon_migrate_scanned" and "compact_daemon_free_scanned" to represent the number of pages kcompactd has scanned as part of its migration scanner and freeing scanner, respectively. These values are still accounted for in the general "compact_migrate_scanned" and "compact_free_scanned" for compatibility. It could be argued that explicitly triggered compaction could also be tracked separately, and that could be added if others find it useful. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/mmzone.c: swap likely to unlikely as code logic is different for next_zones_zonelist()Steven Rostedt1-1/+1
Commit 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") changed how next_zones_zonelist() is called, by adding a static inline function to do the fast path. This function adds: if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx)) return z; return __next_zones_zonelist(z, highest_zoneidx, nodes); Where __next_zones_zonelist() is only called when nodes is not NULL or zonelist_zone_idx(z) is less than highest_zoneidx. The original next_zone_zonelist() was converted to __next_zones_zonelist() but it still maintained: if (likely(nodes == NULL)) Which is now actually a very unlikely, as it is only called with nodes equal to NULL when zonelist_zone_idx(z) is greater than highest_zoneidx. Before this commit, this if had this statistic: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 837895 446078 34 next_zones_zonelist mmzone.c 63 After this commit, it has: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 10 173840 99 __next_zones_zonelist mmzone.c 63 Thus, the if statement is now much more unlikely than it ever was as a likely. Link: http://lkml.kernel.org/r/20170105200102.77989567@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: fix filemap.c kernel-doc warningsRandy Dunlap1-1/+1
Fix kernel-doc warnings in mm/filemap.c: mm/filemap.c:993: warning: No description found for parameter '__page' mm/filemap.c:993: warning: Excess function parameter 'page' description in '__lock_page' Link: http://lkml.kernel.org/r/a66fe492-518c-ad6c-5f03-5e8b721fb451@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: un-export wake_up_page functionsNicholas Piggin2-12/+10
These are no longer used outside mm/filemap.c, so un-export them and make them static where possible. These were exported specifically for NFS use in commit a4796e37c12e ("MM: export page_wakeup functions"). Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22nfs: no PG_private waiters remain, remove wakerNicholas Piggin1-2/+0
Since commit 4f52b6bb8c57 ("NFS: Don't call COMMIT in ->releasepage()"), no tasks wait on PagePrivate. Thus the wake introduced in commit 9590544694be ("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can be removed. Link: http://lkml.kernel.org/r/20170103182234.30141-2-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22trace-vmscan-postprocess: sync with tracepoints updatesMichal Hocko1-13/+13
Both mm_vmscan_lru_shrink_active and mm_vmscan_lru_isolate have changed so the script needs to be update to reflect those changes Link: http://lkml.kernel.org/r/20170105151737.GU21618@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: add mm_vmscan_inactive_list_is_low tracepointMichal Hocko2-9/+54
Currently we have tracepoints for both active and inactive LRU lists reclaim but we do not have any which would tell us why we we decided to age the active list. Without that it is quite hard to diagnose active/inactive lists balancing. Add mm_vmscan_inactive_list_is_low tracepoint to tell us this information. Link: http://lkml.kernel.org/r/20170104101942.4860-8-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepointMichal Hocko2-3/+40
mm_vmscan_lru_shrink_inactive will currently report the number of scanned and reclaimed pages. This doesn't give us an idea how the reclaim went except for the overall effectiveness though. Export and show other counters which will tell us why we couldn't reclaim some pages. - nr_dirty, nr_writeback, nr_congested and nr_immediate tells us how many pages are blocked due to IO - nr_activate tells us how many pages were moved to the active list - nr_ref_keep reports how many pages are kept on the LRU due to references (mostly for the file pages which are about to go for another round through the inactive list) - nr_unmap_fail - how many pages failed to unmap All these are rather low level so they might change in future but the tracepoint is already implementation specific so no tools should be depending on its stability. Link: http://lkml.kernel.org/r/20170104101942.4860-7-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: extract shrink_page_list reclaim counters into a structMichal Hocko1-31/+30
shrink_page_list returns quite some counters back to its caller. Extract the existing 5 into struct reclaim_stat because this makes the code easier to follow and also allows further counters to be returned. While we are at it, make all of them unsigned rather than unsigned long as we do not really need full 64b for them (we never scan more than SWAP_CLUSTER_MAX pages at once). This should reduce some stack space. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170104101942.4860-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepointMichal Hocko3-8/+15
mm_vmscan_lru_isolate currently prints only whether the LRU we isolate from is file or anonymous but we do not know which LRU this is. It is useful to know whether the list is active or inactive, since we are using the same function to isolate pages from both of them and it's hard to distinguish otherwise. Link: http://lkml.kernel.org/r/20170104101942.4860-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolateMichal Hocko2-8/+13
mm_vmscan_lru_isolate shows the number of requested, scanned and taken pages. This is mostly OK but on 32b systems the number of scanned pages is quite misleading because it includes both the scanned and skipped pages. Moreover the skipped part is scaled based on the number of taken pages. Let's report the exact numbers without any additional logic and add the number of skipped pages. This should make the reported data much more easier to interpret. Link: http://lkml.kernel.org/r/20170104101942.4860-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: add active list aging tracepointMichal Hocko2-4/+50
Our reclaim process has several tracepoints to tell us more about how things are progressing. We are, however, missing a tracepoint to track active list aging. Introduce mm_vmscan_lru_shrink_active which reports the number of - nr_taken is number of isolated pages from the active list - nr_referenced pages which tells us that we are hitting referenced pages which are deactivated. If this is a large part of the reported nr_deactivated pages then we might be hitting into the active list too early because they might be still part of the working set. This might help to debug performance issues. - nr_active pages which tells us how many pages are kept on the active list - mostly exec file backed pages. A high number can indicate that we might be trashing on executables. [mhocko@suse.com: update] Link: http://lkml.kernel.org/r/20170104135244.GJ25453@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170104101942.4860-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: remove unused mm_vmscan_memcg_isolateMichal Hocko1-30/+1
Patch series "vm, vmscan: enahance vmscan tracepoints", v2. While debugging [2] I've realized that there is some room for improvements in the tracepoints set we offer currently. I had hard times to make any conclusion from the existing ones. The resulting problem turned out to be active list aging [3] and we are missing at least two tracepoints to debug such a problem. Some existing tracepoints could export more information to see _why_ the reclaim progress cannot be made not only _how much_ we could reclaim. The later could be seen quite reasonably from the vmstat counters already. It can be argued that we are showing too many implementation details in those tracepoints but I consider them way too lowlevel already to be usable by any kernel independent userspace. I would be _really_ surprised if anything but debugging tools have used them. Any feedback is highly appreciated. [1] http://lkml.kernel.org/r/20161228153032.10821-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/20161215225702.GA27944@boerne.fritz.box [3] http://lkml.kernel.org/r/20161223105157.GB23109@dhcp22.suse.cz This patch (of 8): The trace point is not used since 925b7673cce3 ("mm: make per-memcg LRU lists exclusive") so it can be removed. Link: http://lkml.kernel.org/r/20170104101942.4860-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lockAndrea Arcangeli1-31/+15
pmd_trans_unstable does an atomic read on the pmd so it doesn't require the pmd_lock for the same check. This also removes the special assumption that the mmap_sem is hold for writing if prot_numa is not set. userfaultfd will hold the mmap_sem only for reading in change_pte_range like prot_numa, but it will not set prot_numa. This is always a valid micro-optimization regardless of userfaultfd. [kirill@shutemov.name: drop unneeded pmd_trans_unstable(pmd) check after __split_huge_pmd()] Link: http://lkml.kernel.org/r/20170208120421.GE5578@node.shutemov.name Link: http://lkml.kernel.org/r/20161216144821.5183-43-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: selftest: test UFFDIO_ZEROPAGE on all memory typesAndrea Arcangeli1-1/+81
This will verify -EINVAL is returned with hugetlbfs/shmem and it'll do a functional test of UFFDIO_ZEROPAGE on anonymous memory. Link: http://lkml.kernel.org/r/20161216144821.5183-42-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP eventsMike Rapoport1-12/+163
Add test for userfaultfd events used in non-cooperative scenario when the process that monitors the userfaultfd and handles user faults is not the same process that causes the page faults. Link: http://lkml.kernel.org/r/20161216144821.5183-41-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: selftest: add ufd parameter to copy_pageMike Rapoport1-5/+5
With future addition of event tests, copy_page will be called with different userfault file descriptors Link: http://lkml.kernel.org/r/20161216144821.5183-40-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: selftest: introduce userfaultfd_openMike Rapoport1-16/+25
userfaultfd_open will be needed by the non cooperative selftest. Link: http://lkml.kernel.org/r/20161216144821.5183-39-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_SHMEMAndrea Arcangeli1-1/+7
Userland developers asked to be notified immediately by the UFFDIO_API ioctl if shmem missing mode is supported by userfaultfd in the running kernel. This avoids the need to run UFFDIO_REGISTER on a shmem virtual memory range to find out. Link: http://lkml.kernel.org/r/20161216144821.5183-38-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPYAndrea Arcangeli1-10/+13
If the atomic copy_user fails because of a real dangling userland pointer, we won't go back into the shmem method, so when the method returns it must not leave anything charged up, except the page itself. Link: http://lkml.kernel.org/r/20161216144821.5183-37-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: avoid a lockup resulting from corrupted page->flagsAndrea Arcangeli1-2/+1
Use the non atomic version of __SetPageUptodate while the page is still private and not visible to lookup operations. Using the non atomic version after the page is already visible to lookups is unsafe as there would be concurrent lock_page operation modifying the page->flags while it runs. This solves a lockup in find_lock_entry with the userfaultfd_shmem selftest. userfaultfd_shm D14296 691 1 0x00000004 Call Trace: schedule+0x3d/0x90 schedule_timeout+0x228/0x420 io_schedule_timeout+0xa4/0x110 __lock_page+0x12d/0x170 find_lock_entry+0xa4/0x190 shmem_getpage_gfp+0xb9/0xc30 shmem_fault+0x70/0x1c0 __do_fault+0x21/0x150 handle_mm_fault+0xec9/0x1490 __do_page_fault+0x20d/0x520 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x80 async_page_fault+0x25/0x30 Link: http://lkml.kernel.org/r/20170116180408.12184-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: lock the page before adding it to pagecacheAndrea Arcangeli1-0/+5
A VM_BUG_ON triggered on the shmem selftest. Link: http://lkml.kernel.org/r/20161216144821.5183-36-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: add userfaultfd_shmem testMike Rapoport3-2/+50
The test verifies that anonymous shared mapping can be used with userfault using the existing testing method. The shared memory area is allocated using mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...) and released using madvise(MADV_REMOVE) Link: http://lkml.kernel.org/r/20161216144821.5183-35-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappingsMike Kravetz2-18/+82
When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. [mike.kravetz@oracle.com: update] Link: http://lkml.kernel.org/r/c9c8cafe-baa7-05b4-34ea-1dfa5523a85f@oracle.com Link: http://lkml.kernel.org/r/1487195210-12839-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: allow registration of shared memory rangesMike Rapoport3-16/+9
Expand the userfaultfd_register/unregister routines to allow shared memory VMAs. Currently, there is no UFFDIO_ZEROPAGE and write-protection support for shared memory VMAs, which is reflected in ioctl methods supported by uffdio_register. Link: http://lkml.kernel.org/r/20161216144821.5183-34-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: add userfaultfd hook for shared memory faultsMike Rapoport1-7/+15
When processing a page fault in shared memory area for not present page, check the VMA determine if faults are to be handled by userfaultfd. If so, delegate the page fault to handle_userfault. Link: http://lkml.kernel.org/r/20161216144821.5183-33-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memoryMike Rapoport1-13/+21
The shmem_mcopy_atomic_pte implements low lever part of UFFDIO_COPY operation for shared memory VMAs. It's based on mcopy_atomic_pte with adjustments necessary for shared memory pages. Link: http://lkml.kernel.org/r/20161216144821.5183-32-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: add tlbflush.h header for microblazeAndrea Arcangeli1-0/+2
It resolves this build error: All errors (new ones prefixed by >>): mm/shmem.c: In function 'shmem_mcopy_atomic_pte': >> mm/shmem.c:2228:2: error: implicit declaration of function 'update_mmu_cache' [-Werror=implicit-function-declaration] update_mmu_cache(dst_vma, dst_addr, dst_pte); microblaze may have to be also updated to define it in asm/pgtable.h like the other archs, then this header inclusion can be removed. Link: http://lkml.kernel.org/r/20161216144821.5183-31-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: introduce vma_is_shmemMike Rapoport2-0/+15
Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to ensure compatibility of a VMA with userfault. Introduction of vma_is_shmem allows detection if tmpfs backed VMAs, so that they may be used with userfaultfd. Current implementation presumes usage of vma_is_shmem only by slow path routines in userfaultfd, therefore the vma_is_shmem is not made inline to leave the few remaining free bits in vm_flags. Link: http://lkml.kernel.org/r/20161216144821.5183-30-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd supportMike Rapoport2-0/+121
shmem_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for shared memory pages. Link: http://lkml.kernel.org/r/20161216144821.5183-29-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: introduce vma_can_userfaultMike Rapoport1-4/+9
Check whether a VMA can be used with userfault in more compact way Link: http://lkml.kernel.org/r/20161216144821.5183-28-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_HUGETLBFSAndrea Arcangeli1-3/+25
Userland developers asked to be notified immediately by the UFFDIO_API ioctl if hugetlbfs missing mode is supported by userfaultfd in the running kernel. This avoids the need to run UFFDIO_REGISTER on a hugetlbfs virtual memory range to find out. Link: http://lkml.kernel.org/r/20161216144821.5183-27-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlbMike Kravetz1-1/+16
If __mcopy_atomic_hugetlb exits with an error, put_page will be called if a huge page was allocated and needs to be freed. If a reservation was associated with the huge page, the PagePrivate flag will be set. Clear PagePrivate before calling put_page/free_huge_page so that the global reservation count is not incremented. Link: http://lkml.kernel.org/r/20161216144821.5183-26-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRYAndrea Arcangeli3-11/+44
Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features will work on hugetlbfs. This is required for fully functional userfaultfd non-present support on hugetlbfs. Link: http://lkml.kernel.org/r/20161216144821.5183-25-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd rangesMike Kravetz1-2/+49
Add routine userfaultfd_huge_must_wait which has the same functionality as the existing userfaultfd_must_wait routine. Only difference is that new routine must handle page table structure for hugepmd vmas. Link: http://lkml.kernel.org/r/20161216144821.5183-24-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add userfaultfd_hugetlb testMike Kravetz3-17/+161
Test userfaultfd hugetlb functionality by using the existing testing method (in userfaultfd.c). Instead of an anonymous memeory, a hugetlbfs file is mmap'ed private. In this way fallocate hole punch can be used to release pages. This is because madvise(MADV_DONTNEED) is not supported for huge pages. Use the same file, but create wrappers for allocating ranges and releasing pages. Compile userfaultfd.c with HUGETLB_TEST defined to produce an executable to test userfaultfd hugetlb functionality. Link: http://lkml.kernel.org/r/20161216144821.5183-23-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: allow registration of ranges containing huge pagesMike Kravetz2-5/+53
Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB vmas. huge page alignment checking is performed after a VM_HUGETLB vma is encountered. Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not return that as a valid ioctl method for huge page ranges. Link: http://lkml.kernel.org/r/20161216144821.5183-22-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add userfaultfd hugetlb hookMike Kravetz1-0/+33
When processing a hugetlb fault for no page present, check the vma to determine if faults are to be handled via userfaultfd. If so, drop the hugetlb_fault_mutex and call handle_userfault(). Link: http://lkml.kernel.org/r/20161216144821.5183-21-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processingMike Kravetz4-6/+14
The new routine copy_huge_page_from_user() uses kmap_atomic() to map PAGE_SIZE pages. However, this prevents page faults in the subsequent call to copy_from_user(). This is OK in the case where the routine is copied with mmap_sema held. However, in another case we want to allow page faults. So, add a new argument allow_pagefault to indicate if the routine should allow page faults. [dan.carpenter@oracle.com: unmap the correct pointer] Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda [akpm@linux-foundation.org: kunmap() takes a page*, per Hugh] Link: http://lkml.kernel.org/r/20161216144821.5183-20-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPYMike Kravetz1-0/+186
__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge pages. It is based on the existing __mcopy_atomic routine for normal pages. Unlike normal pages, there is no huge page support for the UFFDIO_ZEROPAGE operation. Link: http://lkml.kernel.org/r/20161216144821.5183-19-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd supportMike Kravetz2-0/+88
hugetlb_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for huge pages. Link: http://lkml.kernel.org/r/20161216144821.5183-18-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd supportMike Kravetz2-0/+28
userfaultfd UFFDIO_COPY allows user level code to copy data to a page at fault time. The data is copied from user space to a newly allocated huge page. The new routine copy_huge_page_from_user performs this copy. Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: wake userfaults after UFFDIO_UNREGISTERAndrea Arcangeli1-0/+13
Userfaults may still happen after the userfaultfd monitor thread received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run. Wake any pending userfault within UFFDIO_UNREGISTER protected by the mmap_sem for writing, so they will not be reported to userland leading to UFFDIO_COPY returning -EINVAL (as the range was already unregistered) and they will not hang permanently either. Link: http://lkml.kernel.org/r/20161216144821.5183-16-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: avoid MADV_DONTNEED race conditionAndrea Arcangeli1-1/+1
MADV_DONTNEED must be notified to userland before the pages are zapped. This allows userland to immediately stop adding pages to the userfaultfd ranges before the pages are actually zapped or there could be non-zeropage leftovers as result of concurrent UFFDIO_COPY run in between zap_page_range and madvise_userfault_dontneed (both MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so they can run concurrently). Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED requestPavel Emelyanov4-1/+51
If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>