aboutsummaryrefslogtreecommitdiffstats
path: root/include (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-14mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never failDavid Hildenbrand1-4/+4
All callers of arch_remove_memory() ignore errors. And we should really try to remove any errors from the memory removal path. No more errors are reported from __remove_pages(). BUG() in s390x code in case arch_remove_memory() is triggered. We may implement that properly later. WARN in case powerpc code failed to remove the section mapping, which is better than ignoring the error completely right now. Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Stefan Agner <stefan@agner.ch> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Rob Herring <robh@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Mathieu Malaterre <malat@debian.org> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <mike.travis@hpe.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/memory_hotplug: make unregister_memory_section() never failDavid Hildenbrand1-1/+1
Failing while removing memory is mostly ignored and cannot really be handled. Let's treat errors in unregister_memory_section() in a nice way, warning, but continuing. Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Mike Travis <mike.travis@hpe.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Qian Cai <cai@lca.pw> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh@kernel.org> Cc: Stefan Agner <stefan@agner.ch> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, memory_hotplug: provide a more generic restrictions for memory hotplugMichal Hocko1-7/+24
arch_add_memory, __add_pages take a want_memblock which controls whether the newly added memory should get the sysfs memblock user API (e.g. ZONE_DEVICE users do not want/need this interface). Some callers even want to control where do we allocate the memmap from by configuring altmap. Add a more generic hotplug context for arch_add_memory and __add_pages. struct mhp_restrictions contains flags which contains additional features to be enabled by the memory hotplug (MHP_MEMBLOCK_API currently) and altmap for alternative memmap allocator. This patch shouldn't introduce any functional change. [akpm@linux-foundation.org: build fix] Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, memory_hotplug: cleanup memory offline pathMichal Hocko1-1/+2
check_pages_isolated_cb currently accounts the whole pfn range as being offlined if test_pages_isolated suceeds on the range. This is based on the assumption that all pages in the range are freed which is currently the case in most cases but it won't be with later changes, as pages marked as vmemmap won't be isolated. Move the offlined pages counting to offline_isolated_pages_cb and rely on __offline_isolated_pages to return the correct value. check_pages_isolated_cb will still do it's primary job and check the pfn range. While we are at it remove check_pages_isolated and offline_isolated_pages and use directly walk_system_ram_range as do in online_pages. Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sectionsAlexander Duyck1-0/+16
Add yet another iterator, for_each_free_mem_range_in_zone_from, and then use it to support initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache locality of the pages while we do several loops over them in the init and freeing process. We are able to tighten the loops further as a result of the "from" iterator as we can perform the initial checks for first_init_pfn in our first call to the iterator, and continue without the need for those checks via the "from" iterator. I have added this functionality in the function called deferred_init_mem_pfn_range_in_zone that primes the iterator and causes us to exit if we encounter any failure. On my x86_64 test system with 384GB of memory per node I saw a reduction in initialization time from 1.85s to 1.38s as a result of this patch. Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <yi.z.zhang@linux.intel.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David S. Miller <davem@davemloft.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: implement new zone specific memblock iteratorAlexander Duyck1-0/+25
Introduce a new iterator for_each_free_mem_pfn_range_in_zone. This iterator will take care of making sure a given memory range provided is in fact contained within a zone. It takes are of all the bounds checking we were doing in deferred_grow_zone, and deferred_init_memmap. In addition it should help to speed up the search a bit by iterating until the end of a range is greater than the start of the zone pfn range, and will exit completely if the start is beyond the end of the zone. Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <yi.z.zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: use mm_zero_struct_page from SPARC on all 64b architecturesAlexander Duyck1-3/+38
Patch series "Deferred page init improvements", v7. This patchset is essentially a refactor of the page initialization logic that is meant to provide for better code reuse while providing a significant improvement in deferred page initialization performance. In my testing on an x86_64 system with 384GB of RAM I have seen the following. In the case of regular memory initialization the deferred init time was decreased from 3.75s to 1.38s on average. This amounts to a 172% improvement for the deferred memory initialization performance. I have called out the improvement observed with each patch. This patch (of 4): Use the same approach that was already in use on Sparc on all the architectures that support a 64b long. This is mostly motivated by the fact that 7 to 10 store/move instructions are likely always going to be faster than having to call into a function that is not specialized for handling page init. An added advantage to doing it this way is that the compiler can get away with combining writes in the __init_single_page call. As a result the memset call will be reduced to only about 4 write operations, or at least that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and count/mapcount seem to be cancelling out at least 4 of the 8 assignments on my system. One change I had to make to the function was to reduce the minimum page size to 56 to support some powerpc64 configurations. This change should introduce no change on SPARC since it already had this code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when initializing 384GB of RAM per node. Pavel Tatashin tested on a system with Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page() takes 19.30ns / 64-byte struct page before this patch and with this patch it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on a OpenPower (S812LC 8348-21C) with Power8 processor and 128GB or RAM. His results per 64-byte struct page were 4.68ns before, and 4.59ns after this patch. Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <yi.z.zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helperJérôme Glisse1-0/+4
Helper to test if a range is updated to read only (it is still valid to read from the range). This is useful for device driver or anyone who wish to optimize out update when they know that they already have the range map read only. Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: pass down vma and reasons why mmu notifier is happeningJérôme Glisse1-1/+5
CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening This patch is just passing down the new informations by adding it to the mmu_notifier_range structure. Link: http://lkml.kernel.org/r/20190326164747.24405-8-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: contextual information for event triggering invalidationJérôme Glisse1-1/+4
CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: contextual information for event enumsJérôme Glisse1-0/+30
CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). This patch introduce a set of enums that can be associated with each of the events triggering a mmu notifier. Latter patches take advantages of those enum values. - UNMAP: munmap() or mremap() - CLEAR: page table is cleared (migration, compaction, reclaim, ...) - PROTECTION_VMA: change in access protections for the range - PROTECTION_PAGE: change in access protections for page in the range - SOFT_DIRTY: soft dirtyness tracking Being able to identify munmap() and mremap() from other reasons why the page table is cleared is important to allow user of mmu notifier to update their own internal tracking structure accordingly (on munmap or mremap it is not longer needed to track range of virtual address as it becomes invalid). Link: http://lkml.kernel.org/r/20190326164747.24405-5-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: convert mmu_notifier_range->blockable to a flagsJérôme Glisse1-4/+7
Use an unsigned field for flags other than blockable and convert the blockable field to be one of those flags. Link: http://lkml.kernel.org/r/20190326164747.24405-4-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/mmu_notifier: helper to test if a range invalidation is blockableJérôme Glisse1-0/+11
Patch series "mmu notifier provide context informations", v6. Here I am not posting users of this, they already have been posted to appropriate mailing list [6] and will be merge through the appropriate tree once this patchset is upstream. Note that this serie does not change any behavior for any existing code. It just pass down more information to mmu notifier listener. The rationale for this patchset: CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). This patchset introduce a set of enums that can be associated with each of the events triggering a mmu notifier: - UNMAP: munmap() or mremap() - CLEAR: page table is cleared (migration, compaction, reclaim, ...) - PROTECTION_VMA: change in access protections for the range - PROTECTION_PAGE: change in access protections for page in the range - SOFT_DIRTY: soft dirtyness tracking Being able to identify munmap() and mremap() from other reasons why the page table is cleared is important to allow user of mmu notifier to update their own internal tracking structure accordingly (on munmap or mremap it is not longer needed to track range of virtual address as it becomes invalid). Without this serie, driver are force to assume that every notification is an munmap which triggers useless trashing within drivers that associate structure with range of virtual address. Each driver is force to free up its tracking structure and then restore it on next device page fault. With this series we can also optimize device page table update. Patches to use this are at https://lkml.org/lkml/2019/1/23/833 https://lkml.org/lkml/2019/1/23/834 https://lkml.org/lkml/2019/1/23/832 https://lkml.org/lkml/2019/1/23/831 Moreover this can also be used to optimize out some page table updates such as for KVM where we can update the secondary MMU directly from the callback instead of clearing it. ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395 ACKS RDMA https://lkml.org/lkml/2018/12/6/1473 This patch (of 8): Simple helpers to test if range invalidation is blockable. Latter patches use cocinnelle to convert all direct dereference of range-> blockable to use this function instead so that we can convert the blockable field to an unsigned for more flags. Link: http://lkml.kernel.org/r/20190326164747.24405-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: convert various hmm_pfn_* to device_entry which is a better nameJérôme Glisse1-29/+64
Convert hmm_pfn_* to device_entry_* as here we are dealing with device driver specific entry format and hmm provide helpers to allow differents components (including HMM) to create/parse device entry. We keep wrapper with the old name so that we can convert driver to use the new API in stages in each device driver tree. This will get remove once all driver are converted. Link: http://lkml.kernel.org/r/20190403193318.16478-13-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: add a helper function that fault pages and map them to a deviceJérôme Glisse1-0/+9
This is a all in one helper that fault pages in a range and map them to a device so that every single device driver do not have to re-implement this common pattern. This is taken from ODP RDMA in preparation of ODP RDMA convertion. It will be use by nouveau and other drivers. [jglisse@redhat.com: Was using wrong field and wrong enum] Link: http://lkml.kernel.org/r/20190409175340.26614-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20190403193318.16478-12-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: add helpers to test if mm is still alive or notJérôme Glisse1-0/+24
The device driver can have kernel thread or worker doing work against a process mm and it is useful for those to test wether the mm is dead or alive to avoid doing useless work. Add an helper to test that so that driver can bail out early if a process is dying. Note that the helper does not perform any lock synchronization and thus is just a hint ie a process might be dying but the helper might still return the process as alive. All HMM functions are safe to use in that case as HMM internal properly protect itself with lock. If driver use this helper with non HMM functions it should ascertain that it is safe to do so. Link: http://lkml.kernel.org/r/20190403193318.16478-11-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping)Jérôme Glisse1-2/+25
HMM mirror is a device driver helpers to mirror range of virtual address. It means that the process jobs running on the device can access the same virtual address as the CPU threads of that process. This patch adds support for hugetlbfs mapping (ie range of virtual address that are mmap of a hugetlbfs). [rcampbell@nvidia.com: fix initial PFN for hugetlbfs pages] Link: http://lkml.kernel.org/r/20190419233536.8080-1-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20190403193318.16478-9-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: add default fault flags to avoid the need to pre-fill pfns arraysJérôme Glisse1-0/+13
The HMM mirror API can be use in two fashions. The first one where the HMM user coalesce multiple page faults into one request and set flags per pfns for of those faults. The second one where the HMM user want to pre-fault a range with specific flags. For the latter one it is a waste to have the user pre-fill the pfn arrays with a default flags value. This patch adds a default flags value allowing user to set them for a range without having to pre-fill the pfn array. Link: http://lkml.kernel.org/r/20190403193318.16478-8-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve driver API to work and wait over a rangeJérôme Glisse1-36/+109
A common use case for HMM mirror is user trying to mirror a range and before they could program the hardware it get invalidated by some core mm event. Instead of having user re-try right away to mirror the range provide a completion mechanism for them to wait for any active invalidation affecting the range. This also changes how hmm_range_snapshot() and hmm_range_fault() works by not relying on vma so that we can drop the mmap_sem when waiting and lookup the vma again on retry. Link: http://lkml.kernel.org/r/20190403193318.16478-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()Jérôme Glisse1-1/+12
Minor optimization around hmm_pte_need_fault(). Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()Jérôme Glisse1-2/+2
Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-5-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: use reference counting for HMM structJérôme Glisse1-0/+2
Every time I read the code to check that the HMM structure does not vanish before it should thanks to the many lock protecting its removal i get a headache. Switch to reference counting instead it is much easier to follow and harder to break. This also remove some code that is no longer needed with refcounting. Link: http://lkml.kernel.org/r/20190403193318.16478-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14hugetlb: use same fault hash key for shared and private mappingsMike Kravetz1-3/+1
hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit 8382d914ebf7 ("mm, hugetlb: improve page-fault scalability"). Since every hugetlb mapping (even anon and private) is actually a file mapping, just use the address_space index key for all mappings. This results in potentially more hash collisions. However, this should not be the common case. Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5 Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14include/linux/balloon_compaction.h: drop unused function stubsDavid Hildenbrand1-15/+0
These are leftovers from the pre-"general non-lru movable page" era. Link: http://lkml.kernel.org/r/20190329122649.28404-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin templateYafang Shao1-16/+10
There are three tracepoints using this template, which are mm_vmscan_direct_reclaim_begin, mm_vmscan_memcg_reclaim_begin, mm_vmscan_memcg_softlimit_reclaim_begin. Regarding mm_vmscan_direct_reclaim_begin, sc.may_writepage is !laptop_mode, that's a static setting, and reclaim_idx is derived from gfp_mask which is already show in this tracepoint. Regarding mm_vmscan_memcg_reclaim_begin, may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1), which are both static value. mm_vmscan_memcg_softlimit_reclaim_begin is the same with mm_vmscan_memcg_reclaim_begin. So we can drop them all. Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: introduce put_user_page*(), placeholder versionsJohn Hubbard1-0/+24
A discussion of the overall problem is below. As mentioned in patch 0001, the steps are to fix the problem are: 1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. Overview ======== Some kernel components (file systems, device drivers) need to access memory that is specified via process virtual address. For a long time, the API to achieve that was get_user_pages ("GUP") and its variations. However, GUP has critical limitations that have been overlooked; in particular, GUP does not interact correctly with filesystems in all situations. That means that file-backed memory + GUP is a recipe for potential problems, some of which have already occurred in the field. GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code to get the struct page behind a virtual address and to let storage hardware perform a direct copy to or from that page. This is a short-lived access pattern, and as such, the window for a concurrent writeback of GUP'd page was small enough that there were not (we think) any reported problems. Also, userspace was expected to understand and accept that Direct IO was not synchronized with memory-mapped access to that data, nor with any process address space changes such as munmap(), mremap(), etc. Over the years, more GUP uses have appeared (virtualization, device drivers, RDMA) that can keep the pages they get via GUP for a long period of time (seconds, minutes, hours, days, ...). This long-term pinning makes an underlying design problem more obvious. In fact, there are a number of key problems inherent to GUP: Interactions with file systems ============================== File systems expect to be able to write back data, both to reclaim pages, and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain write access to the file memory pages means that such hardware can dirty the pages, without the filesystem being aware. This can, in some cases (depending on filesystem, filesystem options, block device, block device options, and other variables), lead to data corruption, and also to kernel bugs of the form: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. Long term GUP ============= Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a writeable mapping is created), and the pages are file-backed. That can lead to filesystem corruption. What happens is that when a file-backed page is being written back, it is first mapped read-only in all of the CPU page tables; the file system then assumes that nobody can write to the page, and that the page content is therefore stable. Unfortunately, the GUP callers generally do not monitor changes to the CPU pages tables; they instead assume that the following pattern is safe (it's not): get_user_pages() Hardware can keep a reference to those pages for a very long time, and write to it at any time. Because "hardware" here means "devices that are not a CPU", this activity occurs without any interaction with the kernel's file system code. for each page set_page_dirty put_page() In fact, the GUP documentation even recommends that pattern. Anyway, the file system assumes that the page is stable (nothing is writing to the page), and that is a problem: stable page content is necessary for many filesystem actions during writeback, such as checksum, encryption, RAID striping, etc. Furthermore, filesystem features like COW (copy on write) or snapshot also rely on being able to use a new page for as memory for that memory range inside the file. Corruption during write back is clearly possible here. To solve that, one idea is to identify pages that have active GUP, so that we can use a bounce page to write stable data to the filesystem. The filesystem would work on the bounce page, while any of the active GUP might write to the original page. This would avoid the stable page violation problem, but note that it is only part of the overall solution, because other problems remain. Other filesystem features that need to replace the page with a new one can be inhibited for pages that are GUP-pinned. This will, however, alter and limit some of those filesystem features. The only fix for that would be to require GUP users to monitor and respond to CPU page table updates. Subsystems such as ODP and HMM do this, for example. This aspect of the problem is still under discussion. Direct IO ========= Direct IO can cause corruption, if userspace does Direct-IO that writes to a range of virtual addresses that are mmap'd to a file. The pages written to are file-backed pages that can be under write back, while the Direct IO is taking place. Here, Direct IO races with a write back: it calls GUP before page_mkclean() has replaced the CPU pte with a read-only entry. The race window is pretty small, which is probably why years have gone by before we noticed this problem: Direct IO is generally very quick, and tends to finish up before the filesystem gets around to do anything with the page contents. However, it's still a real problem. The solution is to never let GUP return pages that are under write back, but instead, force GUP to take a write fault on those pages. That way, GUP will properly synchronize with the active write back. This does not change the required GUP behavior, it just avoids that race. Details ======= Introduces put_user_page(), which simply calls put_page(). This provides a way to update all get_user_pages*() callers, so that they call put_user_page(), instead of put_page(). Also introduces put_user_pages(), and a few dirty/locked variations, as a replacement for release_pages(), and also as a replacement for open-coded loops that release multiple pages. These may be used for subsequent performance improvements, via batching of pages to be released. This is the first step of fixing a problem (also described in [1] and [2]) with interactions between get_user_pages ("gup") and filesystems. Problem description: let's start with a bug report. Below, is what happens sometimes, under memory pressure, when a driver pins some pages via gup, and then marks those pages dirty, and releases them. Note that the gup documentation actually recommends that pattern. The problem is that the filesystem may do a writeback while the pages were gup-pinned, and then the filesystem believes that the pages are clean. So, when the driver later marks the pages as dirty, that conflicts with the filesystem's page tracking and results in a BUG(), like this one that I experienced: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. The steps are to fix it are: 1) (This patch): provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()" [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs] Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14hugetlb: allow to free gigantic pages regardless of the configurationAlexandre Ghiti2-1/+8
On systems without CONTIG_ALLOC activated but that support gigantic pages, boottime reserved gigantic pages can not be freed at all. This patch simply enables the possibility to hand back those pages to memory allocator. Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: David S. Miller <davem@davemloft.net> [sparc] Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOCAlexandre Ghiti1-1/+1
This condition allows to define alloc_contig_range, so simplify it into a more accurate naming. Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() APIJohannes Weiner1-5/+0
Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks, group them together. Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()Johannes Weiner1-10/+0
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around lruvec_page_state() that takes bitmasks of lru indexes and aggregates the counts for those. Replace callsites where the bitmask is simple enough with direct lruvec_page_state() calls. This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so make that function private again, too. Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: replace zone summing with lruvec_page_state()Johannes Weiner1-18/+0
Instead of adding up the zone counters, use lruvec_page_state() to get the node state directly. This is a bit cheaper and more stream-lined. Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: track LRU counts in the vmstats arrayJohannes Weiner1-1/+1
Patch series "mm: memcontrol: clean up the LRU counts tracking". The memcg LRU stats usage is currently a bit messy. Memcg has private per-zone counters because reclaim needs zone granularity sometimes, but we also have plenty of users that need to awkwardly sum them up to node or memcg granularity. Meanwhile the canonical per-memcg vmstats do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect. This series enables LRU count tracking in the per-memcg vmstats array such that lruvec_page_state() and memcg_page_state() work on the enum node_stat_item items for the LRU counters. Then it converts all the callers that don't specifically need per-zone numbers over to that. This patch (of 6): The memcg code currently maintains private per-zone breakdowns of the LRU counters. This is necessary for reclaim decisions which are still zone-based, but there are a variety of users of these counters that only want the aggregate per-lruvec or per-memcg LRU counts, and they need to painfully sum up the zone counters on each request for that. These would be better served using the memcg vmstats arrays, which track VM statistics at the desired scope already. They just don't have the LRU counts right now. So to kick off the conversion, begin tracking LRU counts in those. Link: http://lkml.kernel.org/r/20190228163020.24100-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: add tracepoints for node reclaimYafang Shao1-0/+32
The page alloc fast path it may perform node reclaim, which may cause a latency spike. We should add tracepoint for this event, and also measure the latency it causes. So bellow two tracepoints are introduced, mm_vmscan_node_reclaim_begin mm_vmscan_node_reclaim_end Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <shaoyafang@didiglobal.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, compaction: some tracepoints should be defined only when CONFIG_COMPACTION is setYafang Shao1-4/+2
Only mm_compaction_isolate_{free, migrate}pages may be used when CONFIG_COMPACTION is not set. All others are used only when CONFIG_COMPACTION is set. After this change, if CONFIG_COMPACTION is not set, the tracepoints that only work when CONFIG_COMPACTION is set will not be exposed to userspace. Without this change, they will always be exposed in debugfs whether CONFIG_COMPACTION is set or not. This is an improvement. Link: http://lkml.kernel.org/r/1552440403-11780-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: compaction: show gfp flag names in try_to_compact_pages tracepointYafang Shao1-2/+2
Showing the gfp flag names instead of the gfp_mask makes trace more convenient. Link: http://lkml.kernel.org/r/1552527998-13162-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny1-2/+2
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERMIra Weiny1-13/+28
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move nr_deactivate accounting to shrink_active_list()Kirill Tkhai1-0/+6
We know which LRU is not active. [chris@chrisdown.name: fix build on !CONFIG_MEMCG] Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move recent_rotated pages calculation to shrink_inactive_list()Kirill Tkhai2-6/+9
Patch series "mm: Generalize putback functions"] putback_inactive_pages() and move_active_pages_to_lru() are almost similar, so this patchset merges them ina single function. This patch (of 4): The patch moves the calculation from putback_inactive_pages() to shrink_inactive_list(). This makes putback_inactive_pages() looking more similar to move_active_pages_to_lru(). To do that, we account activated pages in reclaim_stat::nr_activate. Since a page may change its LRU type from anon to file cache inside shrink_page_list() (see ClearPageSwapBacked()), we have to account pages for the both types. So, nr_activate becomes an array. Previously we used nr_activate to account PGACTIVATE events, but now we account them into pgactivate variable (since they are about number of pages in general, not about sum of hpage_nr_pages). Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: page cache: store only head pages in i_pagesMatthew Wilcox1-0/+13
Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ [willy@infradead.org: fix swapcache pages] Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org [kirill@shutemov.name: hugetlb stores pages in page cache differently] Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1 Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-and-tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Hugh Dickins <hughd@google.com> Cc: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14userfaultfd/sysctl: add vm.unprivileged_userfaultfdPeter Xu1-0/+2
Userfaultfd can be misued to make it easier to exploit existing use-after-free (and similar) bugs that might otherwise only make a short window or race condition available. By using userfaultfd to stall a kernel thread, a malicious program can keep some state that it wrote, stable for an extended period, which it can then access using an existing exploit. While it doesn't cause the exploit itself, and while it's not the only thing that can stall a kernel thread when accessing a memory location, it's one of the few that never needs privilege. We can add a flag, allowing userfaultfd to be restricted, so that in general it won't be useable by arbitrary user programs, but in environments that require userfaultfd it can be turned back on. Add a global sysctl knob "vm.unprivileged_userfaultfd" to control whether userfaultfd is allowed by unprivileged users. When this is set to zero, only privileged users (root user, or users with the CAP_SYS_PTRACE capability) will be able to use the userfaultfd syscalls. Andrea said: : The only difference between the bpf sysctl and the userfaultfd sysctl : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement, : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE : already if it's doing other kind of tracking on processes runtime, in : addition of userfaultfd. In other words both syscalls works only for : root, when the two sysctl are opt-in set to 1. [dgilbert@redhat.com: changelog additions] [akpm@linux-foundation.org: documentation tweak, per Mike] Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Martin Cracauer <cracauer@cons.org> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14include/trace/events/vmscan.h: drop zone id from kswapd tracepointsYafang Shao1-3/+4
It is not clear how the zone id is useful in kswapd tracepoints and the id itself is not really easy to process because it depends on the configuration (available zones). Let's drop the id for now. If somebody really needs that information then the zone name should be used instead. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/1552451813-10833-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: remove stale comment from page structTobin C. Harding1-1/+1
We now use the slab_list list_head instead of the lru list_head. This comment has become stale. Remove stale comment from page struct slab_list list_head. Link: http://lkml.kernel.org/r/20190402230545.2929-8-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14list: add function list_rotate_to_front()Tobin C. Harding1-0/+18
Patch series "mm: Use slab_list list_head instead of lru", v5. Currently the slab allocators (ab)use the struct page 'lru' list_head. We have a list head for slab allocators to use, 'slab_list'. During v2 it was noted by Christoph that the SLOB allocator was reaching into a list_head, this version adds 2 patches to the front of the set to fix that. Clean up all three allocators by using the 'slab_list' list_head instead of overloading the 'lru' list_head. This patch (of 7): Currently if we wish to rotate a list until a specific item is at the front of the list we can call list_move_tail(head, list). Note that the arguments are the reverse way to the usual use of list_move_tail(list, head). This is a hack, it depends on the developer knowing how the list_head operates internally which violates the layer of abstraction offered by the list_head. Also, it is not intuitive so the next developer to come along must study list.h in order to fully understand what is meant by the call, while this is 'good for' the developer it makes reading the code harder. We should have an function appropriately named that does this if there are users for it intree. By grep'ing the tree for list_move_tail() and list_tail() and attempting to guess the argument order from the names it seems there is only one place currently in the tree that does this - the slob allocatator. Add function list_rotate_to_front() to rotate a list until the specified item is at the front of the list. Link: http://lkml.kernel.org/r/20190402230545.2929-2-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addressesDan Williams1-4/+2
Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls pmdp_set_access_flags(). That helper enforces a pmd aligned @address argument via VM_BUG_ON() assertion. Update the implementation to take a 'struct vm_fault' argument directly and apply the address alignment fixup internally to fix crash signatures like: kernel BUG at arch/x86/mm/pgtable.c:515! invalid opcode: 0000 [#1] SMP NOPTI CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1 [..] RIP: 0010:pmdp_set_access_flags+0x48/0x50 [..] Call Trace: vmf_insert_pfn_pmd+0x198/0x350 dax_iomap_fault+0xe82/0x1190 ext4_dax_huge_fault+0x103/0x1f0 ? __switch_to_asm+0x40/0x70 __handle_mm_fault+0x3f6/0x1370 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 handle_mm_fault+0xda/0x200 __do_page_fault+0x249/0x4f0 do_page_fault+0x32/0x110 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Piotr Balcer <piotr.balcer@intel.com> Tested-by: Yan Ma <yan.ma@intel.com> Tested-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Chandan Rajendra <chandan@linux.ibm.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-13Merge tag 'iommu-updates-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommuLinus Torvalds5-24/+194
Pull IOMMU updates from Joerg Roedel: - ATS support for ARM-SMMU-v3. - AUX domain support in the IOMMU-API and the Intel VT-d driver. This adds support for multiple DMA address spaces per (PCI-)device. The use-case is to multiplex devices between host and KVM guests in a more flexible way than supported by SR-IOV. - the rest are smaller cleanups and fixes, two of which needed to be reverted after testing in linux-next. * tag 'iommu-updates-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (45 commits) Revert "iommu/amd: Flush not present cache in iommu_map_page" Revert "iommu/amd: Remove the leftover of bypass support" iommu/vt-d: Fix leak in intel_pasid_alloc_table on error path iommu/vt-d: Make kernel parameter igfx_off work with vIOMMU iommu/vt-d: Set intel_iommu_gfx_mapped correctly iommu/amd: Flush not present cache in iommu_map_page iommu/vt-d: Cleanup: no spaces at the start of a line iommu/vt-d: Don't request page request irq under dmar_global_lock iommu/vt-d: Use struct_size() helper iommu/mediatek: Fix leaked of_node references iommu/amd: Remove amd_iommu_pd_list iommu/arm-smmu: Log CBFRSYNRA register on context fault iommu/arm-smmu-v3: Don't disable SMMU in kdump kernel iommu/arm-smmu-v3: Disable tagged pointers iommu/arm-smmu-v3: Add support for PCI ATS iommu/arm-smmu-v3: Link domains and devices iommu/arm-smmu-v3: Add a master->domain pointer iommu/arm-smmu-v3: Store SteamIDs in master iommu/arm-smmu-v3: Rename arm_smmu_master_data to arm_smmu_master ACPI/IORT: Check ATS capability in root complex nodes ...
2019-05-12Merge tag 'upstream-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifsLinus Torvalds1-0/+11
Pull UBI/UBIFS updates from Richard Weinberger: - fscrypt framework usage updates - One huge fix for xattr unlink - Cleanup of fscrypt ifdefs - Fix for our new UBIFS auth feature * tag 'upstream-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: wl: Fix uninitialized variable ubifs: Drop unnecessary setting of zbr->znode ubifs: Remove ifdefs around CONFIG_UBIFS_ATIME_SUPPORT ubifs: Remove #ifdef around CONFIG_FS_ENCRYPTION ubifs: Limit number of xattrs per inode ubifs: orphan: Handle xattrs like files ubifs: journal: Handle xattrs like files ubifs: find.c: replace swap function with built-in one ubifs: Do not skip hash checking in data nodes ubifs: work around high stack usage with clang ubifs: remove unused function __ubifs_shash_final ubifs: remove unnecessary #ifdef around fscrypt_ioctl_get_policy() ubifs: remove unnecessary calls to set up directory key
2019-05-12Merge tag 'mtd/for-5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mtd/linuxLinus Torvalds8-74/+146
Pull MTD updates from Richard Weinberger: "MTD core changes: - New AFS partition parser - Update MAINTAINERS entry - Use of fall-throughs markers NAND core changes: - Support having the bad block markers in either the first, second or last page of a block. The combination of all three location is now possible. - Constification of NAND_OP_PARSER(_PATTERN) elements. - Generic NAND DT bindings changed to yaml format (can be used to check the proposed bindings. First platform to be fully supported: sunxi. - Stopped using several legacy hooks. - Preparation to use the generic NAND layer with the addition of several helpers and the removal of the struct nand_chip from generic functions. - Kconfig cleanup to prepare the introduction of external ECC engines support. - Fallthrough comments. - Introduction of the SPI-mem dirmap API for SPI-NAND devices. Raw NAND controller drivers changes: - nandsim: - Switch to ->exec-op(). - meson: - Misc cleanups and fixes. - New OOB layout. - Sunxi: - A23/A33 NAND DMA support. - Ingenic: - Full reorganization and cleanup. - Clear separation between NAND controller and ECC engine. - Support JZ4740 an JZ4725B. - Denali: - Clear controller/chip separation. - ->exec_op() migration. - Various cleanups. - fsl_elbc: - Enable software ECC support. - Atmel: - Sam9x60 support. - GPMI: - Introduce the GPMI_IS_MXS() macro. - Various trivial/spelling/coding style fixes. SPI NOR core changes: - Print all JEDEC ID bytes on error - Fix comment of spi_nor_find_best_erase_type() - Add region locking flags for s25fl512s SPI NOR controller drivers changes: - intel-spi: - Avoid crossing 4K address boundary on read/write - Add support for Intel Comet Lake SPI serial flash" * tag 'mtd/for-5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (120 commits) mtd: part: fix incorrect format specifier for an unsigned long long mtd: lpddr_cmds: Mark expected switch fall-through mtd: phram: Mark expected switch fall-throughs mtd: cfi_cmdset_0002: Mark expected switch fall-throughs mtd: cfi_util: mark expected switch fall-throughs MAINTAINERS: MTD Git repository is hosted on kernel.org MAINTAINERS: Update jffs2 entry mtd: afs: add v2 partition parsing mtd: afs: factor the IIS read into partition parser mtd: afs: factor footer parsing into the v1 part parsing mtd: factor out v1 partition parsing mtd: afs: simplify partition detection mtd: afs: simplify partition parsing mtd: partitions: Add OF support to AFS partitions mtd: partitions: Add AFS partitions DT bindings mtd: afs: Move AFS partition parser to parsers subdir mtd: maps: Make uclinux_ram_map static mtd: maps: Allow MTD_PHYSMAP with MTD_RAM MAINTAINERS: Add myself as MTD maintainer MAINTAINERS: Remove my name from the MTD and NAND entries ...
2019-05-12Merge tag 'tag-chrome-platform-for-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linuxLinus Torvalds1-20/+2
Pull chrome platform updates from Benson Leung: "CrOS EC: - Add EC host command support using rpmsg - Add new CrOS USB PD logging driver - Transfer spi messages at high priority - Add support to trace CrOS EC commands - Minor fixes and cleanups in protocol and debugfs Wilco EC: - Standardize Wilco EC mailbox interface - Add h1_gpio status to debugfs" * tag 'tag-chrome-platform-for-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: cros_ec_proto: Add trace event to trace EC commands platform/chrome: cros_ec_debugfs: Use cros_ec_cmd_xfer_status helper platform/chrome: cros_ec: Add EC host command support using rpmsg platform/chrome: wilco_ec: Add h1_gpio status to debugfs platform/chrome: wilco_ec: Standardize mailbox interface platform/chrome: cros_ec_proto: check for NULL transfer function platform/chrome: Add CrOS USB PD logging driver platform/chrome: cros_ec_spi: Transfer messages at high priority platform/chrome: cros_ec_debugfs: no need to check return value of debugfs_create functions platform/chrome: cros_ec_debugfs: Remove dev_warn when console log is not supported
2019-05-11Merge tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpioLinus Torvalds4-24/+23
Pull gpio updates from Linus Walleij: "This is the bulk of the GPIO changes for the v5.2 kernel cycle. A bit later than usual because I was ironing out my own mistakes. I'm holding some stuff back for the next kernel as a result, and this should be a healthy and well tested batch. Core changes: - The gpiolib MMIO driver has been enhanced to handle two direction registers, i.e. one register to set lines as input and one register to set lines as output. It turns out some silicon engineer thinks the ability to configure a line as input and output at the same time makes sense, this can be debated but includes a lot of analog electronics reasoning, and the registers are there and need to be handled consistently. Unsurprisingly, we enforce the lines to be either inputs or outputs in such schemes. - Send in the proper argument value to .set_config() dispatched to the pin control subsystem. Nobody used it before, now someone does, so fix it to work as expected. - The ACPI gpiolib portions can now handle pin bias setting (pull up or pull down). This has been in the ACPI spec for years and we finally have it properly integrated with Linux GPIOs. It was based on an observation from Andy Schevchenko that Thomas Petazzoni's changes to the core for biasing the PCA950x GPIO expander actually happen to fit hand-in-glove with what the ACPI core needed. Such nice synergies happen sometimes. New drivers: - A new driver for the Mellanox BlueField GPIO controller. This is using 64bit MMIO registers and can configure lines as inputs and outputs at the same time and after improving the MMIO library we handle it just fine. Interesting. - A new IXP4xx proper gpiochip driver with hierarchical interrupts should be coming in from the ARM SoC tree as well. Driver enhancements: - The PCA053x driver handles the CAT9554 GPIO expander. - The PCA053x driver handles the NXP PCAL6416 GPIO expander. - Wake-up support on PCA053x GPIO lines. - OMAP now does a nice asynchronous IRQ handling on wake-ups by letting everything wake up on edges, and this makes runtime PM work as expected too. Misc: - Several cleanups such as devres fixes. - Get rid of some languager comstructs that cause problems when compiling with LLVMs clang. - Documentation review and update" * tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (85 commits) gpio: Update documentation docs: gpio: convert docs to ReST and rename to *.rst gpio: sch: Remove write-only core_base gpio: pxa: Make two symbols static gpiolib: acpi: Respect pin bias setting gpiolib: acpi: Add acpi_gpio_update_gpiod_lookup_flags() helper gpiolib: acpi: Set pin value, based on bias, more accurately gpiolib: acpi: Change type of dflags gpiolib: Introduce GPIO_LOOKUP_FLAGS_DEFAULT gpiolib: Make use of enum gpio_lookup_flags consistent gpiolib: Indent entry values of enum gpio_lookup_flags gpio: pca953x: add support for pca6416 dt-bindings: gpio: pca953x: document the nxp,pca6416 gpio: pca953x: add pcal6416 to the of_device_id table gpio: gpio-omap: Remove conditional pm_runtime handling for GPIO interrupts gpio: gpio-omap: configure edge detection for level IRQs for idle wakeup tracing: stop making gpio tracing configurable gpio: pca953x: Configure wake-up path when wake-up is enabled gpio: of: Optimize quirk checks gpio: mmio: Drop bgpio_dir_inverted ...