aboutsummaryrefslogtreecommitdiffstats
path: root/mm/page_alloc.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-07-18mm/sparsemem: support sub-section hotplugDan Williams1-1/+1
The libnvdimm sub-system has suffered a series of hacks and broken workarounds for the memory-hotplug implementation's awkward section-aligned (128MB) granularity. For example the following backtrace is emitted when attempting arch_add_memory() with physical address ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section: # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory 100000000-1ffffffff : System RAM 200000000-303ffffff : Persistent Memory (legacy) 304000000-43fffffff : System RAM 440000000-23ffffffff : Persistent Memory 2400000000-43bfffffff : Persistent Memory 2400000000-43bfffffff : namespace2.0 WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60 [..] RIP: 0010:add_pages+0x5c/0x60 [..] Call Trace: devm_memremap_pages+0x460/0x6e0 pmem_attach_disk+0x29e/0x680 [nd_pmem] ? nd_dax_probe+0xfc/0x120 [libnvdimm] nvdimm_bus_probe+0x66/0x160 [libnvdimm] It was discovered that the problem goes beyond RAM vs PMEM collisions as some platform produce PMEM vs PMEM collisions within a given section. The libnvdimm workaround for that case revealed that the libnvdimm section-alignment-padding implementation has been broken for a long while. A fix for that long-standing breakage introduces as many problems as it solves as it would require a backward-incompatible change to the namespace metadata interpretation. Instead of that dubious route [1], address the root problem in the memory-hotplug implementation. Note that EEXIST is no longer treated as success as that is how sparse_add_section() reports subsection collisions, it was also obviated by recent changes to perform the request_region() for 'System RAM' before arch_add_memory() in the add_memory() sequence. [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [osalvador@suse.de: fix deactivate_section for early sections] Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: kill is_dev_zone() helperDan Williams1-1/+1
Given there are no more usages of is_dev_zone() outside of 'ifdef CONFIG_ZONE_DEVICE' protection, kill off the compilation helper. Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm/sparsemem: add helpers track active portions of a section at bootDan Williams1-2/+8
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a sub-section active bitmask, each bit representing a PMD_SIZE span of the architecture's memory hotplug section size. The implications of a partially populated section is that pfn_valid() needs to go beyond a valid_section() check and either determine that the section is an "early section", or read the sub-section active ranges from the bitmask. The expectation is that the bitmask (subsection_map) fits in the same cacheline as the valid_section() / early_section() data, so the incremental performance overhead to pfn_valid() should be negligible. The rationale for using early_section() to short-ciruit the subsection_map check is that there are legacy code paths that use pfn_valid() at section granularity before validating the pfn against pgdat data. So, the early_section() check allows those traditional assumptions to persist while also permitting subsection_map to tell the truth for purposes of populating the unused portions of early sections with PMEM and other ZONE_DEVICE mappings. Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Jane Chu <jane.chu@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm/sparsemem: introduce struct mem_section_usageDan Williams1-1/+1
Patch series "mm: Sub-section memory hotplug support", v10. The memory hotplug section is an arbitrary / convenient unit for memory hotplug. 'Section-size' units have bled into the user interface ('memblock' sysfs) and can not be changed without breaking existing userspace. The section-size constraint, while mostly benign for typical memory hotplug, has and continues to wreak havoc with 'device-memory' use cases, persistent memory (pmem) in particular. Recall that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a 'struct page' memmap for pmem. However, it does not use the 'bottom half' of memory hotplug, i.e. never marks pmem pages online and never exposes the userspace memblock interface for pmem. This leaves an opening to redress the section-size constraint. To date, the libnvdimm subsystem has attempted to inject padding to satisfy the internal constraints of arch_add_memory(). Beyond complicating the code, leading to bugs [2], wasting memory, and limiting configuration flexibility, the padding hack is broken when the platform changes this physical memory alignment of pmem from one boot to the next. Device failure (intermittent or permanent) and physical reconfiguration are events that can cause the platform firmware to change the physical placement of pmem on a subsequent boot, and device failure is an everyday event in a data-center. It turns out that sections are only a hard requirement of the user-facing interface for memory hotplug and with a bit more infrastructure sub-section arch_add_memory() support can be added for kernel internal usages like devm_memremap_pages(). Here is an analysis of the current design assumptions in the current code and how they are addressed in the new implementation: Current design assumptions: - Sections that describe boot memory (early sections) are never unplugged / removed. - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a valid_section() check - __add_pages() and helper routines assume all operations occur in PAGES_PER_SECTION units. - The memblock sysfs interface only comprehends full sections New design assumptions: - Sections are instrumented with a sub-section bitmask to track (on x86) individual 2MB sub-divisions of a 128MB section. - Partially populated early sections can be extended with additional sub-sections, and those sub-sections can be removed with arch_remove_memory(). With this in place we no longer lose usable memory capacity to padding. - pfn_valid() is updated to look deeper than valid_section() to also check the active-sub-section mask. This indication is in the same cacheline as the valid_section() so the performance impact is expected to be negligible. So far the lkp robot has not reported any regressions. - Outside of the core vmemmap population routines which are replaced, other helper routines like shrink_{zone,pgdat}_span() are updated to handle the smaller granularity. Core memory hotplug routines that deal with online memory are not touched. - The existing memblock sysfs user api guarantees / assumptions are not touched since this capability is limited to !online !memblock-sysfs-accessible sections. Meanwhile the issue reports continue to roll in from users that do not understand when and how the 128MB constraint will bite them. The current implementation relied on being able to support at least one misaligned namespace, but that immediately falls over on any moderately complex namespace creation attempt. Beyond the initial problem of 'System RAM' colliding with pmem, and the unsolvable problem of physical alignment changes, Linux is now being exposed to platforms that collide pmem ranges with other pmem ranges by default [3]. In short, devm_memremap_pages() has pushed the venerable section-size constraint past the breaking point, and the simplicity of section-aligned arch_add_memory() is no longer tenable. These patches are exposed to the kbuild robot on a subsection-v10 branch [4], and a preview of the unit test for this functionality is available on the 'subsection-pending' branch of ndctl [5]. [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [3]: https://github.com/pmem/ndctl/issues/76 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10 [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c This patch (of 13): Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section. SUBSECTION_SHIFT is defined as global constant instead of per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of subsection users. Specifically a common subsection size allows for the possibility that persistent memory namespace configurations be made compatible across architectures. The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users. Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to usemap allocation path, there are no expected behavior changes. Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Qian Cai <cai@lca.pw> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16mm/vmscan.c: add a new member reclaim_state in struct shrink_controlYafang Shao1-4/+0
Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths". This patchset is to fix the issues in doing shrink slab. There're six different reclaim paths by now, - kswapd reclaim path - node reclaim path - hibernate preallocate memory reclaim path - direct reclaim path - memcg reclaim path - memcg softlimit reclaim path The slab caches reclaimed in these paths are only calculated in the above three paths. The issues are detailed explained in patch #2. We should calculate the reclaimed slab caches in every reclaim path. In order to do it, the struct reclaim_state is placed into the struct shrink_control. In node reclaim path, there'is another issue about shrinking slab, which is adressed in "mm/vmscan: shrink slab in node reclaim" (https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/). This patch (of 2): The struct reclaim_state is used to record how many slab caches are reclaimed in one reclaim path. The struct shrink_control is used to control one reclaim path. So we'd better put reclaim_state into shrink_control. [laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()] Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-14Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-7/+6
Pull HMM updates from Jason Gunthorpe: "Improvements and bug fixes for the hmm interface in the kernel: - Improve clarity, locking and APIs related to the 'hmm mirror' feature merged last cycle. In linux-next we now see AMDGPU and nouveau to be using this API. - Remove old or transitional hmm APIs. These are hold overs from the past with no users, or APIs that existed only to manage cross tree conflicts. There are still a few more of these cleanups that didn't make the merge window cut off. - Improve some core mm APIs: - export alloc_pages_vma() for driver use - refactor into devm_request_free_mem_region() to manage DEVICE_PRIVATE resource reservations - refactor duplicative driver code into the core dev_pagemap struct - Remove hmm wrappers of improved core mm APIs, instead have drivers use the simplified API directly - Remove DEVICE_PUBLIC - Simplify the kconfig flow for the hmm users and core code" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits) mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR mm: remove the HMM config option mm: sort out the DEVICE_PRIVATE Kconfig mess mm: simplify ZONE_DEVICE page private data mm: remove hmm_devmem_add mm: remove hmm_vma_alloc_locked_page nouveau: use devm_memremap_pages directly nouveau: use alloc_page_vma directly PCI/P2PDMA: use the dev_pagemap internal refcount device-dax: use the dev_pagemap internal refcount memremap: provide an optional internal refcount in struct dev_pagemap memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag memremap: remove the data field in struct dev_pagemap memremap: add a migrate_to_ram method to struct dev_pagemap_ops memremap: lift the devmap_enable manipulation into devm_memremap_pages memremap: pass a struct dev_pagemap to ->kill and ->cleanup memremap: move dev_pagemap callbacks into a separate structure memremap: validate the pagemap type passed to devm_memremap_pages mm: factor out a devm_request_free_mem_region helper mm: export alloc_pages_vma ...
2019-07-12mm: security: introduce init_on_alloc=1 and init_on_free=1 boot optionsAlexander Potapenko1-7/+64
Patch series "add init_on_alloc/init_on_free boot options", v10. Provide init_on_alloc and init_on_free boot options. These are aimed at preventing possible information leaks and making the control-flow bugs that depend on uninitialized values more deterministic. Enabling either of the options guarantees that the memory returned by the page allocator and SL[AU]B is initialized with zeroes. SLOB allocator isn't supported at the moment, as its emulation of kmem caches complicates handling of SLAB_TYPESAFE_BY_RCU caches correctly. Enabling init_on_free also guarantees that pages and heap objects are initialized right after they're freed, so it won't be possible to access stale data by using a dangling pointer. As suggested by Michal Hocko, right now we don't let the heap users to disable initialization for certain allocations. There's not enough evidence that doing so can speed up real-life cases, and introducing ways to opt-out may result in things going out of control. This patch (of 2): The new options are needed to prevent possible information leaks and make control-flow bugs that depend on uninitialized values more deterministic. This is expected to be on-by-default on Android and Chrome OS. And it gives the opportunity for anyone else to use it under distros too via the boot args. (The init_on_free feature is regularly requested by folks where memory forensics is included in their threat models.) init_on_alloc=1 makes the kernel initialize newly allocated pages and heap objects with zeroes. Initialization is done at allocation time at the places where checks for __GFP_ZERO are performed. init_on_free=1 makes the kernel initialize freed pages and heap objects with zeroes upon their deletion. This helps to ensure sensitive data doesn't leak via use-after-free accesses. Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator returns zeroed memory. The two exceptions are slab caches with constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never zero-initialized to preserve their semantics. Both init_on_alloc and init_on_free default to zero, but those defaults can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON. If either SLUB poisoning or page poisoning is enabled, those options take precedence over init_on_alloc and init_on_free: initialization is only applied to unpoisoned allocations. Slowdown for the new features compared to init_on_free=0, init_on_alloc=0: hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%) hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%) Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%) Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%) Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%) Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%) The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline is within the standard error. The new features are also going to pave the way for hardware memory tagging (e.g. arm64's MTE), which will require both on_alloc and on_free hooks to set the tags for heap objects. With MTE, tagging will have the same cost as memory initialization. Although init_on_free is rather costly, there are paranoid use-cases where in-memory data lifetime is desired to be minimized. There are various arguments for/against the realism of the associated threat models, but given that we'll need the infrastructure for MTE anyway, and there are people who want wipe-on-free behavior no matter what the performance cost, it seems reasonable to include it in this series. [glider@google.com: v8] Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com [glider@google.com: v9] Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com [glider@google.com: v10] Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts Acked-by: James Morris <jamorris@linux.microsoft.com>] Cc: Christoph Lameter <cl@linux.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sandeep Patil <sspatil@android.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/large system hash: clear hashdist when only one node with memory is bootedNicholas Piggin1-13/+18
CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even when booting on single node machines. This causes the large system hashes to be allocated with vmalloc, and mapped with small pages. This change clears hashdist if only one node has come up with memory. This results in the important large inode and dentry hashes using memblock allocations. All others are within 4MB size up to about 128GB of RAM, which allows them to be allocated from the linear map on most non-NUMA images. Other big hashes like futex and TCP should eventually be moved over to the same style of allocation as those vfs caches that use HASH_EARLY if !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images. This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off (performance is in the noise, under 1% difference, page tables are likely to be well cached for this workload). Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdistNicholas Piggin1-7/+9
The kernel currently clamps large system hashes to MAX_ORDER when hashdist is not set, which is rather arbitrary. vmalloc space is limited on 32-bit machines, but this shouldn't result in much more used because of small physical memory limiting system hash sizes. Include "vmalloc" or "linear" in the kernel log message. Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, debug_pagealloc: use a page type instead of page_ext flagVlastimil Babka1-34/+6
When debug_pagealloc is enabled, we currently allocate the page_ext array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that we have the page_type field in struct page, we can use that instead, as guard pages are neither PageSlab nor mapped to userspace. This reduces memory overhead when debug_pagealloc is enabled and there are no other features requiring the page_ext array. Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, page_alloc: more extensive free page checking with debug_pageallocVlastimil Babka1-10/+43
The page allocator checks struct pages for expected state (mapcount, flags etc) as pages are being allocated (check_new_page()) and freed (free_pages_check()) to provide some defense against errors in page allocator users. Prior commits 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") and 4db7548ccbd9 ("mm, page_alloc: defer debugging checks of freed pages until a PCP drain") this has happened for order-0 pages as they were allocated from or freed to the per-cpu caches (pcplists). Since those are fast paths, the checks are now performed only when pages are moved between pcplists and global free lists. This however lowers the chances of catching errors soon enough. In order to increase the chances of the checks to catch errors, the kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables multiple other internal debug checks (VM_BUG_ON() etc), which is suboptimal when the goal is to catch errors in mm users, not in mm code itself. To catch some wrong users of the page allocator we have CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead unless enabled at boot time. Memory corruptions when writing to freed pages have often the same underlying errors (use-after-free, double free) as corrupting the corresponding struct pages, so this existing debugging functionality is a good fit to extend by also perform struct page checks at least as often as if CONFIG_DEBUG_VM was enabled. Specifically, after this patch, when debug_pagealloc is enabled on boot, and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or freed to the pcplists *in addition* to being moved between pcplists and free lists. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled, pages are checked when being moved between pcplists and free lists *in addition* to when allocated from or freed to the pcplists. When debug_pagealloc is not enabled on boot, the overhead in fast paths should be virtually none thanks to the use of static key. Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm, debug_pagelloc: use static keys to enable debuggingVlastimil Babka1-6/+17
Patch series "debug_pagealloc improvements". I have been recently debugging some pcplist corruptions, where it would be useful to perform struct page checks immediately as pages are allocated from and freed to pcplists, which is now only possible by rebuilding the kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog). To make this kind of debugging simpler in future on a distro kernel, I have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead when not enabled at boot time (Patch 1) and also when enabled (Patch 3), and extended it to perform the struct page checks more often when enabled (Patch 2). Now it can be configured in when building a distro kernel without extra overhead, and debugging page use after free or double free can be enabled simply by rebooting with debug_pagealloc=on. This patch (of 3): CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to allow being always enabled in a distro kernel, but only perform its expensive functionality when booted with debug_pagelloc=on. We can further reduce the overhead when not boot-enabled (including page allocator fast paths) using static keys. This patch introduces one for debug_pagealloc core functionality, and another for the optional guard page functionality (enabled by booting with debug_guardpage_minorder=X). Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: remove the exporting of totalram_pagesDenis Efremov1-2/+0
Previously totalram_pages was the global variable. Currently, totalram_pages is the static inline function from the include/linux/mm.h However, the function is also marked as EXPORT_SYMBOL, which is at best an odd combination. Because there is no point for the static inline function from a public header to be exported, this commit removes the EXPORT_SYMBOL() marking. It will be still possible to use the function in modules because all the symbols it depends on are exported. Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com Fixes: ca79b0c211af6 ("mm: convert totalram_pages and totalhigh_pages variables to atomic") Signed-off-by: Denis Efremov <efremov@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05mm/page_alloc.c: fix regression with deferred struct page initJuergen Gross1-1/+2
Commit 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections") is causing a regression on some systems when the kernel is booted as Xen dom0. The system will just hang in early boot. Reason is an endless loop in get_page_from_freelist() in case the first zone looked at has no free memory. deferred_grow_zone() is always returning true due to the following code snipplet: /* If the zone is empty somebody else may have cleared out the zone */ if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, first_deferred_pfn)) { pgdat->first_deferred_pfn = ULONG_MAX; pgdat_resize_unlock(pgdat, &flags); return true; } This in turn results in the loop as get_page_from_freelist() is assuming forward progress can be made by doing some more struct page initialization. Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com Fixes: 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections") Signed-off-by: Juergen Gross <jgross@suse.com> Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-02mm: simplify ZONE_DEVICE page private dataChristoph Hellwig1-4/+4
Remove the clumsy hmm_devmem_page_{get,set}_drvdata helpers, and instead just access the page directly. Also make the page data a void pointer, and thus much easier to use. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flagChristoph Hellwig1-3/+2
Add a flags field to struct dev_pagemap to replace the altmap_valid boolean to be a little more extensible. Also add a pgmap_altmap() helper to find the optional altmap and clean up the code using the altmap using it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-14mm: maintain randomization of page free listsDan Williams1-2/+9
When freeing a page with an order >= shuffle_page_order randomly select the front or back of the list for insertion. While the mm tries to defragment physical pages into huge pages this can tend to make the page allocator more predictable over time. Inject the front-back randomness to preserve the initial randomness established by shuffle_free_memory() when the kernel was booted. The overhead of this manipulation is constrained by only being applied for MAX_ORDER sized pages by default. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move buddy list manipulations into helpersDan Williams1-43/+22
In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [dan.j.williams@intel.com: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: shuffle initial free memory to improve memory-side-cache utilizationDan Williams1-1/+5
Patch series "mm: Randomize free memory", v10. This patch (of 3): Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM [3]: https://lkml.org/lkml/2018/10/12/309 [dan.j.williams@intel.com: fix shuffle enable] Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts] Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: update references to page _refcountBaruch Siach1-1/+1
Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") left out a couple of references to the old field name. Fix that. Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memblock: make keeping memblock memory opt-in rather than opt-outMike Rapoport1-2/+1
Most architectures do not need the memblock memory after the page allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the arch Kconfig. Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the logic makes it clear which architectures actually use memblock after system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK to the architectures that are still missing that option. Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@mips.com> Cc: James Hogan <jhogan@kernel.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplistYafang Shao1-6/+5
Because rmqueue_pcplist() is only called when order is 0, we don't need to use order as a parameter. Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, memory_hotplug: cleanup memory offline pathMichal Hocko1-2/+9
check_pages_isolated_cb currently accounts the whole pfn range as being offlined if test_pages_isolated suceeds on the range. This is based on the assumption that all pages in the range are freed which is currently the case in most cases but it won't be with later changes, as pages marked as vmemmap won't be isolated. Move the offlined pages counting to offline_isolated_pages_cb and rely on __offline_isolated_pages to return the correct value. check_pages_isolated_cb will still do it's primary job and check the pfn range. While we are at it remove check_pages_isolated and offline_isolated_pages and use directly walk_system_ram_range as do in online_pages. Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sectionsAlexander Duyck1-41/+121
Add yet another iterator, for_each_free_mem_range_in_zone_from, and then use it to support initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache locality of the pages while we do several loops over them in the init and freeing process. We are able to tighten the loops further as a result of the "from" iterator as we can perform the initial checks for first_init_pfn in our first call to the iterator, and continue without the need for those checks via the "from" iterator. I have added this functionality in the function called deferred_init_mem_pfn_range_in_zone that primes the iterator and causes us to exit if we encounter any failure. On my x86_64 test system with 384GB of memory per node I saw a reduction in initialization time from 1.85s to 1.38s as a result of this patch. Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <yi.z.zhang@linux.intel.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David S. Miller <davem@davemloft.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: implement new zone specific memblock iteratorAlexander Duyck1-19/+12
Introduce a new iterator for_each_free_mem_pfn_range_in_zone. This iterator will take care of making sure a given memory range provided is in fact contained within a zone. It takes are of all the bounds checking we were doing in deferred_grow_zone, and deferred_init_memmap. In addition it should help to speed up the search a bit by iterating until the end of a range is greater than the start of the zone pfn range, and will exit completely if the start is beyond the end of the zone. Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <yi.z.zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: drop meminit_pfn_in_nid as it is redundantAlexander Duyck1-37/+14
As best as I can tell the meminit_pfn_in_nid call is completely redundant. The deferred memory initialization is already making use of for_each_free_mem_range which in turn will call into __next_mem_range which will only return a memory range if it matches the node ID provided assuming it is not NUMA_NO_NODE. I am operating on the assumption that there are no zones or pgdata_t structures that have a NUMA node of NUMA_NO_NODE associated with them. If that is the case then __next_mem_range will never return a memory range that doesn't match the zone's node ID and as such the check is redundant. So one piece I would like to verify on this is if this works for ia64. Technically it was using a different approach to get the node ID, but it seems to have the node ID also encoded into the memblock. So I am assuming this is okay, but would like to get confirmation on that. On my x86_64 test system with 384GB of memory per node I saw a reduction in initialization time from 2.80s to 1.85s as a result of this patch. Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <yi.z.zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLELinxu Fang1-2/+4
342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option") and later patches rewrote the calculation of node spanned pages. e506b99696a2 ("mem-hotplug: fix node spanned pages when we have a movable node"), but the current code still has problems, When we have a node with only zone_movable and the node id is not zero, the size of node spanned pages is double added. That's because we have an empty normal zone, and zone_start_pfn or zone_end_pfn is not between arch_zone_lowest_possible_pfn and arch_zone_highest_possible_pfn, so we need to use clamp to constrain the range just like the commit <96e907d13602> (bootmem: Reimplement __absent_pages_in_range() using for_each_mem_pfn_range()). e.g. Zone ranges: DMA [mem 0x0000000000001000-0x0000000000ffffff] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] Normal [mem 0x0000000100000000-0x000000023fffffff] Movable zone start for each node Node 0: 0x0000000100000000 Node 1: 0x0000000140000000 Early memory node ranges node 0: [mem 0x0000000000001000-0x000000000009efff] node 0: [mem 0x0000000000100000-0x00000000bffdffff] node 0: [mem 0x0000000100000000-0x000000013fffffff] node 1: [mem 0x0000000140000000-0x000000023fffffff] node 0 DMA spanned:0xfff present:0xf9e absent:0x61 node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020 node 0 Normal spanned:0 present:0 absent:0 node 0 Movable spanned:0x40000 present:0x40000 absent:0 On node 0 totalpages(node_present_pages): 1048446 node_spanned_pages:1310719 node 1 DMA spanned:0 present:0 absent:0 node 1 DMA32 spanned:0 present:0 absent:0 node 1 Normal spanned:0x100000 present:0x100000 absent:0 node 1 Movable spanned:0x100000 present:0x100000 absent:0 On node 1 totalpages(node_present_pages): 2097152 node_spanned_pages:2097152 Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata, 4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K cma-reserved) It shows that the current memory of node 1 is double added. After this patch, the problem is fixed. node 0 DMA spanned:0xfff present:0xf9e absent:0x61 node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020 node 0 Normal spanned:0 present:0 absent:0 node 0 Movable spanned:0x40000 present:0x40000 absent:0 On node 0 totalpages(node_present_pages): 1048446 node_spanned_pages:1310719 node 1 DMA spanned:0 present:0 absent:0 node 1 DMA32 spanned:0 present:0 absent:0 node 1 Normal spanned:0 present:0 absent:0 node 1 Movable spanned:0x100000 present:0x100000 absent:0 On node 1 totalpages(node_present_pages): 1048576 node_spanned_pages:1048576 memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata, 4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K cma-reserved) Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com Signed-off-by: Linxu Fang <fanglinxu@huawei.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14hugetlb: allow to free gigantic pages regardless of the configurationAlexandre Ghiti1-2/+2
On systems without CONTIG_ALLOC activated but that support gigantic pages, boottime reserved gigantic pages can not be freed at all. This patch simply enables the possibility to hand back those pages to memory allocator. Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: David S. Miller <davem@davemloft.net> [sparc] Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOCAlexandre Ghiti1-2/+1
This condition allows to define alloc_contig_range, so simplify it into a more accurate naming. Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact()Vlastimil Babka1-3/+11
alloc_pages_exact*() allocates a page of sufficient order and then splits it to return only the number of pages requested. That makes it incompatible with __GFP_COMP, because compound pages cannot be split. As shown by [1] things may silently work until the requested size (possibly depending on user) stops being power of two. Then for CONFIG_DEBUG_VM, BUG_ON() triggers in split_page(). Without CONFIG_DEBUG_VM, consequences are unclear. There are several options here, none of them great: 1) Don't do the splitting when __GFP_COMP is passed, and return the whole compound page. However if caller then returns it via free_pages_exact(), that will be unexpected and the freeing actions there will be wrong. 2) Warn and remove __GFP_COMP from the flags. But the caller may have really wanted it, so things may break later somewhere. 3) Warn and return NULL. However NULL may be unexpected, especially for small sizes. This patch picks option 2, because as Michal Hocko put it: "callers wanted it" is much less probable than "caller is simply confused and more gfp flags is surely better than fewer". [1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-30mm/hibernation: Make hibernation handle unmapped pagesRick Edgecombe1-2/+5
Make hibernate handle unmapped pages on the direct map when CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages to invalid configurations, so now hibernate should check if the pages have valid mappings and handle if they are unmapped when doing a hibernate save operation. Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y was configured. It does not appear to have a big hibernating performance impact. The speed of the saving operation before this change was measured as 819.02 MB/s, and after was measured at 813.32 MB/s. Before: [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s) After: [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s) Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-26mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flagAndrey Ryabinin1-3/+3
Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake") removed setting of the ALLOC_NOFRAGMENT flag. Bring it back. The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so that allocations are spread across local zones to avoid fragmentation due to mixing pageblocks as long as possible. Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-26mm/page_alloc.c: avoid potential NULL pointer dereferenceAndrey Ryabinin1-0/+3
ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL. 'zone' pointer unconditionally derefernced in alloc_flags_nofragment(). Bail out on NULL zone to avoid potential crash. Currently we don't see any crashes only because alloc_flags_nofragment() has another bug which allows compiler to optimize away all accesses to 'zone'. Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-26mm, page_alloc: always use a captured page regardless of compaction resultMel Gorman1-5/+0
During the development of commit 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction"), a paranoid check was added to ensure that if a captured page was available after compaction that it was consistent with the final state of compaction. The intent was to catch serious programming bugs such as using a stale page pointer and causing corruption problems. However, it is possible to get a captured page even if compaction was unsuccessful if an interrupt triggered and happened to free pages in interrupt context that got merged into a suitable high-order page. It's highly unlikely but Li Wang did report the following warning on s390 occuring when testing OOM handling. Note that the warning is slightly edited for clarity. WARNING: CPU: 0 PID: 9783 at mm/page_alloc.c:3777 __alloc_pages_direct_compact+0x182/0x190 Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc pkey ghash_s390 prng xts aes_s390 des_s390 des_generic sha512_s390 zcrypt_cex4 zcrypt vmur binfmt_misc ip_tables xfs libcrc32c dasd_fba_mod qeth_l2 dasd_eckd_mod dasd_mod qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 9783 Comm: copy.sh Kdump: loaded Not tainted 5.1.0-rc 5 #1 This patch simply removes the check entirely instead of trying to be clever about pages freed from interrupt context. If a serious programming error was introduced, it is highly likely to be caught by prep_new_page() instead. Link: http://lkml.kernel.org/r/20190419085133.GH18914@techsingularity.net Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Li Wang <liwang@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-26mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory modelMel Gorman1-0/+13
Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") "broke" memory management on parisc. The machine is not NUMA but the DISCONTIG model creates three pgdats even though it's a UMA machine for the following ranges 0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB 1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB 2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB Mikulas reported: With the patch 1c30844d2, the kernel will incorrectly reclaim the first zone when it fills up, ignoring the fact that there are two completely free zones. Basiscally, it limits cache size to 1GiB. For example, if I run: # dd if=/dev/sda of=/dev/null bs=1M count=2048 - with the proper kernel, there should be "Buffers - 2GiB" when this command finishes. With the patch 1c30844d2, buffers will consume just 1GiB or slightly more, because the kernel was incorrectly reclaiming them. The page allocator and reclaim makes assumptions that pgdats really represent NUMA nodes and zones represent ranges and makes decisions on that basis. Watermark boosting for small pgdats leads to unexpected results even though this would have behaved reasonably on SPARSEMEM. DISCONTIG is essentially deprecated and even parisc plans to move to SPARSEMEM so there is no need to be fancy, this patch simply disables watermark boosting by default on DISCONTIGMEM. Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19mm/hotplug: treat CMA pages as unmovableQian Cai1-12/+18
has_unmovable_pages() is used by allocating CMA and gigantic pages as well as the memory hotplug. The later doesn't know how to offline CMA pool properly now, but if an unused (free) CMA page is encountered, then has_unmovable_pages() happily considers it as a free memory and propagates this up the call chain. Memory offlining code then frees the page without a proper CMA tear down which leads to an accounting issues. Moreover if the same memory range is onlined again then the memory never gets back to the CMA pool. State after memory offline: # grep cma /proc/vmstat nr_free_cma 205824 # cat /sys/kernel/debug/cma/cma-kvm_cma/count 209920 Also, kmemleak still think those memory address are reserved below but have already been used by the buddy allocator after onlining. This patch fixes the situation by treating CMA pageblocks as unmovable except when has_unmovable_pages() is called as part of CMA allocation. Offlined Pages 4096 kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing) Call Trace: dump_stack+0xb0/0xf4 (unreliable) create_object+0x344/0x380 __kmalloc_node+0x3ec/0x860 kvmalloc_node+0x58/0x110 seq_read+0x41c/0x620 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xc000201cc8000000 (size 13757317120): kmemleak: comm "swapper/0", pid 0, jiffies 4294937297 kmemleak: min_count = -1 kmemleak: count = 0 kmemleak: flags = 0x5 kmemleak: checksum = 0 kmemleak: backtrace: cma_declare_contiguous+0x2a4/0x3b0 kvm_cma_reserve+0x11c/0x134 setup_arch+0x300/0x3f8 start_kernel+0x9c/0x6e8 start_here_common+0x1c/0x4b0 kmemleak: Automatic memory scanning thread ended [cai@lca.pw: use is_migrate_cma_page() and update commit log] Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29mm/hotplug: fix offline undo_isolate_page_range()Qian Cai1-1/+1
Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") introduced move_pfn_range_to_zone() which calls memmap_init_zone() during onlining a memory block. memmap_init_zone() will reset pagetype flags and makes migrate type to be MOVABLE. However, in __offline_pages(), it also call undo_isolate_page_range() after offline_isolated_pages() to do the same thing. Due to commit 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed __first_valid_page() to skip offline pages, undo_isolate_page_range() here just waste CPU cycles looping around the offlining PFN range while doing nothing, because __first_valid_page() will return NULL as offline_isolated_pages() has already marked all memory sections within the pfn range as offline via offline_mem_sections(). Also, after calling the "useless" undo_isolate_page_range() here, it reaches the point of no returning by notifying MEM_OFFLINE. Those pages will be marked as MIGRATE_MOVABLE again once onlining. The only thing left to do is to decrease the number of isolated pageblocks zone counter which would make some paths of the page allocation slower that the above commit introduced. Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages on ppc64, an "int" should still be enough to represent the number of pageblocks there. Fix an incorrect comment along the way. [cai@lca.pw: v4] Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12memblock: drop memblock_alloc_*_nopanic() variantsMike Rapoport1-5/+5
As all the memblock allocation functions return NULL in case of error rather than panic(), the duplicates with _nopanic suffix can be removed. Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Petr Mladek <pmladek@suse.com> [printk] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: unexport free_reserved_areaChristoph Hellwig1-1/+0
This function is only used by built-in code, which makes perfect sense given the purpose of it. Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05docs/core-api/mm: fix return value descriptions in mm/Mike Rapoport1-6/+18
Many kernel-doc comments in mm/ have the return value descriptions either misformatted or omitted at all which makes kernel-doc script unhappy: $ make V=1 htmldocs ... ./mm/util.c:36: info: Scanning doc for kstrdup ./mm/util.c:41: warning: No description found for return value of 'kstrdup' ./mm/util.c:57: info: Scanning doc for kstrdup_const ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const' ./mm/util.c:75: info: Scanning doc for kstrndup ./mm/util.c:83: warning: No description found for return value of 'kstrndup' ... Fixing the formatting and adding the missing return value descriptions eliminates ~100 such warnings. Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05numa: make "nr_online_nodes" unsigned intAlexey Dobriyan1-2/+2
Number of online NUMA nodes can't be negative as well. This doesn't save space as the variable is used only in 32-bit context, but do it anyway for consistency. Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05numa: make "nr_node_ids" unsigned intAlexey Dobriyan1-1/+1
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()Mike Rapoport1-1/+8
There are two early memory allocations that use memblock_alloc_node_nopanic() and do not check its return value. While this happens very early during boot and chances that the allocation will fail are diminishing, it is still worth to have proper checks for the allocation errors. Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: fix some typos in mm directoryWei Yang1-2/+2
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: no need to check return value of debugfs_create functionsGreg Kroah-Hartman1-16/+6
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm, compaction: capture a page under direct compactionMel Gorman1-4/+69
Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm, compaction: ignore the fragmentation avoidance boost for isolation and compactionMel Gorman1-1/+1
When pageblocks get fragmented, watermarks are artifically boosted to reclaim pages to avoid further fragmentation events. However, compaction is often either fragmentation-neutral or moving movable pages away from unmovable/reclaimable pages. As the true watermarks are preserved, allow compaction to ignore the boost factor. The expected impact is very slight as the main benefit is that compaction is slightly more likely to succeed when the system has been fragmented very recently. On both 1-socket and 2-socket machines for THP-intensive allocation during fragmentation the success rate was increased by less than 1% which is marginal. However, detailed tracing indicated that failure of migration due to a premature ENOMEM triggered by watermark checks were eliminated. Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: remove extra drain pages on pcp listWei Yang1-1/+0
In the current implementation, there are two places to isolate a range of page: __offline_pages() and alloc_contig_range(). During this procedure, it will drain pages on pcp list. Below is a brief call flow: __offline_pages()/alloc_contig_range() start_isolate_page_range() set_migratetype_isolate() drain_all_pages() drain_all_pages() <--- A This snippet shows the current logic is isolate and drain pcp list for each pageblock and drain pcp list again for the whole range. start_isolate_page_range is responsible for isolating the given pfn range. One part of that job is to make sure that also pages that are on the allocator pcp lists are properly isolated. Otherwise they could be reused and the range wouldn't be completely isolated until the memory is freed back. While there is no strict guarantee here because pages might get allocated at any time before drain_all_pages is called there doesn't seem to be any strong demand for such a guarantee. In any case, draining is already done at the isolation level and there is no need to do it again later by start_isolate_page_range callers (memory hotplug and CMA allocator currently). Therefore remove pointless draining in existing callers to make the code more clear and functionally correct. [mhocko@suse.com: provide a clearer changelog for the last two paragraphs] Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05memcg: localize memcg_kmem_enabled() checkShakeel Butt1-2/+2
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge functions, so, the users don't have to explicitly check that condition. This is purely code cleanup patch without any functional change. Only the order of checks in memcg_charge_slab() can potentially be changed but the functionally it will be same. This should not matter as memcg_charge_slab() is not in the hot path. Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>