aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/mmzone.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-09-24mm: thp: extract split_queue_* into a structYang Shi1-3/+9
Patch series "Make deferred split shrinker memcg aware", v6. Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily: $ cgcreate -g memory:thp $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes $ cgexec -g memory:thp transhuge-stress 4000 transhuge-stress comes from kernel selftest. It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware. Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too. Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues. Make deferred split shrinker not depend on memcg kmem since it is not slab. It doesn't make sense to not shrink THP even though memcg kmem is disabled. With the above change the test demonstrated above doesn't trigger OOM even though with cgroup.memory=nokmem. This patch (of 4): Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to memcg aware in the later patches. Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm,thp: stats for file backed THPSong Liu1-0/+2
In preparation for non-shmem THP, this patch adds a few stats and exposes them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and /proc/<pid>/task/<tid>/smaps. This patch is mostly a rewrite of Kirill A. Shutemov's earlier version: https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/ Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm: memcontrol: flush percpu slab vmstats on kmem offliningRoman Gushchin1-2/+3
I've noticed that the "slab" value in memory.stat is sometimes 0, even if some children memory cgroups have a non-zero "slab" value. The following investigation showed that this is the result of the kmem_cache reparenting in combination with the per-cpu batching of slab vmstats. At the offlining some vmstat value may leave in the percpu cache, not being propagated upwards by the cgroup hierarchy. It means that stats on ancestor levels are lower than actual. Later when slab pages are released, the precise number of pages is substracted on the parent level, making the value negative. We don't show negative values, 0 is printed instead. To fix this issue, let's flush percpu slab memcg and lruvec stats on memcg offlining. This guarantees that numbers on all ancestor levels are accurate and match the actual number of outstanding slab pages. Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18libnvdimm/pfn: stop padding pmem namespaces to section alignmentDan Williams1-0/+3
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE memory, we no longer need to add padding at pfn/dax device creation time. The kernel will still honor padding established by older kernels. Link: http://lkml.kernel.org/r/156092356588.979959.6793371748950931916.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: kill is_dev_zone() helperDan Williams1-12/+0
Given there are no more usages of is_dev_zone() outside of 'ifdef CONFIG_ZONE_DEVICE' protection, kill off the compilation helper. Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm/sparsemem: add helpers track active portions of a section at bootDan Williams1-1/+32
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a sub-section active bitmask, each bit representing a PMD_SIZE span of the architecture's memory hotplug section size. The implications of a partially populated section is that pfn_valid() needs to go beyond a valid_section() check and either determine that the section is an "early section", or read the sub-section active ranges from the bitmask. The expectation is that the bitmask (subsection_map) fits in the same cacheline as the valid_section() / early_section() data, so the incremental performance overhead to pfn_valid() should be negligible. The rationale for using early_section() to short-ciruit the subsection_map check is that there are legacy code paths that use pfn_valid() at section granularity before validating the pfn against pgdat data. So, the early_section() check allows those traditional assumptions to persist while also permitting subsection_map to tell the truth for purposes of populating the unused portions of early sections with PMEM and other ZONE_DEVICE mappings. Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Jane Chu <jane.chu@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm/sparsemem: introduce a SECTION_IS_EARLY flagDan Williams1-1/+7
In preparation for sub-section hotplug, track whether a given section was created during early memory initialization, or later via memory hotplug. This distinction is needed to maintain the coarse expectation that pfn_valid() returns true for any pfn within a given section even if that section has pages that are reserved from the page allocator. For example one of the of goals of subsection hotplug is to support cases where the system physical memory layout collides System RAM and PMEM within a section. Several pfn_valid() users expect to just check if a section is valid, but they are not careful to check if the given pfn is within a "System RAM" boundary and instead expect pgdat information to further validate the pfn. Rather than unwind those paths to make their pfn_valid() queries more precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the traditional expectation that pfn_valid() returns true for all early sections. Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/ Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: David Hildenbrand <david@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm/sparsemem: introduce struct mem_section_usageDan Williams1-3/+25
Patch series "mm: Sub-section memory hotplug support", v10. The memory hotplug section is an arbitrary / convenient unit for memory hotplug. 'Section-size' units have bled into the user interface ('memblock' sysfs) and can not be changed without breaking existing userspace. The section-size constraint, while mostly benign for typical memory hotplug, has and continues to wreak havoc with 'device-memory' use cases, persistent memory (pmem) in particular. Recall that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a 'struct page' memmap for pmem. However, it does not use the 'bottom half' of memory hotplug, i.e. never marks pmem pages online and never exposes the userspace memblock interface for pmem. This leaves an opening to redress the section-size constraint. To date, the libnvdimm subsystem has attempted to inject padding to satisfy the internal constraints of arch_add_memory(). Beyond complicating the code, leading to bugs [2], wasting memory, and limiting configuration flexibility, the padding hack is broken when the platform changes this physical memory alignment of pmem from one boot to the next. Device failure (intermittent or permanent) and physical reconfiguration are events that can cause the platform firmware to change the physical placement of pmem on a subsequent boot, and device failure is an everyday event in a data-center. It turns out that sections are only a hard requirement of the user-facing interface for memory hotplug and with a bit more infrastructure sub-section arch_add_memory() support can be added for kernel internal usages like devm_memremap_pages(). Here is an analysis of the current design assumptions in the current code and how they are addressed in the new implementation: Current design assumptions: - Sections that describe boot memory (early sections) are never unplugged / removed. - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a valid_section() check - __add_pages() and helper routines assume all operations occur in PAGES_PER_SECTION units. - The memblock sysfs interface only comprehends full sections New design assumptions: - Sections are instrumented with a sub-section bitmask to track (on x86) individual 2MB sub-divisions of a 128MB section. - Partially populated early sections can be extended with additional sub-sections, and those sub-sections can be removed with arch_remove_memory(). With this in place we no longer lose usable memory capacity to padding. - pfn_valid() is updated to look deeper than valid_section() to also check the active-sub-section mask. This indication is in the same cacheline as the valid_section() so the performance impact is expected to be negligible. So far the lkp robot has not reported any regressions. - Outside of the core vmemmap population routines which are replaced, other helper routines like shrink_{zone,pgdat}_span() are updated to handle the smaller granularity. Core memory hotplug routines that deal with online memory are not touched. - The existing memblock sysfs user api guarantees / assumptions are not touched since this capability is limited to !online !memblock-sysfs-accessible sections. Meanwhile the issue reports continue to roll in from users that do not understand when and how the 128MB constraint will bite them. The current implementation relied on being able to support at least one misaligned namespace, but that immediately falls over on any moderately complex namespace creation attempt. Beyond the initial problem of 'System RAM' colliding with pmem, and the unsolvable problem of physical alignment changes, Linux is now being exposed to platforms that collide pmem ranges with other pmem ranges by default [3]. In short, devm_memremap_pages() has pushed the venerable section-size constraint past the breaking point, and the simplicity of section-aligned arch_add_memory() is no longer tenable. These patches are exposed to the kbuild robot on a subsection-v10 branch [4], and a preview of the unit test for this functionality is available on the 'subsection-pending' branch of ndctl [5]. [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [3]: https://github.com/pmem/ndctl/issues/76 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10 [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c This patch (of 13): Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section. SUBSECTION_SHIFT is defined as global constant instead of per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of subsection users. Specifically a common subsection size allows for the possibility that persistent memory namespace configurations be made compatible across architectures. The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users. Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to usemap allocation path, there are no expected behavior changes. Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Qian Cai <cai@lca.pw> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: section numbers use the type "unsigned long"David Hildenbrand1-2/+2
Patch series "mm: Further memory block device cleanups", v1. Some further cleanups around memory block devices. Especially, clean up and simplify walk_memory_range(). Including some other minor cleanups. This patch (of 6): We are using a mixture of "int" and "unsigned long". Let's make this consistent by using "unsigned long" everywhere. We'll do the same with memory block ids next. While at it, turn the "unsigned long i" in removable_show() into an int - sections_per_block is an int. [akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/] [david@redhat.com: v3] Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Arun KS <arunks@codeaurora.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: maintain randomization of page free listsDan Williams1-0/+12
When freeing a page with an order >= shuffle_page_order randomly select the front or back of the list for insertion. While the mm tries to defragment physical pages into huge pages this can tend to make the page allocator more predictable over time. Inject the front-back randomness to preserve the initial randomness established by shuffle_free_memory() when the kernel was booted. The overhead of this manipulation is constrained by only being applied for MAX_ORDER sized pages by default. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move buddy list manipulations into helpersDan Williams1-0/+46
In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [dan.j.williams@intel.com: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: shuffle initial free memory to improve memory-side-cache utilizationDan Williams1-0/+1
Patch series "mm: Randomize free memory", v10. This patch (of 3): Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM [3]: https://lkml.org/lkml/2018/10/12/309 [dan.j.williams@intel.com: fix shuffle enable] Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts] Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() APIJohannes Weiner1-5/+0
Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks, group them together. Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: remove zone_lru_lock() function, access ->lru_lock directlyAndrey Ryabinin1-4/+0
We have common pattern to access lru_lock from a page pointer: zone_lru_lock(page_zone(page)) Which is silly, because it unfolds to this: &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock while we can simply do &NODE_DATA(page_to_nid(page))->lru_lock Remove zone_lru_lock() function, since it's only complicate things. Use 'page_pgdat(page)->lru_lock' pattern instead. [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()] Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: fix some typos in mm directoryWei Yang1-1/+1
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm, compaction: be selective about what pageblocks to clear skip hintsMel Gorman1-0/+2
Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08mm, page_alloc: do not wake kswapd with zone lock heldMel Gorman1-0/+6
syzbot reported the following regression in the latest merge window and it was confirmed by Qian Cai that a similar bug was visible from a different context. ====================================================== WARNING: possible circular locking dependency detected 4.20.0+ #297 Not tainted ------------------------------------------------------ syz-executor0/8529 is trying to acquire lock: 000000005e7fb829 (&pgdat->kswapd_wait){....}, at: __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120 but task is already holding lock: 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:329 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk mm/page_alloc.c:2548 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist mm/page_alloc.c:3021 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist mm/page_alloc.c:3050 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue mm/page_alloc.c:3072 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491 It appears to be a false positive in that the only way the lock ordering should be inverted is if kswapd is waking itself and the wakeup allocates debugging objects which should already be allocated if it's kswapd doing the waking. Nevertheless, the possibility exists and so it's best to avoid the problem. This patch flags a zone as needing a kswapd using the, surprisingly, unused zone flag field. The flag is read without the lock held to do the wakeup. It's possible that the flag setting context is not the same as the flag clearing context or for small races to occur. However, each race possibility is harmless and there is no visible degredation in fragmentation treatment. While zone->flag could have continued to be unused, there is potential for moving some existing fields into the flags field instead. Particularly read-mostly ones like zone->initialized and zone->contiguous. Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Qian Cai <cai@lca.pw> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, hotplug: move init_currently_empty_zone() under zone_span_lock protectionWei Yang1-2/+2
During online_pages phase, pgdat->nr_zones will be updated in case this zone is empty. Currently the online_pages phase is protected by the global locks (device_device_hotplug_lock and mem_hotplug_lock), which ensures there is no contention during the update of nr_zones. These global locks introduces scalability issues (especially the second one), which slow down code relying on get_online_mems(). This is also a preparation for not having to rely on get_online_mems() but instead some more fine grained locks. The patch moves init_currently_empty_zone under both zone_span_writelock and pgdat_resize_lock because both the pgdat state is changed (nr_zones) and the zone's start_pfn. Also this patch changes the documentation of node_size_lock to include the protection of nr_zones. Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()Wei Yang1-2/+1
pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init()Wei Yang1-2/+0
When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of each node's highest zone is initialized before defer stage. static_init_pgcnt is used to store the number of pages like this: pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, pgdat->node_spanned_pages); because we don't want to overflow zone's range. But this is not necessary, since defer_init() is called like this: memmap_init_zone() for pfn in [start_pfn, end_pfn) defer_init(pfn, end_pfn) In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would stop before calling defer_init(). BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct, since nr_initialised is zone based instead of node based. Even node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone would have pages less than PAGES_PER_SECTION. Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmzone.c: make "migratetype_names" const char *Alexey Dobriyan1-1/+1
Those strings are immutable in fact. Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: reclaim small amounts of memory when an external fragmentation event occursMel Gorman1-4/+7
An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: move zone watermark accesses behind an accessorMel Gorman1-4/+5
This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: remove managed_page_count_lock spinlockArun KS1-6/+0
Now that totalram_pages and managed_pages are atomic varibles, no need of managed_page_count spinlock. The lock had really a weak consistency guarantee. It hasn't been used for anything but the update but no reader actually cares about all the values being updated to be in sync. Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert zone->managed_pages to atomic variableArun KS1-2/+7
totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-27Merge tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds1-1/+1
Pull powerpc updates from Michael Ellerman: "Notable changes: - Mitigations for Spectre v2 on some Freescale (NXP) CPUs. - A large series adding support for pass-through of Nvidia V100 GPUs to guests on Power9. - Another large series to enable hardware assistance for TLB table walk on MPC8xx CPUs. - Some preparatory changes to our DMA code, to make way for further cleanups from Christoph. - Several fixes for our Transactional Memory handling discovered by fuzzing the signal return path. - Support for generating our system call table(s) from a text file like other architectures. - A fix to our page fault handler so that instead of generating a WARN_ON_ONCE, user accesses of kernel addresses instead print a ratelimited and appropriately scary warning. - A cosmetic change to make our unhandled page fault messages more similar to other arches and also more compact and informative. - Freescale updates from Scott: "Highlights include elimination of legacy clock bindings use from dts files, an 83xx watchdog handler, fixes to old dts interrupt errors, and some minor cleanup." And many clean-ups, reworks and minor fixes etc. Thanks to: Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao, Christian Lamparter, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Darren Stevens, David Gibson, Diana Craciun, Dmitry V. Levin, Firoz Khan, Geert Uytterhoeven, Greg Kurz, Gustavo Romero, Hari Bathini, Joel Stanley, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal Suchánek, Naveen N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Ram Pai, Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen Rothwell, Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian Tang, Yue Haibing" * tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (201 commits) Revert "powerpc/fsl_pci: simplify fsl_pci_dma_set_mask" powerpc/zImage: Also check for stdout-path powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y macintosh: Use of_node_name_{eq, prefix} for node name comparisons ide: Use of_node_name_eq for node name comparisons powerpc: Use of_node_name_eq for node name comparisons powerpc/pseries/pmem: Convert to %pOFn instead of device_node.name powerpc/mm: Remove very old comment in hash-4k.h powerpc/pseries: Fix node leak in update_lmb_associativity_index() powerpc/configs/85xx: Enable CONFIG_DEBUG_KERNEL powerpc/dts/fsl: Fix dtc-flagged interrupt errors clk: qoriq: add more compatibles strings powerpc/fsl: Use new clockgen binding powerpc/83xx: handle machine check caused by watchdog timer powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved" powerpc/fsl_pci: simplify fsl_pci_dma_set_mask arch/powerpc/fsl_rmu: Use dma_zalloc_coherent vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver vfio_pci: Allow regions to add own capabilities vfio_pci: Allow mapping extra regions ...
2018-12-20powerpc: use mm zones more sensiblyChristoph Hellwig1-1/+1
Powerpc has somewhat odd usage where ZONE_DMA is used for all memory on common 64-bit configfs, and ZONE_DMA32 is used for 31-bit schemes. Move to a scheme closer to what other architectures use (and I dare to say the intent of the system): - ZONE_DMA: optionally for memory < 31-bit (64-bit embedded only) - ZONE_NORMAL: everything addressable by the kernel - ZONE_HIGHMEM: memory > 32-bit for 32-bit kernels Also provide information on how ZONE_DMA is used by defining ARCH_ZONE_DMA_BITS. Contains various fixes from Benjamin Herrenschmidt. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-14mm/sparse: add common helper to mark all memblocks presentLogan Gunthorpe1-0/+6
Presently the arches arm64, arm and sh have a function which loops through each memblock and calls memory present. riscv will require a similar function. Introduce a common memblocks_present() function that can be used by all the arches. Subsequent patches will cleanup the arches that make use of this. Link: http://lkml.kernel.org/r/20181107205433.3875-3-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31mm: remove CONFIG_NO_BOOTMEMMike Rapoport1-4/+1
All achitectures select NO_BOOTMEM which essentially becomes 'Y' for any kernel configuration and therefore it can be removed. [alexander.h.duyck@linux.intel.com: remove now defunct NO_BOOTMEM from depends list for deferred init] Link: http://lkml.kernel.org/r/20180925201814.3576.15105.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/1536927045-23536-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: workingset: add vmstat counter for shadow nodesJohannes Weiner1-0/+1
Make it easier to catch bugs in the shadow node shrinker by adding a counter for the shadow nodes in circulation. [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()] [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes] Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: workingset: tell cache transitions from workingset thrashingJohannes Weiner1-0/+1
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: rename and change semantics of nr_indirectly_reclaimable_bytesVlastimil Babka1-1/+1
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with the goal of accounting objects that can be reclaimed, but cannot be allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user is converted. The counter is however still useful for accounting direct page allocations (i.e. not slab) with a shrinker, such as the ION page pool. So keep it, and: - change granularity to pages to be more like other counters; sub-page allocations should be able to use kmalloc - rename the counter to NR_KERNEL_MISC_RECLAIMABLE - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can again remove the check for not printing "hidden" counters Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-09mm, sched/numa: Remove remaining traces of NUMA rate-limitingSrikar Dronamraju1-4/+0
Remove the leftover pglist_data::numabalancing_migrate_lock and its initialization, we stopped using this lock with: efaffc5e40ae ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration") [ mingo: Rewrote the changelog. ] Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migrationMel Gorman1-6/+0
Rate limiting of page migrations due to automatic NUMA balancing was introduced to mitigate the worst-case scenario of migrating at high frequency due to false sharing or slowly ping-ponging between nodes. Since then, a lot of effort was spent on correctly identifying these pages and avoiding unnecessary migrations and the safety net may no longer be required. Jirka Hladky reported a regression in 4.17 due to a scheduler patch that avoids spreading STREAM tasks wide prematurely. However, once the task was properly placed, it delayed migrating the memory due to rate limiting. Increasing the limit fixed the problem for him. Currently, the limit is hard-coded and does not account for the real capabilities of the hardware. Even if an estimate was attempted, it would not properly account for the number of memory controllers and it could not account for the amount of bandwidth used for normal accesses. Rather than fudging, this patch simply eliminates the rate limiting. However, Jirka reports that a STREAM configuration using multiple processes achieved similar performance to 4.16. In local tests, this patch improved performance of STREAM relative to the baseline but it is somewhat machine-dependent. Most workloads show little or not performance difference implying that there is not a heavily reliance on the throttling mechanism and it is safe to remove. STREAM on 2-socket machine 4.19.0-rc5 4.19.0-rc5 numab-v1r1 noratelimit-v1r1 MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%) MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%) MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%) MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24% Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-22mm: access zone->node via zone_to_nid() and zone_set_nid()Pavel Tatashin1-6/+20
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to have inline functions to access this field in order to avoid ifdef's in c files. Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm: remove zone_id() and make use of zone_idx() in is_dev_zone()Oscar Salvador1-19/+12
is_dev_zone() is using zone_id() to check if the zone is ZONE_DEVICE. zone_id() looks pretty much the same as zone_idx(), and while the use of zone_idx() is quite spread in the kernel, zone_id() is only being used by is_dev_zone(). This patch removes zone_id() and makes is_dev_zone() use zone_idx() to check the zone, so we do not have two things with the same functionality around. Link: http://lkml.kernel.org/r/20180730133718.28683-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE requestJoonsoo Kim1-1/+1
Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that important to reserve. When ZONE_MOVABLE is used, this problem would theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE allocation request which is mainly used for page cache and anon page allocation. So, fix it by setting 0 to sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM]. And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size makes code complex. For example, if there is highmem system, following reserve ratio is activated for *NORMAL ZONE* which would be easyily misleading people. #ifdef CONFIG_HIGHMEM 32 #endif This patch also fixes this situation by defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to right place. Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tony Lindgren <tony@atomide.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTESRoman Gushchin1-0/+1
Patch series "indirectly reclaimable memory", v2. This patchset introduces the concept of indirectly reclaimable memory and applies it to fix the issue of when a big number of dentries with external names can significantly affect the MemAvailable value. This patch (of 3): Introduce a concept of indirectly reclaimable memory and adds the corresponding memory counter and /proc/vmstat item. Indirectly reclaimable memory is any sort of memory, used by the kernel (except of reclaimable slabs), which is actually reclaimable, i.e. will be released under memory pressure. The counter is in bytes, as it's not always possible to count such objects in pages. The name contains BYTES by analogy to NR_KERNEL_STACK_KB. Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memoryDavid Rientjes1-1/+2
Kswapd will not wakeup if per-zone watermarks are not failing or if too many previous attempts at background reclaim have failed. This can be true if there is a lot of free memory available. For high- order allocations, kswapd is responsible for waking up kcompactd for background compaction. If the zone is not below its watermarks or reclaim has recently failed (lots of free memory, nothing left to reclaim), kcompactd does not get woken up. When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be woken up even if kswapd will not reclaim. This allows high-order allocations, such as thp, to still trigger background compaction even when the zone has an abundance of free memory. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05mm: disable interrupts while initializing deferred pagesPavel Tatashin1-2/+3
Vlastimil Babka reported about a window issue during which when deferred pages are initialized, and the current version of on-demand initialization is finished, allocations may fail. While this is highly unlikely scenario, since this kind of allocation request must be large, and must come from interrupt handler, we still want to cover it. We solve this by initializing deferred pages with interrupts disabled, and holding node_size_lock spin lock while pages in the node are being initialized. The on-demand deferred page initialization that comes later will use the same lock, and thus synchronize with deferred_init_memmap(). It is unlikely for threads that initialize deferred pages to be interrupted. They run soon after smp_init(), but before modules are initialized, and long before user space programs. This is why there is no adverse effect of having these threads running with interrupts disabled. [pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-27x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logicDavid Rientjes1-5/+0
node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: f03574f2d5b2 ("x86-32, mm: Rip out x86_32 NUMA remapping code") Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803262325540.256524@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-31include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointerPetr Tesarik1-2/+10
The comment is confusing. On the one hand, it refers to 32-bit alignment (struct page alignment on 32-bit platforms), but this would only guarantee that the 2 lowest bits must be zero. On the other hand, it claims that at least 3 bits are available, and 3 bits are actually used. This is not broken, because there is a stronger alignment guarantee, just less obvious. Let's fix the comment to make it clear how many bits are available and why. Although memmap arrays are allocated in various places, the resulting pointer is encoded eventually, so I am adding a BUG_ON() here to enforce at runtime that all expected bits are indeed available. I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is sufficient, because this part of the calculation can be easily checked at build time. [ptesarik@suse.com: v2] Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.cz Signed-off-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemi Wang <kemi.wang@intel.com> Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page_alloc.c: broken deferred calculationPavel Tatashin1-1/+2
In reset_deferred_meminit() we determine number of pages that must not be deferred. We initialize pages for at least 2G of memory, but also pages for reserved memory in this node. The reserved memory is determined in this function: memblock_reserved_memory_within(), which operates over physical addresses, and returns size in bytes. However, reset_deferred_meminit() assumes that that this function operates with pfns, and returns page count. The result is that in the best case machine boots slower than expected due to initializing more pages than needed in single thread, and in the worst case panics because fewer than needed pages are initialized early. Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove unused pgdat->inactive_ratioAndrey Ryabinin1-6/+0
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-07Merge branch 'linus' into x86/asm, to pick up fixes and resolve conflictsIngo Molnar1-0/+1
Conflicts: arch/x86/kernel/cpu/Makefile Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-20mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=yKirill A. Shutemov1-1/+5
Size of the mem_section[] array depends on the size of the physical address space. In preparation for boot-time switching between paging modes on x86-64 we need to make the allocation of mem_section[] dynamic, because otherwise we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB for 4-level paging and 2MB for 5-level paging mode. The patch allocates the array on the first call to sparse_memory_present_with_active_regions(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@suse.de> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-03mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline functionYASUAKI ISHIMATSU1-2/+8
pfn_to_section_nr() and section_nr_to_pfn() are defined as macro. pfn_to_section_nr() has no issue even if it is defined as macro. But section_nr_to_pfn() has overflow issue if sec is defined as int. section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value. But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit value. __remove_section() calculates start_pfn using section_nr_to_pfn() and scn_nr defined as int. So if hot-removed memory address is over 16TB, overflow issue occurs and section_nr_to_pfn() does not calculate correct pfn. To make callers use proper arg, the patch changes the macros to inline functions. Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory") Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08mm: update NUMA counter threshold sizeKemi Wang1-2/+1
There is significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (suggested by Dave Hansen). This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter(suggested by Ying Huang). The rationality is that these statistics counters don't affect the kernel's decision, unlike other VM counters, so it's not a problem to use a large threshold. With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default (base) 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08mm: change the call sites of numa statistics itemsKemi Wang1-8/+17
Patch series "Separate NUMA statistics from zone statistics", v2. Each page allocation updates a set of per-zone statistics with a call to zone_statistics(). As discussed in 2017 MM summit, these are a substantial source of overhead in the page allocator and are very rarely consumed. This significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (pointed out by Dave Hansen). A link to the MM summit slides: http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf To mitigate this overhead, this patchset separates NUMA statistics from zone statistics framework, and update NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter (suggested by Ying Huang). The rationality is that these statistics counters don't need to be read often, unlike other VM counters, so it's not a problem to use a large threshold and make readers more expensive. With this patchset, we see 31.3% drop of CPU cycles(537-->369, see below) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Meanwhile, this patchset keeps the same style of virtual memory statistics with little end-user-visible effects (only move the numa stats to show behind zone page stats, see the first patch for details). I did an experiment of single page allocation and reclaim concurrently using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88 processors with 126G memory) with different size of threshold of pcp counter. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics This patch (of 3): In this patch, NUMA statistics is separated from zone statistics framework, all the call sites of NUMA stats are changed to use numa-stats-specific functions, it does not have any functionality change except that the number of NUMA stats is shown behind zone page stats when users *read* the zone info. E.g. cat /proc/zoneinfo ***Base*** ***With this patch*** nr_free_pages 3976 nr_free_pages 3976 nr_zone_inactive_anon 0 nr_zone_inactive_anon 0 nr_zone_active_anon 0 nr_zone_active_anon 0 nr_zone_inactive_file 0 nr_zone_inactive_file 0 nr_zone_active_file 0 nr_zone_active_file 0 nr_zone_unevictable 0 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_zone_write_pending 0 nr_mlock 0 nr_mlock 0 nr_page_table_pages 0 nr_page_table_pages 0 nr_kernel_stack 0 nr_kernel_stack 0 nr_bounce 0 nr_bounce 0 nr_zspages 0 nr_zspages 0 numa_hit 0 *nr_free_cma 0* numa_miss 0 numa_hit 0 numa_foreign 0 numa_miss 0 numa_interleave 0 numa_foreign 0 numa_local 0 numa_interleave 0 numa_other 0 numa_local 0 *nr_free_cma 0* numa_other 0 ... ... vm stats threshold: 10 vm stats threshold: 10 ... ... The next patch updates the numa stats counter size and threshold. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>