From 0116523cfffa62aeb5aa3b85ce7419f3dae0c1b8 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:29:37 -0800 Subject: kasan, mm: change hooks signatures Patch series "kasan: add software tag-based mode for arm64", v13. This patchset adds a new software tag-based mode to KASAN [1]. (Initially this mode was called KHWASAN, but it got renamed, see the naming rationale at the end of this section). The plan is to implement HWASan [2] for the kernel with the incentive, that it's going to have comparable to KASAN performance, but in the same time consume much less memory, trading that off for somewhat imprecise bug detection and being supported only for arm64. The underlying ideas of the approach used by software tag-based KASAN are: 1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store pointer tags in the top byte of each kernel pointer. 2. Using shadow memory, we can store memory tags for each chunk of kernel memory. 3. On each memory allocation, we can generate a random tag, embed it into the returned pointer and set the memory tags that correspond to this chunk of memory to the same value. 4. By using compiler instrumentation, before each memory access we can add a check that the pointer tag matches the tag of the memory that is being accessed. 5. On a tag mismatch we report an error. With this patchset the existing KASAN mode gets renamed to generic KASAN, with the word "generic" meaning that the implementation can be supported by any architecture as it is purely software. The new mode this patchset adds is called software tag-based KASAN. The word "tag-based" refers to the fact that this mode uses tags embedded into the top byte of kernel pointers and the TBI arm64 CPU feature that allows to dereference such pointers. The word "software" here means that shadow memory manipulation and tag checking on pointer dereference is done in software. As it is the only tag-based implementation right now, "software tag-based" KASAN is sometimes referred to as simply "tag-based" in this patchset. A potential expansion of this mode is a hardware tag-based mode, which would use hardware memory tagging support (announced by Arm [3]) instead of compiler instrumentation and manual shadow memory manipulation. Same as generic KASAN, software tag-based KASAN is strictly a debugging feature. [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a ====== Rationale On mobile devices generic KASAN's memory usage is significant problem. One of the main reasons to have tag-based KASAN is to be able to perform a similar set of checks as the generic one does, but with lower memory requirements. Comment from Vishwath Mohan : I don't have data on-hand, but anecdotally both ASAN and KASAN have proven problematic to enable for environments that don't tolerate the increased memory pressure well. This includes (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go, (c) Connected components like Pixel's visual core [1]. These are both places I'd love to have a low(er) memory footprint option at my disposal. Comment from Evgenii Stepanov : Looking at a live Android device under load, slab (according to /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's overhead of 2x - 3x on top of it is not insignificant. Not having this overhead enables near-production use - ex. running KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do not reproduce in test configuration. These are the ones that often cost the most engineering time to track down. CPU overhead is bad, but generally tolerable. RAM is critical, in our experience. Once it gets low enough, OOM-killer makes your life miserable. [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/ ====== Technical details Software tag-based KASAN mode is implemented in a very similar way to the generic one. This patchset essentially does the following: 1. TCR_TBI1 is set to enable Top Byte Ignore. 2. Shadow memory is used (with a different scale, 1:16, so each shadow byte corresponds to 16 bytes of kernel memory) to store memory tags. 3. All slab objects are aligned to shadow scale, which is 16 bytes. 4. All pointers returned from the slab allocator are tagged with a random tag and the corresponding shadow memory is poisoned with the same value. 5. Compiler instrumentation is used to insert tag checks. Either by calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE flags are reused). 6. When a tag mismatch is detected in callback instrumentation mode KASAN simply prints a bug report. In case of inline instrumentation, clang inserts a brk instruction, and KASAN has it's own brk handler, which reports the bug. 7. The memory in between slab objects is marked with a reserved tag, and acts as a redzone. 8. When a slab object is freed it's marked with a reserved tag. Bug detection is imprecise for two reasons: 1. We won't catch some small out-of-bounds accesses, that fall into the same shadow cell, as the last byte of a slab object. 2. We only have 1 byte to store tags, which means we have a 1/256 probability of a tag match for an incorrect access (actually even slightly less due to reserved tag values). Despite that there's a particular type of bugs that tag-based KASAN can detect compared to generic KASAN: use-after-free after the object has been allocated by someone else. ====== Testing Some kernel developers voiced a concern that changing the top byte of kernel pointers may lead to subtle bugs that are difficult to discover. To address this concern deliberate testing has been performed. It doesn't seem feasible to do some kind of static checking to find potential issues with pointer tagging, so a dynamic approach was taken. All pointer comparisons/subtractions have been instrumented in an LLVM compiler pass and a kernel module that would print a bug report whenever two pointers with different tags are being compared/subtracted (ignoring comparisons with NULL pointers and with pointers obtained by casting an error code to a pointer type) has been used. Then the kernel has been booted in QEMU and on an Odroid C2 board and syzkaller has been run. This yielded the following results. The two places that look interesting are: is_vmalloc_addr in include/linux/mm.h is_kernel_rodata in mm/util.c Here we compare a pointer with some fixed untagged values to make sure that the pointer lies in a particular part of the kernel address space. Since tag-based KASAN doesn't add tags to pointers that belong to rodata or vmalloc regions, this should work as is. To make sure debug checks to those two functions that check that the result doesn't change whether we operate on pointers with or without untagging has been added. A few other cases that don't look that interesting: Comparing pointers to achieve unique sorting order of pointee objects (e.g. sorting locks addresses before performing a double lock): tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c pipe_double_lock in fs/pipe.c unix_state_double_lock in net/unix/af_unix.c lock_two_nondirectories in fs/inode.c mutex_lock_double in kernel/events/core.c ep_cmp_ffd in fs/eventpoll.c fsnotify_compare_groups fs/notify/mark.c Nothing needs to be done here, since the tags embedded into pointers don't change, so the sorting order would still be unique. Checks that a pointer belongs to some particular allocation: is_sibling_entry in lib/radix-tree.c object_is_on_stack in include/linux/sched/task_stack.h Nothing needs to be done here either, since two pointers can only belong to the same allocation if they have the same tag. Overall, since the kernel boots and works, there are no critical bugs. As for the rest, the traditional kernel testing way (use until fails) is the only one that looks feasible. Another point here is that tag-based KASAN is available under a separate config option that needs to be deliberately enabled. Even though it might be used in a "near-production" environment to find bugs that are not found during fuzzing or running tests, it is still a debug tool. ====== Benchmarks The following numbers were collected on Odroid C2 board. Both generic and tag-based KASAN were used in inline instrumentation mode. Boot time [1]: * ~1.7 sec for clean kernel * ~5.0 sec for generic KASAN * ~5.0 sec for tag-based KASAN Network performance [2]: * 8.33 Gbits/sec for clean kernel * 3.17 Gbits/sec for generic KASAN * 2.85 Gbits/sec for tag-based KASAN Slab memory usage after boot [3]: * ~40 kb for clean kernel * ~105 kb (~260% overhead) for generic KASAN * ~47 kb (~20% overhead) for tag-based KASAN KASAN memory overhead consists of three main parts: 1. Increased slab memory usage due to redzones. 2. Shadow memory (the whole reserved once during boot). 3. Quaratine (grows gradually until some preset limit; the more the limit, the more the chance to detect a use-after-free). Comparing tag-based vs generic KASAN for each of these points: 1. 20% vs 260% overhead. 2. 1/16th vs 1/8th of physical memory. 3. Tag-based KASAN doesn't require quarantine. [1] Time before the ext4 driver is initialized. [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`. [3] Measured as `cat /proc/meminfo | grep Slab`. ====== Some notes A few notes: 1. The patchset can be found here: https://github.com/xairy/kasan-prototype/tree/khwasan 2. Building requires a recent Clang version (7.0.0 or later). 3. Stack instrumentation is not supported yet and will be added later. This patch (of 25): Tag-based KASAN changes the value of the top byte of pointers returned from the kernel allocation functions (such as kmalloc). This patch updates KASAN hooks signatures and their usage in SLAB and SLUB code to reflect that. Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 43 +++++++++++++++++++++++++++++-------------- include/linux/slab.h | 4 ++-- 2 files changed, 31 insertions(+), 16 deletions(-) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 46aae129917c..52c86a568a4e 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -51,16 +51,16 @@ void kasan_cache_shutdown(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); void kasan_poison_object_data(struct kmem_cache *cache, void *object); -void kasan_init_slab_obj(struct kmem_cache *cache, const void *object); +void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object); -void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); +void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); void kasan_kfree_large(void *ptr, unsigned long ip); void kasan_poison_kfree(void *ptr, unsigned long ip); -void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, +void *kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, gfp_t flags); -void kasan_krealloc(const void *object, size_t new_size, gfp_t flags); +void *kasan_krealloc(const void *object, size_t new_size, gfp_t flags); -void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); +void *kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip); struct kasan_cache { @@ -105,19 +105,34 @@ static inline void kasan_unpoison_object_data(struct kmem_cache *cache, void *object) {} static inline void kasan_poison_object_data(struct kmem_cache *cache, void *object) {} -static inline void kasan_init_slab_obj(struct kmem_cache *cache, - const void *object) {} +static inline void *kasan_init_slab_obj(struct kmem_cache *cache, + const void *object) +{ + return (void *)object; +} -static inline void kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) {} +static inline void *kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) +{ + return ptr; +} static inline void kasan_kfree_large(void *ptr, unsigned long ip) {} static inline void kasan_poison_kfree(void *ptr, unsigned long ip) {} -static inline void kasan_kmalloc(struct kmem_cache *s, const void *object, - size_t size, gfp_t flags) {} -static inline void kasan_krealloc(const void *object, size_t new_size, - gfp_t flags) {} +static inline void *kasan_kmalloc(struct kmem_cache *s, const void *object, + size_t size, gfp_t flags) +{ + return (void *)object; +} +static inline void *kasan_krealloc(const void *object, size_t new_size, + gfp_t flags) +{ + return (void *)object; +} -static inline void kasan_slab_alloc(struct kmem_cache *s, void *object, - gfp_t flags) {} +static inline void *kasan_slab_alloc(struct kmem_cache *s, void *object, + gfp_t flags) +{ + return object; +} static inline bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip) { diff --git a/include/linux/slab.h b/include/linux/slab.h index 918f374e7156..351ac48dabc4 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -444,7 +444,7 @@ static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s, { void *ret = kmem_cache_alloc(s, flags); - kasan_kmalloc(s, ret, size, flags); + ret = kasan_kmalloc(s, ret, size, flags); return ret; } @@ -455,7 +455,7 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s, { void *ret = kmem_cache_alloc_node(s, gfpflags, node); - kasan_kmalloc(s, ret, size, gfpflags); + ret = kasan_kmalloc(s, ret, size, gfpflags); return ret; } #endif /* CONFIG_TRACING */ -- cgit v1.2.3-59-g8ed1b From 2bd926b439b4cb6b9ed240a9781cd01958b53d85 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:29:53 -0800 Subject: kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS This commit splits the current CONFIG_KASAN config option into two: 1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one that exists now); 2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode. The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have another hardware tag-based KASAN mode, that will rely on hardware memory tagging support in arm64. With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to instrument kernel files with -fsantize=kernel-hwaddress (except the ones for which KASAN_SANITIZE := n is set). Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes. This commit also adds empty placeholder (for now) implementation of tag-based KASAN specific hooks inserted by the compiler and adjusts common hooks implementation. While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will enable once all the infrastracture code has been added. Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compiler-clang.h | 6 ++- include/linux/compiler-gcc.h | 6 +++ include/linux/compiler_attributes.h | 13 ----- include/linux/kasan.h | 16 ++++-- lib/Kconfig.kasan | 98 ++++++++++++++++++++++++++++--------- mm/kasan/Makefile | 6 ++- mm/kasan/generic.c | 2 +- mm/kasan/kasan.h | 3 +- mm/kasan/tags.c | 75 ++++++++++++++++++++++++++++ mm/slub.c | 2 +- scripts/Makefile.kasan | 53 +++++++++++--------- 11 files changed, 214 insertions(+), 66 deletions(-) create mode 100644 mm/kasan/tags.c (limited to 'include') diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index 3e7dafb3ea80..39f668d5066b 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -16,9 +16,13 @@ /* all clang versions usable with the kernel support KASAN ABI version 5 */ #define KASAN_ABI_VERSION 5 +#if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer) /* emulate gcc's __SANITIZE_ADDRESS__ flag */ -#if __has_feature(address_sanitizer) #define __SANITIZE_ADDRESS__ +#define __no_sanitize_address \ + __attribute__((no_sanitize("address", "hwaddress"))) +#else +#define __no_sanitize_address #endif /* diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 2010493e1040..5776da43da97 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -143,6 +143,12 @@ #define KASAN_ABI_VERSION 3 #endif +#if __has_attribute(__no_sanitize_address__) +#define __no_sanitize_address __attribute__((no_sanitize_address)) +#else +#define __no_sanitize_address +#endif + #if GCC_VERSION >= 50100 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 #endif diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h index fe07b680dd4a..19f32b0c29af 100644 --- a/include/linux/compiler_attributes.h +++ b/include/linux/compiler_attributes.h @@ -199,19 +199,6 @@ */ #define __noreturn __attribute__((__noreturn__)) -/* - * Optional: only supported since gcc >= 4.8 - * Optional: not supported by icc - * - * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-no_005fsanitize_005faddress-function-attribute - * clang: https://clang.llvm.org/docs/AttributeReference.html#no-sanitize-address-no-address-safety-analysis - */ -#if __has_attribute(__no_sanitize_address__) -# define __no_sanitize_address __attribute__((__no_sanitize_address__)) -#else -# define __no_sanitize_address -#endif - /* * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#index-packed-type-attribute * clang: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-packed-variable-attribute diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 52c86a568a4e..b66fdf5ea7ab 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -45,8 +45,6 @@ void kasan_free_pages(struct page *page, unsigned int order); void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, slab_flags_t *flags); -void kasan_cache_shrink(struct kmem_cache *cache); -void kasan_cache_shutdown(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); @@ -97,8 +95,6 @@ static inline void kasan_free_pages(struct page *page, unsigned int order) {} static inline void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, slab_flags_t *flags) {} -static inline void kasan_cache_shrink(struct kmem_cache *cache) {} -static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} static inline void kasan_poison_slab(struct page *page) {} static inline void kasan_unpoison_object_data(struct kmem_cache *cache, @@ -155,4 +151,16 @@ static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } #endif /* CONFIG_KASAN */ +#ifdef CONFIG_KASAN_GENERIC + +void kasan_cache_shrink(struct kmem_cache *cache); +void kasan_cache_shutdown(struct kmem_cache *cache); + +#else /* CONFIG_KASAN_GENERIC */ + +static inline void kasan_cache_shrink(struct kmem_cache *cache) {} +static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} + +#endif /* CONFIG_KASAN_GENERIC */ + #endif /* LINUX_KASAN_H */ diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index d0bad1bd9a2b..d8c474b6691e 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -1,36 +1,92 @@ +# This config refers to the generic KASAN mode. config HAVE_ARCH_KASAN bool -if HAVE_ARCH_KASAN +config HAVE_ARCH_KASAN_SW_TAGS + bool + +config CC_HAS_KASAN_GENERIC + def_bool $(cc-option, -fsanitize=kernel-address) + +config CC_HAS_KASAN_SW_TAGS + def_bool $(cc-option, -fsanitize=kernel-hwaddress) config KASAN - bool "KASan: runtime memory debugger" + bool "KASAN: runtime memory debugger" + depends on (HAVE_ARCH_KASAN && CC_HAS_KASAN_GENERIC) || \ + (HAVE_ARCH_KASAN_SW_TAGS && CC_HAS_KASAN_SW_TAGS) + depends on (SLUB && SYSFS) || (SLAB && !DEBUG_SLAB) + help + Enables KASAN (KernelAddressSANitizer) - runtime memory debugger, + designed to find out-of-bounds accesses and use-after-free bugs. + See Documentation/dev-tools/kasan.rst for details. + +choice + prompt "KASAN mode" + depends on KASAN + default KASAN_GENERIC + help + KASAN has two modes: generic KASAN (similar to userspace ASan, + x86_64/arm64/xtensa, enabled with CONFIG_KASAN_GENERIC) and + software tag-based KASAN (a version based on software memory + tagging, arm64 only, similar to userspace HWASan, enabled with + CONFIG_KASAN_SW_TAGS). + Both generic and tag-based KASAN are strictly debugging features. + +config KASAN_GENERIC + bool "Generic mode" + depends on HAVE_ARCH_KASAN && CC_HAS_KASAN_GENERIC depends on (SLUB && SYSFS) || (SLAB && !DEBUG_SLAB) select SLUB_DEBUG if SLUB select CONSTRUCTORS select STACKDEPOT help - Enables kernel address sanitizer - runtime memory debugger, - designed to find out-of-bounds accesses and use-after-free bugs. - This is strictly a debugging feature and it requires a gcc version - of 4.9.2 or later. Detection of out of bounds accesses to stack or - global variables requires gcc 5.0 or later. - This feature consumes about 1/8 of available memory and brings about - ~x3 performance slowdown. + Enables generic KASAN mode. + Supported in both GCC and Clang. With GCC it requires version 4.9.2 + or later for basic support and version 5.0 or later for detection of + out-of-bounds accesses for stack and global variables and for inline + instrumentation mode (CONFIG_KASAN_INLINE). With Clang it requires + version 3.7.0 or later and it doesn't support detection of + out-of-bounds accesses for global variables yet. + This mode consumes about 1/8th of available memory at kernel start + and introduces an overhead of ~x1.5 for the rest of the allocations. + The performance slowdown is ~x3. For better error detection enable CONFIG_STACKTRACE. - Currently CONFIG_KASAN doesn't work with CONFIG_DEBUG_SLAB + Currently CONFIG_KASAN_GENERIC doesn't work with CONFIG_DEBUG_SLAB (the resulting kernel does not boot). +config KASAN_SW_TAGS + bool "Software tag-based mode" + depends on HAVE_ARCH_KASAN_SW_TAGS && CC_HAS_KASAN_SW_TAGS + depends on (SLUB && SYSFS) || (SLAB && !DEBUG_SLAB) + select SLUB_DEBUG if SLUB + select CONSTRUCTORS + select STACKDEPOT + help + Enables software tag-based KASAN mode. + This mode requires Top Byte Ignore support by the CPU and therefore + is only supported for arm64. + This mode requires Clang version 7.0.0 or later. + This mode consumes about 1/16th of available memory at kernel start + and introduces an overhead of ~20% for the rest of the allocations. + This mode may potentially introduce problems relating to pointer + casting and comparison, as it embeds tags into the top byte of each + pointer. + For better error detection enable CONFIG_STACKTRACE. + Currently CONFIG_KASAN_SW_TAGS doesn't work with CONFIG_DEBUG_SLAB + (the resulting kernel does not boot). + +endchoice + config KASAN_EXTRA - bool "KAsan: extra checks" - depends on KASAN && DEBUG_KERNEL && !COMPILE_TEST + bool "KASAN: extra checks" + depends on KASAN_GENERIC && DEBUG_KERNEL && !COMPILE_TEST help - This enables further checks in the kernel address sanitizer, for now - it only includes the address-use-after-scope check that can lead - to excessive kernel stack usage, frame size warnings and longer + This enables further checks in generic KASAN, for now it only + includes the address-use-after-scope check that can lead to + excessive kernel stack usage, frame size warnings and longer compile time. - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715 has more - + See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715 choice prompt "Instrumentation type" @@ -53,7 +109,7 @@ config KASAN_INLINE memory accesses. This is faster than outline (in some workloads it gives about x2 boost over outline instrumentation), but make kernel's .text size much bigger. - This requires a gcc version of 5.0 or later. + For CONFIG_KASAN_GENERIC this requires GCC 5.0 or later. endchoice @@ -67,11 +123,9 @@ config KASAN_S390_4_LEVEL_PAGING 4-level paging instead. config TEST_KASAN - tristate "Module for testing kasan for bug detection" + tristate "Module for testing KASAN for bug detection" depends on m && KASAN help This is a test module doing various nasty things like out of bounds accesses, use after free. It is useful for testing - kernel debugging features like kernel address sanitizer. - -endif + kernel debugging features like KASAN. diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile index d643530b24aa..68ba1822f003 100644 --- a/mm/kasan/Makefile +++ b/mm/kasan/Makefile @@ -2,6 +2,7 @@ KASAN_SANITIZE := n UBSAN_SANITIZE_common.o := n UBSAN_SANITIZE_generic.o := n +UBSAN_SANITIZE_tags.o := n KCOV_INSTRUMENT := n CFLAGS_REMOVE_generic.o = -pg @@ -10,5 +11,8 @@ CFLAGS_REMOVE_generic.o = -pg CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -obj-y := common.o generic.o report.o init.o quarantine.o +obj-$(CONFIG_KASAN) := common.o init.o report.o +obj-$(CONFIG_KASAN_GENERIC) += generic.o quarantine.o +obj-$(CONFIG_KASAN_SW_TAGS) += tags.o diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c index 44ec228de0a2..b8de6d33c55c 100644 --- a/mm/kasan/generic.c +++ b/mm/kasan/generic.c @@ -1,5 +1,5 @@ /* - * This file contains core KASAN code. + * This file contains core generic KASAN code. * * Copyright (c) 2014 Samsung Electronics Co., Ltd. * Author: Andrey Ryabinin diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 659463800f10..19b950eaccff 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -114,7 +114,8 @@ void kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip); void kasan_report_invalid_free(void *object, unsigned long ip); -#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB) +#if defined(CONFIG_KASAN_GENERIC) && \ + (defined(CONFIG_SLAB) || defined(CONFIG_SLUB)) void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache); void quarantine_reduce(void); void quarantine_remove_cache(struct kmem_cache *cache); diff --git a/mm/kasan/tags.c b/mm/kasan/tags.c new file mode 100644 index 000000000000..04194923c543 --- /dev/null +++ b/mm/kasan/tags.c @@ -0,0 +1,75 @@ +/* + * This file contains core tag-based KASAN code. + * + * Copyright (c) 2018 Google, Inc. + * Author: Andrey Konovalov + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#define DISABLE_BRANCH_PROFILING + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kasan.h" +#include "../slab.h" + +void check_memory_region(unsigned long addr, size_t size, bool write, + unsigned long ret_ip) +{ +} + +#define DEFINE_HWASAN_LOAD_STORE(size) \ + void __hwasan_load##size##_noabort(unsigned long addr) \ + { \ + } \ + EXPORT_SYMBOL(__hwasan_load##size##_noabort); \ + void __hwasan_store##size##_noabort(unsigned long addr) \ + { \ + } \ + EXPORT_SYMBOL(__hwasan_store##size##_noabort) + +DEFINE_HWASAN_LOAD_STORE(1); +DEFINE_HWASAN_LOAD_STORE(2); +DEFINE_HWASAN_LOAD_STORE(4); +DEFINE_HWASAN_LOAD_STORE(8); +DEFINE_HWASAN_LOAD_STORE(16); + +void __hwasan_loadN_noabort(unsigned long addr, unsigned long size) +{ +} +EXPORT_SYMBOL(__hwasan_loadN_noabort); + +void __hwasan_storeN_noabort(unsigned long addr, unsigned long size) +{ +} +EXPORT_SYMBOL(__hwasan_storeN_noabort); + +void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size) +{ +} +EXPORT_SYMBOL(__hwasan_tag_memory); diff --git a/mm/slub.c b/mm/slub.c index 8561a32910dd..e739d46600b9 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2992,7 +2992,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page, do_slab_free(s, page, head, tail, cnt, addr); } -#ifdef CONFIG_KASAN +#ifdef CONFIG_KASAN_GENERIC void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr) { do_slab_free(cache, virt_to_head_page(x), x, NULL, 1, addr); diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan index 69552a39951d..25c259df8ffa 100644 --- a/scripts/Makefile.kasan +++ b/scripts/Makefile.kasan @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 -ifdef CONFIG_KASAN +ifdef CONFIG_KASAN_GENERIC + ifdef CONFIG_KASAN_INLINE call_threshold := 10000 else @@ -12,36 +13,44 @@ CFLAGS_KASAN_MINIMAL := -fsanitize=kernel-address cc-param = $(call cc-option, -mllvm -$(1), $(call cc-option, --param $(1))) -ifeq ($(call cc-option, $(CFLAGS_KASAN_MINIMAL) -Werror),) - ifneq ($(CONFIG_COMPILE_TEST),y) - $(warning Cannot use CONFIG_KASAN: \ - -fsanitize=kernel-address is not supported by compiler) - endif -else - # -fasan-shadow-offset fails without -fsanitize - CFLAGS_KASAN_SHADOW := $(call cc-option, -fsanitize=kernel-address \ +# -fasan-shadow-offset fails without -fsanitize +CFLAGS_KASAN_SHADOW := $(call cc-option, -fsanitize=kernel-address \ -fasan-shadow-offset=$(KASAN_SHADOW_OFFSET), \ $(call cc-option, -fsanitize=kernel-address \ -mllvm -asan-mapping-offset=$(KASAN_SHADOW_OFFSET))) - ifeq ($(strip $(CFLAGS_KASAN_SHADOW)),) - CFLAGS_KASAN := $(CFLAGS_KASAN_MINIMAL) - else - # Now add all the compiler specific options that are valid standalone - CFLAGS_KASAN := $(CFLAGS_KASAN_SHADOW) \ - $(call cc-param,asan-globals=1) \ - $(call cc-param,asan-instrumentation-with-call-threshold=$(call_threshold)) \ - $(call cc-param,asan-stack=1) \ - $(call cc-param,asan-use-after-scope=1) \ - $(call cc-param,asan-instrument-allocas=1) - endif - +ifeq ($(strip $(CFLAGS_KASAN_SHADOW)),) + CFLAGS_KASAN := $(CFLAGS_KASAN_MINIMAL) +else + # Now add all the compiler specific options that are valid standalone + CFLAGS_KASAN := $(CFLAGS_KASAN_SHADOW) \ + $(call cc-param,asan-globals=1) \ + $(call cc-param,asan-instrumentation-with-call-threshold=$(call_threshold)) \ + $(call cc-param,asan-stack=1) \ + $(call cc-param,asan-use-after-scope=1) \ + $(call cc-param,asan-instrument-allocas=1) endif ifdef CONFIG_KASAN_EXTRA CFLAGS_KASAN += $(call cc-option, -fsanitize-address-use-after-scope) endif -CFLAGS_KASAN_NOSANITIZE := -fno-builtin +endif # CONFIG_KASAN_GENERIC +ifdef CONFIG_KASAN_SW_TAGS + +ifdef CONFIG_KASAN_INLINE + instrumentation_flags := -mllvm -hwasan-mapping-offset=$(KASAN_SHADOW_OFFSET) +else + instrumentation_flags := -mllvm -hwasan-instrument-with-calls=1 +endif + +CFLAGS_KASAN := -fsanitize=kernel-hwaddress \ + -mllvm -hwasan-instrument-stack=0 \ + $(instrumentation_flags) + +endif # CONFIG_KASAN_SW_TAGS + +ifdef CONFIG_KASAN +CFLAGS_KASAN_NOSANITIZE := -fno-builtin endif -- cgit v1.2.3-59-g8ed1b From 9577dd7486487722ed8f0773243223f108e8089f Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:01 -0800 Subject: kasan: rename kasan_zero_page to kasan_early_shadow_page With tag based KASAN mode the early shadow value is 0xff and not 0x00, so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion. Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Suggested-by: Mark Rutland Cc: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/kasan_init.c | 43 ++++++++++++++------------ arch/s390/mm/dump_pagetables.c | 17 +++++----- arch/s390/mm/kasan_init.c | 33 ++++++++++++-------- arch/x86/mm/dump_pagetables.c | 11 ++++--- arch/x86/mm/kasan_init_64.c | 55 +++++++++++++++++---------------- arch/xtensa/mm/kasan_init.c | 18 ++++++----- include/linux/kasan.h | 12 ++++---- mm/kasan/init.c | 70 ++++++++++++++++++++++++------------------ 8 files changed, 145 insertions(+), 114 deletions(-) (limited to 'include') diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index 63527e585aac..4ebc19422931 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -47,8 +47,9 @@ static pte_t *__init kasan_pte_offset(pmd_t *pmdp, unsigned long addr, int node, bool early) { if (pmd_none(READ_ONCE(*pmdp))) { - phys_addr_t pte_phys = early ? __pa_symbol(kasan_zero_pte) - : kasan_alloc_zeroed_page(node); + phys_addr_t pte_phys = early ? + __pa_symbol(kasan_early_shadow_pte) + : kasan_alloc_zeroed_page(node); __pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE); } @@ -60,8 +61,9 @@ static pmd_t *__init kasan_pmd_offset(pud_t *pudp, unsigned long addr, int node, bool early) { if (pud_none(READ_ONCE(*pudp))) { - phys_addr_t pmd_phys = early ? __pa_symbol(kasan_zero_pmd) - : kasan_alloc_zeroed_page(node); + phys_addr_t pmd_phys = early ? + __pa_symbol(kasan_early_shadow_pmd) + : kasan_alloc_zeroed_page(node); __pud_populate(pudp, pmd_phys, PMD_TYPE_TABLE); } @@ -72,8 +74,9 @@ static pud_t *__init kasan_pud_offset(pgd_t *pgdp, unsigned long addr, int node, bool early) { if (pgd_none(READ_ONCE(*pgdp))) { - phys_addr_t pud_phys = early ? __pa_symbol(kasan_zero_pud) - : kasan_alloc_zeroed_page(node); + phys_addr_t pud_phys = early ? + __pa_symbol(kasan_early_shadow_pud) + : kasan_alloc_zeroed_page(node); __pgd_populate(pgdp, pud_phys, PMD_TYPE_TABLE); } @@ -87,8 +90,9 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr, pte_t *ptep = kasan_pte_offset(pmdp, addr, node, early); do { - phys_addr_t page_phys = early ? __pa_symbol(kasan_zero_page) - : kasan_alloc_zeroed_page(node); + phys_addr_t page_phys = early ? + __pa_symbol(kasan_early_shadow_page) + : kasan_alloc_zeroed_page(node); next = addr + PAGE_SIZE; set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL)); } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))); @@ -205,14 +209,14 @@ void __init kasan_init(void) kasan_map_populate(kimg_shadow_start, kimg_shadow_end, early_pfn_to_nid(virt_to_pfn(lm_alias(_text)))); - kasan_populate_zero_shadow((void *)KASAN_SHADOW_START, - (void *)mod_shadow_start); - kasan_populate_zero_shadow((void *)kimg_shadow_end, - kasan_mem_to_shadow((void *)PAGE_OFFSET)); + kasan_populate_early_shadow((void *)KASAN_SHADOW_START, + (void *)mod_shadow_start); + kasan_populate_early_shadow((void *)kimg_shadow_end, + kasan_mem_to_shadow((void *)PAGE_OFFSET)); if (kimg_shadow_start > mod_shadow_end) - kasan_populate_zero_shadow((void *)mod_shadow_end, - (void *)kimg_shadow_start); + kasan_populate_early_shadow((void *)mod_shadow_end, + (void *)kimg_shadow_start); for_each_memblock(memory, reg) { void *start = (void *)__phys_to_virt(reg->base); @@ -227,14 +231,15 @@ void __init kasan_init(void) } /* - * KAsan may reuse the contents of kasan_zero_pte directly, so we - * should make sure that it maps the zero page read-only. + * KAsan may reuse the contents of kasan_early_shadow_pte directly, + * so we should make sure that it maps the zero page read-only. */ for (i = 0; i < PTRS_PER_PTE; i++) - set_pte(&kasan_zero_pte[i], - pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO)); + set_pte(&kasan_early_shadow_pte[i], + pfn_pte(sym_to_pfn(kasan_early_shadow_page), + PAGE_KERNEL_RO)); - memset(kasan_zero_page, 0, PAGE_SIZE); + memset(kasan_early_shadow_page, 0, PAGE_SIZE); cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); /* At this point kasan is fully initialized. Enable error messages */ diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c index 363f6470d742..3b93ba0b5d8d 100644 --- a/arch/s390/mm/dump_pagetables.c +++ b/arch/s390/mm/dump_pagetables.c @@ -111,11 +111,12 @@ static void note_page(struct seq_file *m, struct pg_state *st, } #ifdef CONFIG_KASAN -static void note_kasan_zero_page(struct seq_file *m, struct pg_state *st) +static void note_kasan_early_shadow_page(struct seq_file *m, + struct pg_state *st) { unsigned int prot; - prot = pte_val(*kasan_zero_pte) & + prot = pte_val(*kasan_early_shadow_pte) & (_PAGE_PROTECT | _PAGE_INVALID | _PAGE_NOEXEC); note_page(m, st, prot, 4); } @@ -154,8 +155,8 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, int i; #ifdef CONFIG_KASAN - if ((pud_val(*pud) & PAGE_MASK) == __pa(kasan_zero_pmd)) { - note_kasan_zero_page(m, st); + if ((pud_val(*pud) & PAGE_MASK) == __pa(kasan_early_shadow_pmd)) { + note_kasan_early_shadow_page(m, st); return; } #endif @@ -185,8 +186,8 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, int i; #ifdef CONFIG_KASAN - if ((p4d_val(*p4d) & PAGE_MASK) == __pa(kasan_zero_pud)) { - note_kasan_zero_page(m, st); + if ((p4d_val(*p4d) & PAGE_MASK) == __pa(kasan_early_shadow_pud)) { + note_kasan_early_shadow_page(m, st); return; } #endif @@ -215,8 +216,8 @@ static void walk_p4d_level(struct seq_file *m, struct pg_state *st, int i; #ifdef CONFIG_KASAN - if ((pgd_val(*pgd) & PAGE_MASK) == __pa(kasan_zero_p4d)) { - note_kasan_zero_page(m, st); + if ((pgd_val(*pgd) & PAGE_MASK) == __pa(kasan_early_shadow_p4d)) { + note_kasan_early_shadow_page(m, st); return; } #endif diff --git a/arch/s390/mm/kasan_init.c b/arch/s390/mm/kasan_init.c index acb9645b762b..bac5c27d11fc 100644 --- a/arch/s390/mm/kasan_init.c +++ b/arch/s390/mm/kasan_init.c @@ -107,7 +107,8 @@ static void __init kasan_early_vmemmap_populate(unsigned long address, if (mode == POPULATE_ZERO_SHADOW && IS_ALIGNED(address, PGDIR_SIZE) && end - address >= PGDIR_SIZE) { - pgd_populate(&init_mm, pg_dir, kasan_zero_p4d); + pgd_populate(&init_mm, pg_dir, + kasan_early_shadow_p4d); address = (address + PGDIR_SIZE) & PGDIR_MASK; continue; } @@ -120,7 +121,8 @@ static void __init kasan_early_vmemmap_populate(unsigned long address, if (mode == POPULATE_ZERO_SHADOW && IS_ALIGNED(address, P4D_SIZE) && end - address >= P4D_SIZE) { - p4d_populate(&init_mm, p4_dir, kasan_zero_pud); + p4d_populate(&init_mm, p4_dir, + kasan_early_shadow_pud); address = (address + P4D_SIZE) & P4D_MASK; continue; } @@ -133,7 +135,8 @@ static void __init kasan_early_vmemmap_populate(unsigned long address, if (mode == POPULATE_ZERO_SHADOW && IS_ALIGNED(address, PUD_SIZE) && end - address >= PUD_SIZE) { - pud_populate(&init_mm, pu_dir, kasan_zero_pmd); + pud_populate(&init_mm, pu_dir, + kasan_early_shadow_pmd); address = (address + PUD_SIZE) & PUD_MASK; continue; } @@ -146,7 +149,8 @@ static void __init kasan_early_vmemmap_populate(unsigned long address, if (mode == POPULATE_ZERO_SHADOW && IS_ALIGNED(address, PMD_SIZE) && end - address >= PMD_SIZE) { - pmd_populate(&init_mm, pm_dir, kasan_zero_pte); + pmd_populate(&init_mm, pm_dir, + kasan_early_shadow_pte); address = (address + PMD_SIZE) & PMD_MASK; continue; } @@ -188,7 +192,7 @@ static void __init kasan_early_vmemmap_populate(unsigned long address, pte_val(*pt_dir) = __pa(page) | pgt_prot; break; case POPULATE_ZERO_SHADOW: - page = kasan_zero_page; + page = kasan_early_shadow_page; pte_val(*pt_dir) = __pa(page) | pgt_prot_zero; break; } @@ -256,14 +260,14 @@ void __init kasan_early_init(void) unsigned long vmax; unsigned long pgt_prot = pgprot_val(PAGE_KERNEL_RO); pte_t pte_z; - pmd_t pmd_z = __pmd(__pa(kasan_zero_pte) | _SEGMENT_ENTRY); - pud_t pud_z = __pud(__pa(kasan_zero_pmd) | _REGION3_ENTRY); - p4d_t p4d_z = __p4d(__pa(kasan_zero_pud) | _REGION2_ENTRY); + pmd_t pmd_z = __pmd(__pa(kasan_early_shadow_pte) | _SEGMENT_ENTRY); + pud_t pud_z = __pud(__pa(kasan_early_shadow_pmd) | _REGION3_ENTRY); + p4d_t p4d_z = __p4d(__pa(kasan_early_shadow_pud) | _REGION2_ENTRY); kasan_early_detect_facilities(); if (!has_nx) pgt_prot &= ~_PAGE_NOEXEC; - pte_z = __pte(__pa(kasan_zero_page) | pgt_prot); + pte_z = __pte(__pa(kasan_early_shadow_page) | pgt_prot); memsize = get_mem_detect_end(); if (!memsize) @@ -292,10 +296,13 @@ void __init kasan_early_init(void) } /* init kasan zero shadow */ - crst_table_init((unsigned long *)kasan_zero_p4d, p4d_val(p4d_z)); - crst_table_init((unsigned long *)kasan_zero_pud, pud_val(pud_z)); - crst_table_init((unsigned long *)kasan_zero_pmd, pmd_val(pmd_z)); - memset64((u64 *)kasan_zero_pte, pte_val(pte_z), PTRS_PER_PTE); + crst_table_init((unsigned long *)kasan_early_shadow_p4d, + p4d_val(p4d_z)); + crst_table_init((unsigned long *)kasan_early_shadow_pud, + pud_val(pud_z)); + crst_table_init((unsigned long *)kasan_early_shadow_pmd, + pmd_val(pmd_z)); + memset64((u64 *)kasan_early_shadow_pte, pte_val(pte_z), PTRS_PER_PTE); shadow_alloc_size = memsize >> KASAN_SHADOW_SCALE_SHIFT; pgalloc_low = round_up((unsigned long)_end, _SEGMENT_SIZE); diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index abcb8d00b014..e3cdc85ce5b6 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -377,7 +377,7 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr, /* * This is an optimization for KASAN=y case. Since all kasan page tables - * eventually point to the kasan_zero_page we could call note_page() + * eventually point to the kasan_early_shadow_page we could call note_page() * right away without walking through lower level page tables. This saves * us dozens of seconds (minutes for 5-level config) while checking for * W+X mapping or reading kernel_page_tables debugfs file. @@ -385,10 +385,11 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr, static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st, void *pt) { - if (__pa(pt) == __pa(kasan_zero_pmd) || - (pgtable_l5_enabled() && __pa(pt) == __pa(kasan_zero_p4d)) || - __pa(pt) == __pa(kasan_zero_pud)) { - pgprotval_t prot = pte_flags(kasan_zero_pte[0]); + if (__pa(pt) == __pa(kasan_early_shadow_pmd) || + (pgtable_l5_enabled() && + __pa(pt) == __pa(kasan_early_shadow_p4d)) || + __pa(pt) == __pa(kasan_early_shadow_pud)) { + pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]); note_page(m, st, __pgprot(prot), 0, 5); return true; } diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 04a9cf6b034f..462fde83b515 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -211,7 +211,8 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd, unsigned long next; if (pgd_none(*pgd)) { - pgd_entry = __pgd(_KERNPG_TABLE | __pa_nodebug(kasan_zero_p4d)); + pgd_entry = __pgd(_KERNPG_TABLE | + __pa_nodebug(kasan_early_shadow_p4d)); set_pgd(pgd, pgd_entry); } @@ -222,7 +223,8 @@ static void __init kasan_early_p4d_populate(pgd_t *pgd, if (!p4d_none(*p4d)) continue; - p4d_entry = __p4d(_KERNPG_TABLE | __pa_nodebug(kasan_zero_pud)); + p4d_entry = __p4d(_KERNPG_TABLE | + __pa_nodebug(kasan_early_shadow_pud)); set_p4d(p4d, p4d_entry); } while (p4d++, addr = next, addr != end && p4d_none(*p4d)); } @@ -261,10 +263,11 @@ static struct notifier_block kasan_die_notifier = { void __init kasan_early_init(void) { int i; - pteval_t pte_val = __pa_nodebug(kasan_zero_page) | __PAGE_KERNEL | _PAGE_ENC; - pmdval_t pmd_val = __pa_nodebug(kasan_zero_pte) | _KERNPG_TABLE; - pudval_t pud_val = __pa_nodebug(kasan_zero_pmd) | _KERNPG_TABLE; - p4dval_t p4d_val = __pa_nodebug(kasan_zero_pud) | _KERNPG_TABLE; + pteval_t pte_val = __pa_nodebug(kasan_early_shadow_page) | + __PAGE_KERNEL | _PAGE_ENC; + pmdval_t pmd_val = __pa_nodebug(kasan_early_shadow_pte) | _KERNPG_TABLE; + pudval_t pud_val = __pa_nodebug(kasan_early_shadow_pmd) | _KERNPG_TABLE; + p4dval_t p4d_val = __pa_nodebug(kasan_early_shadow_pud) | _KERNPG_TABLE; /* Mask out unsupported __PAGE_KERNEL bits: */ pte_val &= __default_kernel_pte_mask; @@ -273,16 +276,16 @@ void __init kasan_early_init(void) p4d_val &= __default_kernel_pte_mask; for (i = 0; i < PTRS_PER_PTE; i++) - kasan_zero_pte[i] = __pte(pte_val); + kasan_early_shadow_pte[i] = __pte(pte_val); for (i = 0; i < PTRS_PER_PMD; i++) - kasan_zero_pmd[i] = __pmd(pmd_val); + kasan_early_shadow_pmd[i] = __pmd(pmd_val); for (i = 0; i < PTRS_PER_PUD; i++) - kasan_zero_pud[i] = __pud(pud_val); + kasan_early_shadow_pud[i] = __pud(pud_val); for (i = 0; pgtable_l5_enabled() && i < PTRS_PER_P4D; i++) - kasan_zero_p4d[i] = __p4d(p4d_val); + kasan_early_shadow_p4d[i] = __p4d(p4d_val); kasan_map_early_shadow(early_top_pgt); kasan_map_early_shadow(init_top_pgt); @@ -326,7 +329,7 @@ void __init kasan_init(void) clear_pgds(KASAN_SHADOW_START & PGDIR_MASK, KASAN_SHADOW_END); - kasan_populate_zero_shadow((void *)(KASAN_SHADOW_START & PGDIR_MASK), + kasan_populate_early_shadow((void *)(KASAN_SHADOW_START & PGDIR_MASK), kasan_mem_to_shadow((void *)PAGE_OFFSET)); for (i = 0; i < E820_MAX_ENTRIES; i++) { @@ -338,41 +341,41 @@ void __init kasan_init(void) shadow_cpu_entry_begin = (void *)CPU_ENTRY_AREA_BASE; shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin); - shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin, - PAGE_SIZE); + shadow_cpu_entry_begin = (void *)round_down( + (unsigned long)shadow_cpu_entry_begin, PAGE_SIZE); shadow_cpu_entry_end = (void *)(CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE); shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end); - shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end, - PAGE_SIZE); + shadow_cpu_entry_end = (void *)round_up( + (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); - kasan_populate_zero_shadow( + kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), shadow_cpu_entry_begin); kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin, (unsigned long)shadow_cpu_entry_end, 0); - kasan_populate_zero_shadow(shadow_cpu_entry_end, - kasan_mem_to_shadow((void *)__START_KERNEL_map)); + kasan_populate_early_shadow(shadow_cpu_entry_end, + kasan_mem_to_shadow((void *)__START_KERNEL_map)); kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext), (unsigned long)kasan_mem_to_shadow(_end), early_pfn_to_nid(__pa(_stext))); - kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END), - (void *)KASAN_SHADOW_END); + kasan_populate_early_shadow(kasan_mem_to_shadow((void *)MODULES_END), + (void *)KASAN_SHADOW_END); load_cr3(init_top_pgt); __flush_tlb_all(); /* - * kasan_zero_page has been used as early shadow memory, thus it may - * contain some garbage. Now we can clear and write protect it, since - * after the TLB flush no one should write to it. + * kasan_early_shadow_page has been used as early shadow memory, thus + * it may contain some garbage. Now we can clear and write protect it, + * since after the TLB flush no one should write to it. */ - memset(kasan_zero_page, 0, PAGE_SIZE); + memset(kasan_early_shadow_page, 0, PAGE_SIZE); for (i = 0; i < PTRS_PER_PTE; i++) { pte_t pte; pgprot_t prot; @@ -380,8 +383,8 @@ void __init kasan_init(void) prot = __pgprot(__PAGE_KERNEL_RO | _PAGE_ENC); pgprot_val(prot) &= __default_kernel_pte_mask; - pte = __pte(__pa(kasan_zero_page) | pgprot_val(prot)); - set_pte(&kasan_zero_pte[i], pte); + pte = __pte(__pa(kasan_early_shadow_page) | pgprot_val(prot)); + set_pte(&kasan_early_shadow_pte[i], pte); } /* Flush TLBs again to be sure that write protection applied. */ __flush_tlb_all(); diff --git a/arch/xtensa/mm/kasan_init.c b/arch/xtensa/mm/kasan_init.c index 6b95ca43aec0..1734cda6bc4a 100644 --- a/arch/xtensa/mm/kasan_init.c +++ b/arch/xtensa/mm/kasan_init.c @@ -24,12 +24,13 @@ void __init kasan_early_init(void) int i; for (i = 0; i < PTRS_PER_PTE; ++i) - set_pte(kasan_zero_pte + i, - mk_pte(virt_to_page(kasan_zero_page), PAGE_KERNEL)); + set_pte(kasan_early_shadow_pte + i, + mk_pte(virt_to_page(kasan_early_shadow_page), + PAGE_KERNEL)); for (vaddr = 0; vaddr < KASAN_SHADOW_SIZE; vaddr += PMD_SIZE, ++pmd) { BUG_ON(!pmd_none(*pmd)); - set_pmd(pmd, __pmd((unsigned long)kasan_zero_pte)); + set_pmd(pmd, __pmd((unsigned long)kasan_early_shadow_pte)); } early_trap_init(); } @@ -80,13 +81,16 @@ void __init kasan_init(void) populate(kasan_mem_to_shadow((void *)VMALLOC_START), kasan_mem_to_shadow((void *)XCHAL_KSEG_BYPASS_VADDR)); - /* Write protect kasan_zero_page and zero-initialize it again. */ + /* + * Write protect kasan_early_shadow_page and zero-initialize it again. + */ for (i = 0; i < PTRS_PER_PTE; ++i) - set_pte(kasan_zero_pte + i, - mk_pte(virt_to_page(kasan_zero_page), PAGE_KERNEL_RO)); + set_pte(kasan_early_shadow_pte + i, + mk_pte(virt_to_page(kasan_early_shadow_page), + PAGE_KERNEL_RO)); local_flush_tlb_all(); - memset(kasan_zero_page, 0, PAGE_SIZE); + memset(kasan_early_shadow_page, 0, PAGE_SIZE); /* At this point kasan is fully initialized. Enable error messages. */ current->kasan_depth = 0; diff --git a/include/linux/kasan.h b/include/linux/kasan.h index b66fdf5ea7ab..ec22d548d0d7 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -14,13 +14,13 @@ struct task_struct; #include #include -extern unsigned char kasan_zero_page[PAGE_SIZE]; -extern pte_t kasan_zero_pte[PTRS_PER_PTE]; -extern pmd_t kasan_zero_pmd[PTRS_PER_PMD]; -extern pud_t kasan_zero_pud[PTRS_PER_PUD]; -extern p4d_t kasan_zero_p4d[MAX_PTRS_PER_P4D]; +extern unsigned char kasan_early_shadow_page[PAGE_SIZE]; +extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE]; +extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD]; +extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD]; +extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D]; -int kasan_populate_zero_shadow(const void *shadow_start, +int kasan_populate_early_shadow(const void *shadow_start, const void *shadow_end); static inline void *kasan_mem_to_shadow(const void *addr) diff --git a/mm/kasan/init.c b/mm/kasan/init.c index c7550eb65922..2b21d3717d62 100644 --- a/mm/kasan/init.c +++ b/mm/kasan/init.c @@ -30,13 +30,13 @@ * - Latter it reused it as zero shadow to cover large ranges of memory * that allowed to access, but not handled by kasan (vmalloc/vmemmap ...). */ -unsigned char kasan_zero_page[PAGE_SIZE] __page_aligned_bss; +unsigned char kasan_early_shadow_page[PAGE_SIZE] __page_aligned_bss; #if CONFIG_PGTABLE_LEVELS > 4 -p4d_t kasan_zero_p4d[MAX_PTRS_PER_P4D] __page_aligned_bss; +p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D] __page_aligned_bss; static inline bool kasan_p4d_table(pgd_t pgd) { - return pgd_page(pgd) == virt_to_page(lm_alias(kasan_zero_p4d)); + return pgd_page(pgd) == virt_to_page(lm_alias(kasan_early_shadow_p4d)); } #else static inline bool kasan_p4d_table(pgd_t pgd) @@ -45,10 +45,10 @@ static inline bool kasan_p4d_table(pgd_t pgd) } #endif #if CONFIG_PGTABLE_LEVELS > 3 -pud_t kasan_zero_pud[PTRS_PER_PUD] __page_aligned_bss; +pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss; static inline bool kasan_pud_table(p4d_t p4d) { - return p4d_page(p4d) == virt_to_page(lm_alias(kasan_zero_pud)); + return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud)); } #else static inline bool kasan_pud_table(p4d_t p4d) @@ -57,10 +57,10 @@ static inline bool kasan_pud_table(p4d_t p4d) } #endif #if CONFIG_PGTABLE_LEVELS > 2 -pmd_t kasan_zero_pmd[PTRS_PER_PMD] __page_aligned_bss; +pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss; static inline bool kasan_pmd_table(pud_t pud) { - return pud_page(pud) == virt_to_page(lm_alias(kasan_zero_pmd)); + return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd)); } #else static inline bool kasan_pmd_table(pud_t pud) @@ -68,16 +68,16 @@ static inline bool kasan_pmd_table(pud_t pud) return 0; } #endif -pte_t kasan_zero_pte[PTRS_PER_PTE] __page_aligned_bss; +pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss; static inline bool kasan_pte_table(pmd_t pmd) { - return pmd_page(pmd) == virt_to_page(lm_alias(kasan_zero_pte)); + return pmd_page(pmd) == virt_to_page(lm_alias(kasan_early_shadow_pte)); } -static inline bool kasan_zero_page_entry(pte_t pte) +static inline bool kasan_early_shadow_page_entry(pte_t pte) { - return pte_page(pte) == virt_to_page(lm_alias(kasan_zero_page)); + return pte_page(pte) == virt_to_page(lm_alias(kasan_early_shadow_page)); } static __init void *early_alloc(size_t size, int node) @@ -92,7 +92,8 @@ static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr, pte_t *pte = pte_offset_kernel(pmd, addr); pte_t zero_pte; - zero_pte = pfn_pte(PFN_DOWN(__pa_symbol(kasan_zero_page)), PAGE_KERNEL); + zero_pte = pfn_pte(PFN_DOWN(__pa_symbol(kasan_early_shadow_page)), + PAGE_KERNEL); zero_pte = pte_wrprotect(zero_pte); while (addr + PAGE_SIZE <= end) { @@ -112,7 +113,8 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr, next = pmd_addr_end(addr, end); if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) { - pmd_populate_kernel(&init_mm, pmd, lm_alias(kasan_zero_pte)); + pmd_populate_kernel(&init_mm, pmd, + lm_alias(kasan_early_shadow_pte)); continue; } @@ -145,9 +147,11 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr, if (IS_ALIGNED(addr, PUD_SIZE) && end - addr >= PUD_SIZE) { pmd_t *pmd; - pud_populate(&init_mm, pud, lm_alias(kasan_zero_pmd)); + pud_populate(&init_mm, pud, + lm_alias(kasan_early_shadow_pmd)); pmd = pmd_offset(pud, addr); - pmd_populate_kernel(&init_mm, pmd, lm_alias(kasan_zero_pte)); + pmd_populate_kernel(&init_mm, pmd, + lm_alias(kasan_early_shadow_pte)); continue; } @@ -181,12 +185,14 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr, pud_t *pud; pmd_t *pmd; - p4d_populate(&init_mm, p4d, lm_alias(kasan_zero_pud)); + p4d_populate(&init_mm, p4d, + lm_alias(kasan_early_shadow_pud)); pud = pud_offset(p4d, addr); - pud_populate(&init_mm, pud, lm_alias(kasan_zero_pmd)); + pud_populate(&init_mm, pud, + lm_alias(kasan_early_shadow_pmd)); pmd = pmd_offset(pud, addr); pmd_populate_kernel(&init_mm, pmd, - lm_alias(kasan_zero_pte)); + lm_alias(kasan_early_shadow_pte)); continue; } @@ -209,13 +215,13 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr, } /** - * kasan_populate_zero_shadow - populate shadow memory region with - * kasan_zero_page + * kasan_populate_early_shadow - populate shadow memory region with + * kasan_early_shadow_page * @shadow_start - start of the memory range to populate * @shadow_end - end of the memory range to populate */ -int __ref kasan_populate_zero_shadow(const void *shadow_start, - const void *shadow_end) +int __ref kasan_populate_early_shadow(const void *shadow_start, + const void *shadow_end) { unsigned long addr = (unsigned long)shadow_start; unsigned long end = (unsigned long)shadow_end; @@ -231,7 +237,7 @@ int __ref kasan_populate_zero_shadow(const void *shadow_start, pmd_t *pmd; /* - * kasan_zero_pud should be populated with pmds + * kasan_early_shadow_pud should be populated with pmds * at this moment. * [pud,pmd]_populate*() below needed only for * 3,2 - level page tables where we don't have @@ -241,21 +247,25 @@ int __ref kasan_populate_zero_shadow(const void *shadow_start, * The ifndef is required to avoid build breakage. * * With 5level-fixup.h, pgd_populate() is not nop and - * we reference kasan_zero_p4d. It's not defined + * we reference kasan_early_shadow_p4d. It's not defined * unless 5-level paging enabled. * * The ifndef can be dropped once all KASAN-enabled * architectures will switch to pgtable-nop4d.h. */ #ifndef __ARCH_HAS_5LEVEL_HACK - pgd_populate(&init_mm, pgd, lm_alias(kasan_zero_p4d)); + pgd_populate(&init_mm, pgd, + lm_alias(kasan_early_shadow_p4d)); #endif p4d = p4d_offset(pgd, addr); - p4d_populate(&init_mm, p4d, lm_alias(kasan_zero_pud)); + p4d_populate(&init_mm, p4d, + lm_alias(kasan_early_shadow_pud)); pud = pud_offset(p4d, addr); - pud_populate(&init_mm, pud, lm_alias(kasan_zero_pmd)); + pud_populate(&init_mm, pud, + lm_alias(kasan_early_shadow_pmd)); pmd = pmd_offset(pud, addr); - pmd_populate_kernel(&init_mm, pmd, lm_alias(kasan_zero_pte)); + pmd_populate_kernel(&init_mm, pmd, + lm_alias(kasan_early_shadow_pte)); continue; } @@ -350,7 +360,7 @@ static void kasan_remove_pte_table(pte_t *pte, unsigned long addr, if (!pte_present(*pte)) continue; - if (WARN_ON(!kasan_zero_page_entry(*pte))) + if (WARN_ON(!kasan_early_shadow_page_entry(*pte))) continue; pte_clear(&init_mm, addr, pte); } @@ -480,7 +490,7 @@ int kasan_add_zero_shadow(void *start, unsigned long size) WARN_ON(size % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE))) return -EINVAL; - ret = kasan_populate_zero_shadow(shadow_start, shadow_end); + ret = kasan_populate_early_shadow(shadow_start, shadow_end); if (ret) kasan_remove_zero_shadow(shadow_start, size >> KASAN_SHADOW_SCALE_SHIFT); -- cgit v1.2.3-59-g8ed1b From 080eb83f54cf5b96ae5b6ce3c1896e35c341aff9 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:09 -0800 Subject: kasan: initialize shadow to 0xff for tag-based mode A tag-based KASAN shadow memory cell contains a memory tag, that corresponds to the tag in the top byte of the pointer, that points to that memory. The native top byte value of kernel pointers is 0xff, so with tag-based KASAN we need to initialize shadow memory to 0xff. [cai@lca.pw: arm64: skip kmemleak for KASAN again\ Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/kasan_init.c | 14 ++++++++++++-- include/linux/kasan.h | 8 ++++++++ mm/kasan/common.c | 3 ++- 3 files changed, 22 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index 4ebc19422931..38fa4bba9279 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -43,6 +43,14 @@ static phys_addr_t __init kasan_alloc_zeroed_page(int node) return __pa(p); } +static phys_addr_t __init kasan_alloc_raw_page(int node) +{ + void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE, + __pa(MAX_DMA_ADDRESS), + MEMBLOCK_ALLOC_KASAN, node); + return __pa(p); +} + static pte_t *__init kasan_pte_offset(pmd_t *pmdp, unsigned long addr, int node, bool early) { @@ -92,7 +100,9 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr, do { phys_addr_t page_phys = early ? __pa_symbol(kasan_early_shadow_page) - : kasan_alloc_zeroed_page(node); + : kasan_alloc_raw_page(node); + if (!early) + memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE); next = addr + PAGE_SIZE; set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL)); } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))); @@ -239,7 +249,7 @@ void __init kasan_init(void) pfn_pte(sym_to_pfn(kasan_early_shadow_page), PAGE_KERNEL_RO)); - memset(kasan_early_shadow_page, 0, PAGE_SIZE); + memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE); cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); /* At this point kasan is fully initialized. Enable error messages */ diff --git a/include/linux/kasan.h b/include/linux/kasan.h index ec22d548d0d7..c56af24bd3e7 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -153,6 +153,8 @@ static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } #ifdef CONFIG_KASAN_GENERIC +#define KASAN_SHADOW_INIT 0 + void kasan_cache_shrink(struct kmem_cache *cache); void kasan_cache_shutdown(struct kmem_cache *cache); @@ -163,4 +165,10 @@ static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} #endif /* CONFIG_KASAN_GENERIC */ +#ifdef CONFIG_KASAN_SW_TAGS + +#define KASAN_SHADOW_INIT 0xFF + +#endif /* CONFIG_KASAN_SW_TAGS */ + #endif /* LINUX_KASAN_H */ diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 5f68c93734ba..7134e75447ff 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -473,11 +473,12 @@ int kasan_module_alloc(void *addr, size_t size) ret = __vmalloc_node_range(shadow_size, 1, shadow_start, shadow_start + shadow_size, - GFP_KERNEL | __GFP_ZERO, + GFP_KERNEL, PAGE_KERNEL, VM_NO_GUARD, NUMA_NO_NODE, __builtin_return_address(0)); if (ret) { + __memset(ret, KASAN_SHADOW_INIT, shadow_size); find_vm_area(addr)->flags |= VM_KASAN; kmemleak_ignore(ret); return 0; -- cgit v1.2.3-59-g8ed1b From 3c9e3aa11094e821aff4a8f6812a6e032293dbc0 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:16 -0800 Subject: kasan: add tag related helper functions This commit adds a few helper functions, that are meant to be used to work with tags embedded in the top byte of kernel pointers: to set, to get or to reset the top byte. Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/asm/kasan.h | 8 ++++++-- arch/arm64/include/asm/memory.h | 12 ++++++++++++ arch/arm64/mm/kasan_init.c | 2 ++ include/linux/kasan.h | 13 +++++++++++++ mm/kasan/kasan.h | 31 +++++++++++++++++++++++++++++++ mm/kasan/tags.c | 37 +++++++++++++++++++++++++++++++++++++ 6 files changed, 101 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/arch/arm64/include/asm/kasan.h b/arch/arm64/include/asm/kasan.h index 8758bb008436..b52aacd2c526 100644 --- a/arch/arm64/include/asm/kasan.h +++ b/arch/arm64/include/asm/kasan.h @@ -4,12 +4,16 @@ #ifndef __ASSEMBLY__ -#ifdef CONFIG_KASAN - #include #include #include +#define arch_kasan_set_tag(addr, tag) __tag_set(addr, tag) +#define arch_kasan_reset_tag(addr) __tag_reset(addr) +#define arch_kasan_get_tag(addr) __tag_get(addr) + +#ifdef CONFIG_KASAN + /* * KASAN_SHADOW_START: beginning of the kernel virtual addresses. * KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses, diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index e73bb89d6141..25b46f88726c 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -226,6 +226,18 @@ extern u64 vabits_user; #define untagged_addr(addr) \ ((__typeof__(addr))sign_extend64((u64)(addr), 55)) +#ifdef CONFIG_KASAN_SW_TAGS +#define __tag_shifted(tag) ((u64)(tag) << 56) +#define __tag_set(addr, tag) (__typeof__(addr))( \ + ((u64)(addr) & ~__tag_shifted(0xff)) | __tag_shifted(tag)) +#define __tag_reset(addr) untagged_addr(addr) +#define __tag_get(addr) (__u8)((u64)(addr) >> 56) +#else +#define __tag_set(addr, tag) (addr) +#define __tag_reset(addr) (addr) +#define __tag_get(addr) 0 +#endif + /* * Physical vs virtual RAM address space conversion. These are * private definitions which should NOT be used outside memory.h diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index 38fa4bba9279..3e142add890b 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -252,6 +252,8 @@ void __init kasan_init(void) memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE); cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); + kasan_init_tags(); + /* At this point kasan is fully initialized. Enable error messages */ init_task.kasan_depth = 0; pr_info("KernelAddressSanitizer initialized\n"); diff --git a/include/linux/kasan.h b/include/linux/kasan.h index c56af24bd3e7..a477ce2abdc9 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -169,6 +169,19 @@ static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} #define KASAN_SHADOW_INIT 0xFF +void kasan_init_tags(void); + +void *kasan_reset_tag(const void *addr); + +#else /* CONFIG_KASAN_SW_TAGS */ + +static inline void kasan_init_tags(void) { } + +static inline void *kasan_reset_tag(const void *addr) +{ + return (void *)addr; +} + #endif /* CONFIG_KASAN_SW_TAGS */ #endif /* LINUX_KASAN_H */ diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 19b950eaccff..b080b8d92812 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -8,6 +8,10 @@ #define KASAN_SHADOW_SCALE_SIZE (1UL << KASAN_SHADOW_SCALE_SHIFT) #define KASAN_SHADOW_MASK (KASAN_SHADOW_SCALE_SIZE - 1) +#define KASAN_TAG_KERNEL 0xFF /* native kernel pointers tag */ +#define KASAN_TAG_INVALID 0xFE /* inaccessible memory tag */ +#define KASAN_TAG_MAX 0xFD /* maximum value for random tags */ + #define KASAN_FREE_PAGE 0xFF /* page was freed */ #define KASAN_PAGE_REDZONE 0xFE /* redzone for kmalloc_large allocations */ #define KASAN_KMALLOC_REDZONE 0xFC /* redzone inside slub object */ @@ -126,6 +130,33 @@ static inline void quarantine_reduce(void) { } static inline void quarantine_remove_cache(struct kmem_cache *cache) { } #endif +#ifdef CONFIG_KASAN_SW_TAGS + +u8 random_tag(void); + +#else + +static inline u8 random_tag(void) +{ + return 0; +} + +#endif + +#ifndef arch_kasan_set_tag +#define arch_kasan_set_tag(addr, tag) ((void *)(addr)) +#endif +#ifndef arch_kasan_reset_tag +#define arch_kasan_reset_tag(addr) ((void *)(addr)) +#endif +#ifndef arch_kasan_get_tag +#define arch_kasan_get_tag(addr) 0 +#endif + +#define set_tag(addr, tag) ((void *)arch_kasan_set_tag((addr), (tag))) +#define reset_tag(addr) ((void *)arch_kasan_reset_tag(addr)) +#define get_tag(addr) arch_kasan_get_tag(addr) + /* * Exported functions for interfaces called from assembly or from generated * code. Declarations here to avoid warning about missing declarations. diff --git a/mm/kasan/tags.c b/mm/kasan/tags.c index 04194923c543..1c4e7ce2e6fe 100644 --- a/mm/kasan/tags.c +++ b/mm/kasan/tags.c @@ -38,6 +38,43 @@ #include "kasan.h" #include "../slab.h" +static DEFINE_PER_CPU(u32, prng_state); + +void kasan_init_tags(void) +{ + int cpu; + + for_each_possible_cpu(cpu) + per_cpu(prng_state, cpu) = get_random_u32(); +} + +/* + * If a preemption happens between this_cpu_read and this_cpu_write, the only + * side effect is that we'll give a few allocated in different contexts objects + * the same tag. Since tag-based KASAN is meant to be used a probabilistic + * bug-detection debug feature, this doesn't have significant negative impact. + * + * Ideally the tags use strong randomness to prevent any attempts to predict + * them during explicit exploit attempts. But strong randomness is expensive, + * and we did an intentional trade-off to use a PRNG. This non-atomic RMW + * sequence has in fact positive effect, since interrupts that randomly skew + * PRNG at unpredictable points do only good. + */ +u8 random_tag(void) +{ + u32 state = this_cpu_read(prng_state); + + state = 1664525 * state + 1013904223; + this_cpu_write(prng_state, state); + + return (u8)(state % (KASAN_TAG_MAX + 1)); +} + +void *kasan_reset_tag(const void *addr) +{ + return reset_tag(addr); +} + void check_memory_region(unsigned long addr, size_t size, bool write, unsigned long ret_ip) { -- cgit v1.2.3-59-g8ed1b From 5b7c4148222d7acaa1612e5eec84fc66c88d54f3 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:46 -0800 Subject: mm: move obj_to_index to include/linux/slab_def.h While with SLUB we can actually preassign tags for caches with contructors and store them in pointers in the freelist, SLAB doesn't allow that since the freelist is stored as an array of indexes, so there are no pointers to store the tags. Instead we compute the tag twice, once when a slab is created before calling the constructor and then again each time when an object is allocated with kmalloc. Tag is computed simply by taking the lowest byte of the index that corresponds to the object. However in kasan_kmalloc we only have access to the objects pointer, so we need a way to find out which index this object corresponds to. This patch moves obj_to_index from slab.c to include/linux/slab_def.h to be reused by KASAN. Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab_def.h | 13 +++++++++++++ mm/slab.c | 13 ------------- 2 files changed, 13 insertions(+), 13 deletions(-) (limited to 'include') diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 3485c58cfd1c..9a5eafb7145b 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -104,4 +104,17 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, return object; } +/* + * We want to avoid an expensive divide : (offset / cache->size) + * Using the fact that size is a constant for a particular cache, + * we can replace (offset / cache->size) by + * reciprocal_divide(offset, cache->reciprocal_buffer_size) + */ +static inline unsigned int obj_to_index(const struct kmem_cache *cache, + const struct page *page, void *obj) +{ + u32 offset = (obj - page->s_mem); + return reciprocal_divide(offset, cache->reciprocal_buffer_size); +} + #endif /* _LINUX_SLAB_DEF_H */ diff --git a/mm/slab.c b/mm/slab.c index 15e53cef0378..a80beb543678 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -406,19 +406,6 @@ static inline void *index_to_obj(struct kmem_cache *cache, struct page *page, return page->s_mem + cache->size * idx; } -/* - * We want to avoid an expensive divide : (offset / cache->size) - * Using the fact that size is a constant for a particular cache, - * we can replace (offset / cache->size) by - * reciprocal_divide(offset, cache->reciprocal_buffer_size) - */ -static inline unsigned int obj_to_index(const struct kmem_cache *cache, - const struct page *page, void *obj) -{ - u32 offset = (obj - page->s_mem); - return reciprocal_divide(offset, cache->reciprocal_buffer_size); -} - #define BOOT_CPUCACHE_ENTRIES 1 /* internal cache of cache description objs */ static struct kmem_cache kmem_cache_boot = { -- cgit v1.2.3-59-g8ed1b From 41eea9cd239c5b3fff726894f85c97f60e5799a3 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:54 -0800 Subject: kasan, arm64: add brk handler for inline instrumentation Tag-based KASAN inline instrumentation mode (which embeds checks of shadow memory into the generated code, instead of inserting a callback) generates a brk instruction when a tag mismatch is detected. This commit adds a tag-based KASAN specific brk handler, that decodes the immediate value passed to the brk instructions (to extract information about the memory access that triggered the mismatch), reads the register values (x0 contains the guilty address) and reports the bug. Link: http://lkml.kernel.org/r/c91fe7684070e34dc34b419e6b69498f4dcacc2d.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Will Deacon Cc: Christoph Lameter Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/asm/brk-imm.h | 2 ++ arch/arm64/kernel/traps.c | 60 ++++++++++++++++++++++++++++++++++++++++ include/linux/kasan.h | 3 ++ 3 files changed, 65 insertions(+) (limited to 'include') diff --git a/arch/arm64/include/asm/brk-imm.h b/arch/arm64/include/asm/brk-imm.h index ed693c5bcec0..2945fe6cd863 100644 --- a/arch/arm64/include/asm/brk-imm.h +++ b/arch/arm64/include/asm/brk-imm.h @@ -16,10 +16,12 @@ * 0x400: for dynamic BRK instruction * 0x401: for compile time BRK instruction * 0x800: kernel-mode BUG() and WARN() traps + * 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff) */ #define FAULT_BRK_IMM 0x100 #define KGDB_DYN_DBG_BRK_IMM 0x400 #define KGDB_COMPILED_DBG_BRK_IMM 0x401 #define BUG_BRK_IMM 0x800 +#define KASAN_BRK_IMM 0x900 #endif diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index 5f4d9acb32f5..cdc71cf70aad 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -969,6 +970,58 @@ static struct break_hook bug_break_hook = { .fn = bug_handler, }; +#ifdef CONFIG_KASAN_SW_TAGS + +#define KASAN_ESR_RECOVER 0x20 +#define KASAN_ESR_WRITE 0x10 +#define KASAN_ESR_SIZE_MASK 0x0f +#define KASAN_ESR_SIZE(esr) (1 << ((esr) & KASAN_ESR_SIZE_MASK)) + +static int kasan_handler(struct pt_regs *regs, unsigned int esr) +{ + bool recover = esr & KASAN_ESR_RECOVER; + bool write = esr & KASAN_ESR_WRITE; + size_t size = KASAN_ESR_SIZE(esr); + u64 addr = regs->regs[0]; + u64 pc = regs->pc; + + if (user_mode(regs)) + return DBG_HOOK_ERROR; + + kasan_report(addr, size, write, pc); + + /* + * The instrumentation allows to control whether we can proceed after + * a crash was detected. This is done by passing the -recover flag to + * the compiler. Disabling recovery allows to generate more compact + * code. + * + * Unfortunately disabling recovery doesn't work for the kernel right + * now. KASAN reporting is disabled in some contexts (for example when + * the allocator accesses slab object metadata; this is controlled by + * current->kasan_depth). All these accesses are detected by the tool, + * even though the reports for them are not printed. + * + * This is something that might be fixed at some point in the future. + */ + if (!recover) + die("Oops - KASAN", regs, 0); + + /* If thread survives, skip over the brk instruction and continue: */ + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); + return DBG_HOOK_HANDLED; +} + +#define KASAN_ESR_VAL (0xf2000000 | KASAN_BRK_IMM) +#define KASAN_ESR_MASK 0xffffff00 + +static struct break_hook kasan_break_hook = { + .esr_val = KASAN_ESR_VAL, + .esr_mask = KASAN_ESR_MASK, + .fn = kasan_handler, +}; +#endif + /* * Initial handler for AArch64 BRK exceptions * This handler only used until debug_traps_init(). @@ -976,6 +1029,10 @@ static struct break_hook bug_break_hook = { int __init early_brk64(unsigned long addr, unsigned int esr, struct pt_regs *regs) { +#ifdef CONFIG_KASAN_SW_TAGS + if ((esr & KASAN_ESR_MASK) == KASAN_ESR_VAL) + return kasan_handler(regs, esr) != DBG_HOOK_HANDLED; +#endif return bug_handler(regs, esr) != DBG_HOOK_HANDLED; } @@ -983,4 +1040,7 @@ int __init early_brk64(unsigned long addr, unsigned int esr, void __init trap_init(void) { register_break_hook(&bug_break_hook); +#ifdef CONFIG_KASAN_SW_TAGS + register_break_hook(&kasan_break_hook); +#endif } diff --git a/include/linux/kasan.h b/include/linux/kasan.h index a477ce2abdc9..8da7b7a4397a 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -173,6 +173,9 @@ void kasan_init_tags(void); void *kasan_reset_tag(const void *addr); +void kasan_report(unsigned long addr, size_t size, + bool is_write, unsigned long ip); + #else /* CONFIG_KASAN_SW_TAGS */ static inline void kasan_init_tags(void) { } -- cgit v1.2.3-59-g8ed1b From 2813b9c0296259fb11e75c839bab2d958ba4f96c Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:57 -0800 Subject: kasan, mm, arm64: tag non slab memory allocated via pagealloc Tag-based KASAN doesn't check memory accesses through pointers tagged with 0xff. When page_address is used to get pointer to memory that corresponds to some page, the tag of the resulting pointer gets set to 0xff, even though the allocated memory might have been tagged differently. For slab pages it's impossible to recover the correct tag to return from page_address, since the page might contain multiple slab objects tagged with different values, and we can't know in advance which one of them is going to get accessed. For non slab pages however, we can recover the tag in page_address, since the whole page was marked with the same tag. This patch adds tagging to non slab memory allocated with pagealloc. To set the tag of the pointer returned from page_address, the tag gets stored to page->flags when the memory gets allocated. Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Will Deacon Cc: Christoph Lameter Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/asm/memory.h | 8 +++++++- include/linux/mm.h | 29 +++++++++++++++++++++++++++++ include/linux/page-flags-layout.h | 10 ++++++++++ mm/cma.c | 11 +++++++++++ mm/kasan/common.c | 15 +++++++++++++-- mm/page_alloc.c | 1 + mm/slab.c | 2 +- 7 files changed, 72 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index 907946cc767c..2bb8721da7ef 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -321,7 +321,13 @@ static inline void *phys_to_virt(phys_addr_t x) #define __virt_to_pgoff(kaddr) (((u64)(kaddr) & ~PAGE_OFFSET) / PAGE_SIZE * sizeof(struct page)) #define __page_to_voff(kaddr) (((u64)(kaddr) & ~VMEMMAP_START) * PAGE_SIZE / sizeof(struct page)) -#define page_to_virt(page) ((void *)((__page_to_voff(page)) | PAGE_OFFSET)) +#define page_to_virt(page) ({ \ + unsigned long __addr = \ + ((__page_to_voff(page)) | PAGE_OFFSET); \ + __addr = __tag_set(__addr, page_kasan_tag(page)); \ + ((void *)__addr); \ +}) + #define virt_to_page(vaddr) ((struct page *)((__virt_to_pgoff(vaddr)) | VMEMMAP_START)) #define _virt_addr_valid(kaddr) pfn_valid((((u64)(kaddr) & ~PAGE_OFFSET) \ diff --git a/include/linux/mm.h b/include/linux/mm.h index 5411de93a363..b4d01969e700 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -804,6 +804,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH) #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) +#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) /* * Define the bit shifts to access each section. For non-existent @@ -814,6 +815,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0)) #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0)) #define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0)) +#define KASAN_TAG_PGSHIFT (KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0)) /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */ #ifdef NODE_NOT_IN_PAGE_FLAGS @@ -836,6 +838,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_MASK ((1UL << NODES_WIDTH) - 1) #define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1) #define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_SHIFT) - 1) +#define KASAN_TAG_MASK ((1UL << KASAN_TAG_WIDTH) - 1) #define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1) static inline enum zone_type page_zonenum(const struct page *page) @@ -1101,6 +1104,32 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) } #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_KASAN_SW_TAGS +static inline u8 page_kasan_tag(const struct page *page) +{ + return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK; +} + +static inline void page_kasan_tag_set(struct page *page, u8 tag) +{ + page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT); + page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT; +} + +static inline void page_kasan_tag_reset(struct page *page) +{ + page_kasan_tag_set(page, 0xff); +} +#else +static inline u8 page_kasan_tag(const struct page *page) +{ + return 0xff; +} + +static inline void page_kasan_tag_set(struct page *page, u8 tag) { } +static inline void page_kasan_tag_reset(struct page *page) { } +#endif + static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index 7ec86bf31ce4..1dda31825ec4 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -82,6 +82,16 @@ #define LAST_CPUPID_WIDTH 0 #endif +#ifdef CONFIG_KASAN_SW_TAGS +#define KASAN_TAG_WIDTH 8 +#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+LAST_CPUPID_WIDTH+KASAN_TAG_WIDTH \ + > BITS_PER_LONG - NR_PAGEFLAGS +#error "KASAN: not enough bits in page flags for tag" +#endif +#else +#define KASAN_TAG_WIDTH 0 +#endif + /* * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. diff --git a/mm/cma.c b/mm/cma.c index 4cb76121a3ab..c7b39dd3b4f6 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -407,6 +407,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, unsigned long pfn = -1; unsigned long start = 0; unsigned long bitmap_maxno, bitmap_no, bitmap_count; + size_t i; struct page *page = NULL; int ret = -ENOMEM; @@ -466,6 +467,16 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, trace_cma_alloc(pfn, page, count, align); + /* + * CMA can allocate multiple page blocks, which results in different + * blocks being marked with different tags. Reset the tags to ignore + * those page blocks. + */ + if (page) { + for (i = 0; i < count; i++) + page_kasan_tag_reset(page + i); + } + if (ret && !no_warn) { pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n", __func__, count, ret); diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 27f0cae336c9..195ca385cf7a 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -220,8 +220,15 @@ void kasan_unpoison_stack_above_sp_to(const void *watermark) void kasan_alloc_pages(struct page *page, unsigned int order) { + u8 tag; + unsigned long i; + if (unlikely(PageHighMem(page))) return; + + tag = random_tag(); + for (i = 0; i < (1 << order); i++) + page_kasan_tag_set(page + i, tag); kasan_unpoison_shadow(page_address(page), PAGE_SIZE << order); } @@ -319,6 +326,10 @@ struct kasan_free_meta *get_free_info(struct kmem_cache *cache, void kasan_poison_slab(struct page *page) { + unsigned long i; + + for (i = 0; i < (1 << compound_order(page)); i++) + page_kasan_tag_reset(page + i); kasan_poison_shadow(page_address(page), PAGE_SIZE << compound_order(page), KASAN_KMALLOC_REDZONE); @@ -517,7 +528,7 @@ void kasan_poison_kfree(void *ptr, unsigned long ip) page = virt_to_head_page(ptr); if (unlikely(!PageSlab(page))) { - if (reset_tag(ptr) != page_address(page)) { + if (ptr != page_address(page)) { kasan_report_invalid_free(ptr, ip); return; } @@ -530,7 +541,7 @@ void kasan_poison_kfree(void *ptr, unsigned long ip) void kasan_kfree_large(void *ptr, unsigned long ip) { - if (reset_tag(ptr) != page_address(virt_to_head_page(ptr))) + if (ptr != page_address(virt_to_head_page(ptr))) kasan_report_invalid_free(ptr, ip); /* The object will be poisoned by page_alloc. */ } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e95b5b7c9c3d..d245de2124e3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1183,6 +1183,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); + page_kasan_tag_reset(page); INIT_LIST_HEAD(&page->lru); #ifdef WANT_PAGE_VIRTUAL diff --git a/mm/slab.c b/mm/slab.c index a80beb543678..01991060714c 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2357,7 +2357,7 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep, void *freelist; void *addr = page_address(page); - page->s_mem = addr + colour_off; + page->s_mem = kasan_reset_tag(addr) + colour_off; page->active = 0; if (OBJFREELIST_SLAB(cachep)) -- cgit v1.2.3-59-g8ed1b From 66afc7f1e07a1db74453be9167ac0d1205653854 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:31:01 -0800 Subject: kasan: add __must_check annotations to kasan hooks This patch adds __must_check annotations to kasan hooks that return a pointer to make sure that a tagged pointer always gets propagated. Link: http://lkml.kernel.org/r/03b269c5e453945f724bfca3159d4e1333a8fb1c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Suggested-by: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 16 ++++++++++------ mm/kasan/common.c | 15 +++++++++------ 2 files changed, 19 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 8da7b7a4397a..b40ea104dd36 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -49,16 +49,20 @@ void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); void kasan_poison_object_data(struct kmem_cache *cache, void *object); -void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object); +void * __must_check kasan_init_slab_obj(struct kmem_cache *cache, + const void *object); -void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); +void * __must_check kasan_kmalloc_large(const void *ptr, size_t size, + gfp_t flags); void kasan_kfree_large(void *ptr, unsigned long ip); void kasan_poison_kfree(void *ptr, unsigned long ip); -void *kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, - gfp_t flags); -void *kasan_krealloc(const void *object, size_t new_size, gfp_t flags); +void * __must_check kasan_kmalloc(struct kmem_cache *s, const void *object, + size_t size, gfp_t flags); +void * __must_check kasan_krealloc(const void *object, size_t new_size, + gfp_t flags); -void *kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); +void * __must_check kasan_slab_alloc(struct kmem_cache *s, void *object, + gfp_t flags); bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip); struct kasan_cache { diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 195ca385cf7a..1144e741feb6 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -373,7 +373,8 @@ static u8 assign_tag(struct kmem_cache *cache, const void *object, bool new) #endif } -void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object) +void * __must_check kasan_init_slab_obj(struct kmem_cache *cache, + const void *object) { struct kasan_alloc_meta *alloc_info; @@ -389,7 +390,8 @@ void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object) return (void *)object; } -void *kasan_slab_alloc(struct kmem_cache *cache, void *object, gfp_t flags) +void * __must_check kasan_slab_alloc(struct kmem_cache *cache, void *object, + gfp_t flags) { return kasan_kmalloc(cache, object, cache->object_size, flags); } @@ -449,8 +451,8 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object, unsigned long ip) return __kasan_slab_free(cache, object, ip, true); } -void *kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, - gfp_t flags) +void * __must_check kasan_kmalloc(struct kmem_cache *cache, const void *object, + size_t size, gfp_t flags) { unsigned long redzone_start; unsigned long redzone_end; @@ -482,7 +484,8 @@ void *kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, } EXPORT_SYMBOL(kasan_kmalloc); -void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags) +void * __must_check kasan_kmalloc_large(const void *ptr, size_t size, + gfp_t flags) { struct page *page; unsigned long redzone_start; @@ -506,7 +509,7 @@ void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags) return (void *)ptr; } -void *kasan_krealloc(const void *object, size_t size, gfp_t flags) +void * __must_check kasan_krealloc(const void *object, size_t size, gfp_t flags) { struct page *page; -- cgit v1.2.3-59-g8ed1b From 4e45f712d82c6b7a37e02faf388173ad12ab464d Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Fri, 28 Dec 2018 00:33:17 -0800 Subject: include/linux/slab.h: fix sparse warning in kmalloc_type() Multiple people have reported the following sparse warning: ./include/linux/slab.h:332:43: warning: dubious: x & !y The minimal fix would be to change the logical & to boolean &&, which emits the same code, but Andrew has suggested that the branch-avoiding tricks are maybe not worthwile. David Laight provided a nice comparison of disassembly of multiple variants, which shows that the current version produces a 4 deep dependency chain, and fixing the sparse warning by changing logical and to multiplication emits an IMUL, making it even more expensive. The code as rewritten by this patch yielded the best disassembly, with a single predictable branch for the most common case, and a ternary operator for the rest, which gcc seems to compile without a branch or cmov by itself. The result should be more readable, without a sparse warning and probably also faster for the common case. Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Reviewed-by: Andrew Morton Signed-off-by: Vlastimil Babka Reported-by: Bart Van Assche Reported-by: Darryl T. Agostinelli Reported-by: Masahiro Yamada Suggested-by: Andrew Morton Suggested-by: David Laight Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/slab.h b/include/linux/slab.h index 351ac48dabc4..6d9bd6fc0c57 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -314,22 +314,22 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) { - int is_dma = 0; - int type_dma = 0; - int is_reclaimable; - #ifdef CONFIG_ZONE_DMA - is_dma = !!(flags & __GFP_DMA); - type_dma = is_dma * KMALLOC_DMA; -#endif - - is_reclaimable = !!(flags & __GFP_RECLAIMABLE); + /* + * The most common case is KMALLOC_NORMAL, so test for it + * with a single branch for both flags. + */ + if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) + return KMALLOC_NORMAL; /* - * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return - * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE + * At least one of the flags has to be set. If both are, __GFP_DMA + * is more important. */ - return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM; + return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM; +#else + return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL; +#endif } /* -- cgit v1.2.3-59-g8ed1b From 6a90a83f1d1957647581ca48caa1f7cc4fa44f8d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 28 Dec 2018 00:33:28 -0800 Subject: mm/mmu_notifier.c: remove mmu_notifier_synchronize() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Contrary to its name, mmu_notifier_synchronize() does not synchronize the notifier's SRCU instance, but rather waits for RCU callbacks to finish. i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on this matter, explicitly calling out that rcu_barrier() does not imply synchronize_rcu(). As there are no callers of mmu_notifier_synchronize() and it's unclear whether any user of mmu_notifier_call_srcu() will ever want to barrier on their callbacks, simply remove the function. Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson Reviewed-by: Andrew Morton Cc: Peter Zijlstra Cc: Jérôme Glisse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmu_notifier.h | 1 - mm/mmu_notifier.c | 7 ------- 2 files changed, 8 deletions(-) (limited to 'include') diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 9893a6432adf..913c3c13e36e 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -420,7 +420,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) extern void mmu_notifier_call_srcu(struct rcu_head *rcu, void (*func)(struct rcu_head *rcu)); -extern void mmu_notifier_synchronize(void); #else /* CONFIG_MMU_NOTIFIER */ diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 5119ff846769..755466cd289a 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -35,13 +35,6 @@ void mmu_notifier_call_srcu(struct rcu_head *rcu, } EXPORT_SYMBOL_GPL(mmu_notifier_call_srcu); -void mmu_notifier_synchronize(void) -{ - /* Wait for any running method to finish. */ - srcu_barrier(&srcu); -} -EXPORT_SYMBOL_GPL(mmu_notifier_synchronize); - /* * This function can't run concurrently against mmu_notifier_register * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap -- cgit v1.2.3-59-g8ed1b From 368686a95e55fd66b88542b5b23d802a4886b1aa Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Fri, 28 Dec 2018 00:33:31 -0800 Subject: writeback: don't decrement wb->refcnt if !wb->bdi This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet. Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers. Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann Signed-off-by: Anders Roxell Co-developed-by: Arnd Bergmann Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev-defs.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'include') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 9a6bc0951cfa..c31157135598 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -258,6 +258,14 @@ static inline void wb_get(struct bdi_writeback *wb) */ static inline void wb_put(struct bdi_writeback *wb) { + if (WARN_ON_ONCE(!wb->bdi)) { + /* + * A driver bug might cause a file to be removed before bdi was + * initialized. + */ + return; + } + if (wb != &wb->bdi->wb) percpu_ref_put(&wb->refcnt); } -- cgit v1.2.3-59-g8ed1b From d381c54760dcfad23743da40516e7e003d73952a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:33:56 -0800 Subject: mm: only report isolation failures when offlining memory Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org Signed-off-by: Michal Hocko Reported-by: Heiko Carstens Reviewed-by: Oscar Salvador Cc: Anshuman Khandual Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/page-isolation.h | 11 +++++++++-- mm/memory_hotplug.c | 5 +++-- mm/page_alloc.c | 11 +++++------ mm/page_isolation.c | 10 ++++------ 4 files changed, 21 insertions(+), 16 deletions(-) (limited to 'include') diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 4ae347cbc36d..4eb26d278046 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -30,8 +30,11 @@ static inline bool is_migrate_isolate(int migratetype) } #endif +#define SKIP_HWPOISON 0x1 +#define REPORT_FAILURE 0x2 + bool has_unmovable_pages(struct zone *zone, struct page *page, int count, - int migratetype, bool skip_hwpoisoned_pages); + int migratetype, int flags); void set_pageblock_migratetype(struct page *page, int migratetype); int move_freepages_block(struct zone *zone, struct page *page, int migratetype, int *num_movable); @@ -44,10 +47,14 @@ int move_freepages_block(struct zone *zone, struct page *page, * For isolating all pages in the range finally, the caller have to * free all pages in the range. test_page_isolated() can be used for * test it. + * + * The following flags are allowed (they can be combined in a bit mask) + * SKIP_HWPOISON - ignore hwpoison pages + * REPORT_FAILURE - report details about the failure to isolate the range */ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - unsigned migratetype, bool skip_hwpoisoned_pages); + unsigned migratetype, int flags); /* * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c82193db4be6..8537429d33a6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1226,7 +1226,7 @@ static bool is_pageblock_removable_nolock(struct page *page) if (!zone_spans_pfn(zone, pfn)) return false; - return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, true); + return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, SKIP_HWPOISON); } /* Checks if this range of memory is likely to be hot-removable. */ @@ -1577,7 +1577,8 @@ static int __ref __offline_pages(unsigned long start_pfn, /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, - MIGRATE_MOVABLE, true); + MIGRATE_MOVABLE, + SKIP_HWPOISON | REPORT_FAILURE); if (ret) { mem_hotplug_done(); reason = "failure to isolate range"; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c6f090e9a112..ce9e88577bde 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7767,8 +7767,7 @@ void *__init alloc_large_system_hash(const char *tablename, * race condition. So you can't expect this function should be exact. */ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, - int migratetype, - bool skip_hwpoisoned_pages) + int migratetype, int flags) { unsigned long pfn, iter, found; @@ -7842,7 +7841,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * The HWPoisoned page may be not in buddy system, and * page_count() is not 0. */ - if (skip_hwpoisoned_pages && PageHWPoison(page)) + if ((flags & SKIP_HWPOISON) && PageHWPoison(page)) continue; if (__PageMovable(page)) @@ -7869,7 +7868,8 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, return false; unmovable: WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); - dump_page(pfn_to_page(pfn+iter), "unmovable page"); + if (flags & REPORT_FAILURE) + dump_page(pfn_to_page(pfn+iter), "unmovable page"); return true; } @@ -7996,8 +7996,7 @@ int alloc_contig_range(unsigned long start, unsigned long end, */ ret = start_isolate_page_range(pfn_max_align_down(start), - pfn_max_align_up(end), migratetype, - false); + pfn_max_align_up(end), migratetype, 0); if (ret) return ret; diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 43e085608846..ce323e56b34d 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -15,8 +15,7 @@ #define CREATE_TRACE_POINTS #include -static int set_migratetype_isolate(struct page *page, int migratetype, - bool skip_hwpoisoned_pages) +static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags) { struct zone *zone; unsigned long flags, pfn; @@ -60,8 +59,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. * We just check MOVABLE pages. */ - if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype, - skip_hwpoisoned_pages)) + if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype, flags)) ret = 0; /* @@ -185,7 +183,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * prevents two threads from simultaneously working on overlapping ranges. */ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - unsigned migratetype, bool skip_hwpoisoned_pages) + unsigned migratetype, int flags) { unsigned long pfn; unsigned long undo_pfn; @@ -199,7 +197,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, pfn += pageblock_nr_pages) { page = __first_valid_page(pfn, pageblock_nr_pages); if (page && - set_migratetype_isolate(page, migratetype, skip_hwpoisoned_pages)) { + set_migratetype_isolate(page, migratetype, flags)) { undo_pfn = pfn; goto undo; } -- cgit v1.2.3-59-g8ed1b From 0b9df58b79fa283fbedc0fb6a8e248599444bacc Mon Sep 17 00:00:00 2001 From: Timofey Titovets Date: Fri, 28 Dec 2018 00:34:00 -0800 Subject: xxHash: create arch dependent 32/64-bit xxhash() Patch series "Currently used jhash are slow enough and replace it allow as to make KSM", v8. Apeed (in kernel): ksm: crc32c hash() 12081 MB/s ksm: xxh64 hash() 8770 MB/s ksm: xxh32 hash() 4529 MB/s ksm: jhash2 hash() 1569 MB/s Sioh Lee's testing (copy from other mail): Test platform: openstack cloud platform (NEWTON version) Experiment node: openstack based cloud compute node (CPU: xeon E5-2620 v3, memory 64gb) VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4 Linux kernel: 4.14 (latest version) KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200 Experiment process: Firstly, we turn off KSM and launch 4 VMs. Then we turn on the KSM and measure the checksum computation time until full_scans become two. The experimental results (the experimental value is the average of the measured values) crc32c_intel: 1084.10ns crc32c (no hardware acceleration): 7012.51ns xxhash32: 2227.75ns xxhash64: 1413.16ns jhash2: 5128.30ns In summary, the result shows that crc32c_intel has advantages over all of the hash function used in the experiment. (decreased by 84.54% compared to crc32c, 78.86% compared to jhash2, 51.33% xxhash32, 23.28% compared to xxhash64) the results are similar to those of Timofey. But, use only xxhash for now, because for using crc32c, cryptoapi must be initialized first - that require some tricky solution to work good in all situations. So: - First patch implement compile time pickup of fastest implementation of xxhash for target platform. - The second patch replaces jhash2 with xxhash This patch (of 2): xxh32() - fast on both 32/64-bit platforms xxh64() - fast only on 64-bit platform Create xxhash() which will pick up the fastest version at compile time. Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets Reviewed-by: Pavel Tatashin Reviewed-by: Mike Rapoport Reviewed-by: Andrew Morton Cc: Andrea Arcangeli Cc: leesioh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/xxhash.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'include') diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h index 9e1f42cb57e9..52b073fea17f 100644 --- a/include/linux/xxhash.h +++ b/include/linux/xxhash.h @@ -107,6 +107,29 @@ uint32_t xxh32(const void *input, size_t length, uint32_t seed); */ uint64_t xxh64(const void *input, size_t length, uint64_t seed); +/** + * xxhash() - calculate wordsize hash of the input with a given seed + * @input: The data to hash. + * @length: The length of the data to hash. + * @seed: The seed can be used to alter the result predictably. + * + * If the hash does not need to be comparable between machines with + * different word sizes, this function will call whichever of xxh32() + * or xxh64() is faster. + * + * Return: wordsize hash of the data. + */ + +static inline unsigned long xxhash(const void *input, size_t length, + uint64_t seed) +{ +#if BITS_PER_LONG == 64 + return xxh64(input, length, seed); +#else + return xxh32(input, length, seed); +#endif +} + /*-**************************** * Streaming Hash Functions *****************************/ -- cgit v1.2.3-59-g8ed1b From 9705bea5f833f4fc21d5bef5fce7348427f76ea4 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:24 -0800 Subject: mm: convert zone->managed_pages to atomic variable totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Suggested-by: Michal Hocko Suggested-by: Vlastimil Babka Reviewed-by: Konstantin Khlebnikov Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Vlastimil Babka Reviewed-by: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 2 +- include/linux/mmzone.h | 9 +++++-- lib/show_mem.c | 2 +- mm/memblock.c | 2 +- mm/page_alloc.c | 44 +++++++++++++++++------------------ mm/vmstat.c | 4 ++-- 6 files changed, 34 insertions(+), 29 deletions(-) (limited to 'include') diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c index c02adbbeef2a..b7bc7d7d048f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c @@ -853,7 +853,7 @@ static int kfd_fill_mem_info_for_cpu(int numa_node_id, int *avail_size, */ pgdat = NODE_DATA(numa_node_id); for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) - mem_in_bytes += pgdat->node_zones[zone_type].managed_pages; + mem_in_bytes += zone_managed_pages(&pgdat->node_zones[zone_type]); mem_in_bytes <<= PAGE_SHIFT; sub_type_hdr->length_low = lower_32_bits(mem_in_bytes); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 077d797d1f60..a23e34e21178 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -435,7 +435,7 @@ struct zone { * adjust_managed_page_count() should be used instead of directly * touching zone->managed_pages and totalram_pages. */ - unsigned long managed_pages; + atomic_long_t managed_pages; unsigned long spanned_pages; unsigned long present_pages; @@ -524,6 +524,11 @@ enum pgdat_flags { PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */ }; +static inline unsigned long zone_managed_pages(struct zone *zone) +{ + return (unsigned long)atomic_long_read(&zone->managed_pages); +} + static inline unsigned long zone_end_pfn(const struct zone *zone) { return zone->zone_start_pfn + zone->spanned_pages; @@ -820,7 +825,7 @@ static inline bool is_dev_zone(const struct zone *zone) */ static inline bool managed_zone(struct zone *zone) { - return zone->managed_pages; + return zone_managed_pages(zone); } /* Returns true if a zone has memory */ diff --git a/lib/show_mem.c b/lib/show_mem.c index 0beaa1d899aa..eefe67d50e84 100644 --- a/lib/show_mem.c +++ b/lib/show_mem.c @@ -28,7 +28,7 @@ void show_mem(unsigned int filter, nodemask_t *nodemask) continue; total += zone->present_pages; - reserved += zone->present_pages - zone->managed_pages; + reserved += zone->present_pages - zone_managed_pages(zone); if (is_highmem_idx(zoneid)) highmem += zone->present_pages; diff --git a/mm/memblock.c b/mm/memblock.c index 81ae63ca78d0..0068f87af1e8 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1950,7 +1950,7 @@ void reset_node_managed_pages(pg_data_t *pgdat) struct zone *z; for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++) - z->managed_pages = 0; + atomic_long_set(&z->managed_pages, 0); } void __init reset_all_zones_managed_pages(void) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b79e79caea99..4b5c4ff68f18 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1280,7 +1280,7 @@ static void __init __free_pages_boot_core(struct page *page, unsigned int order) __ClearPageReserved(p); set_page_count(p, 0); - page_zone(page)->managed_pages += nr_pages; + atomic_long_add(nr_pages, &page_zone(page)->managed_pages); set_page_refcounted(page); __free_pages(page, order); } @@ -2259,7 +2259,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone, * Limit the number reserved to 1 pageblock or roughly 1% of a zone. * Check is race-prone but harmless. */ - max_managed = (zone->managed_pages / 100) + pageblock_nr_pages; + max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages; if (zone->nr_reserved_highatomic >= max_managed) return; @@ -4661,7 +4661,7 @@ static unsigned long nr_free_zone_pages(int offset) struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); for_each_zone_zonelist(zone, z, zonelist, offset) { - unsigned long size = zone->managed_pages; + unsigned long size = zone_managed_pages(zone); unsigned long high = high_wmark_pages(zone); if (size > high) sum += size - high; @@ -4768,7 +4768,7 @@ void si_meminfo_node(struct sysinfo *val, int nid) pg_data_t *pgdat = NODE_DATA(nid); for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) - managed_pages += pgdat->node_zones[zone_type].managed_pages; + managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]); val->totalram = managed_pages; val->sharedram = node_page_state(pgdat, NR_SHMEM); val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES); @@ -4777,7 +4777,7 @@ void si_meminfo_node(struct sysinfo *val, int nid) struct zone *zone = &pgdat->node_zones[zone_type]; if (is_highmem(zone)) { - managed_highpages += zone->managed_pages; + managed_highpages += zone_managed_pages(zone); free_highpages += zone_page_state(zone, NR_FREE_PAGES); } } @@ -4984,7 +4984,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)), K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)), K(zone->present_pages), - K(zone->managed_pages), + K(zone_managed_pages(zone)), K(zone_page_state(zone, NR_MLOCK)), zone_page_state(zone, NR_KERNEL_STACK_KB), K(zone_page_state(zone, NR_PAGETABLE)), @@ -5656,7 +5656,7 @@ static int zone_batchsize(struct zone *zone) * The per-cpu-pages pools are set to around 1000th of the * size of the zone. */ - batch = zone->managed_pages / 1024; + batch = zone_managed_pages(zone) / 1024; /* But no more than a meg. */ if (batch * PAGE_SIZE > 1024 * 1024) batch = (1024 * 1024) / PAGE_SIZE; @@ -5766,7 +5766,7 @@ static void pageset_set_high_and_batch(struct zone *zone, { if (percpu_pagelist_fraction) pageset_set_high(pcp, - (zone->managed_pages / + (zone_managed_pages(zone) / percpu_pagelist_fraction)); else pageset_set_batch(pcp, zone_batchsize(zone)); @@ -6323,7 +6323,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid, unsigned long remaining_pages) { - zone->managed_pages = remaining_pages; + atomic_long_set(&zone->managed_pages, remaining_pages); zone_set_nid(zone, nid); zone->name = zone_names[idx]; zone->zone_pgdat = NODE_DATA(nid); @@ -7076,7 +7076,7 @@ early_param("movablecore", cmdline_parse_movablecore); void adjust_managed_page_count(struct page *page, long count) { spin_lock(&managed_page_count_lock); - page_zone(page)->managed_pages += count; + atomic_long_add(count, &page_zone(page)->managed_pages); totalram_pages += count; #ifdef CONFIG_HIGHMEM if (PageHighMem(page)) @@ -7124,7 +7124,7 @@ void free_highmem_page(struct page *page) { __free_reserved_page(page); totalram_pages++; - page_zone(page)->managed_pages++; + atomic_long_inc(&page_zone(page)->managed_pages); totalhigh_pages++; } #endif @@ -7257,7 +7257,7 @@ static void calculate_totalreserve_pages(void) for (i = 0; i < MAX_NR_ZONES; i++) { struct zone *zone = pgdat->node_zones + i; long max = 0; - unsigned long managed_pages = zone->managed_pages; + unsigned long managed_pages = zone_managed_pages(zone); /* Find valid and maximum lowmem_reserve in the zone */ for (j = i; j < MAX_NR_ZONES; j++) { @@ -7293,7 +7293,7 @@ static void setup_per_zone_lowmem_reserve(void) for_each_online_pgdat(pgdat) { for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; - unsigned long managed_pages = zone->managed_pages; + unsigned long managed_pages = zone_managed_pages(zone); zone->lowmem_reserve[j] = 0; @@ -7311,7 +7311,7 @@ static void setup_per_zone_lowmem_reserve(void) lower_zone->lowmem_reserve[j] = managed_pages / sysctl_lowmem_reserve_ratio[idx]; } - managed_pages += lower_zone->managed_pages; + managed_pages += zone_managed_pages(lower_zone); } } } @@ -7330,14 +7330,14 @@ static void __setup_per_zone_wmarks(void) /* Calculate total number of !ZONE_HIGHMEM pages */ for_each_zone(zone) { if (!is_highmem(zone)) - lowmem_pages += zone->managed_pages; + lowmem_pages += zone_managed_pages(zone); } for_each_zone(zone) { u64 tmp; spin_lock_irqsave(&zone->lock, flags); - tmp = (u64)pages_min * zone->managed_pages; + tmp = (u64)pages_min * zone_managed_pages(zone); do_div(tmp, lowmem_pages); if (is_highmem(zone)) { /* @@ -7351,7 +7351,7 @@ static void __setup_per_zone_wmarks(void) */ unsigned long min_pages; - min_pages = zone->managed_pages / 1024; + min_pages = zone_managed_pages(zone) / 1024; min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL); zone->watermark[WMARK_MIN] = min_pages; } else { @@ -7368,7 +7368,7 @@ static void __setup_per_zone_wmarks(void) * ensure a minimum size on small systems. */ tmp = max_t(u64, tmp >> 2, - mult_frac(zone->managed_pages, + mult_frac(zone_managed_pages(zone), watermark_scale_factor, 10000)); zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; @@ -7498,8 +7498,8 @@ static void setup_min_unmapped_ratio(void) pgdat->min_unmapped_pages = 0; for_each_zone(zone) - zone->zone_pgdat->min_unmapped_pages += (zone->managed_pages * - sysctl_min_unmapped_ratio) / 100; + zone->zone_pgdat->min_unmapped_pages += (zone_managed_pages(zone) * + sysctl_min_unmapped_ratio) / 100; } @@ -7526,8 +7526,8 @@ static void setup_min_slab_ratio(void) pgdat->min_slab_pages = 0; for_each_zone(zone) - zone->zone_pgdat->min_slab_pages += (zone->managed_pages * - sysctl_min_slab_ratio) / 100; + zone->zone_pgdat->min_slab_pages += (zone_managed_pages(zone) * + sysctl_min_slab_ratio) / 100; } int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write, diff --git a/mm/vmstat.c b/mm/vmstat.c index 9c624595e904..83b30edc2f7f 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -227,7 +227,7 @@ int calculate_normal_threshold(struct zone *zone) * 125 1024 10 16-32 GB 9 */ - mem = zone->managed_pages >> (27 - PAGE_SHIFT); + mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT); threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem)); @@ -1569,7 +1569,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, high_wmark_pages(zone), zone->spanned_pages, zone->present_pages, - zone->managed_pages); + zone_managed_pages(zone)); seq_printf(m, "\n protection: (%ld", -- cgit v1.2.3-59-g8ed1b From ca79b0c211af63fa3276f0e3fd7dd9ada2439839 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:29 -0800 Subject: mm: convert totalram_pages and totalhigh_pages variables to atomic totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Suggested-by: Michal Hocko Suggested-by: Vlastimil Babka Reviewed-by: Konstantin Khlebnikov Reviewed-by: Pavel Tatashin Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/csky/mm/init.c | 4 ++-- arch/powerpc/platforms/pseries/cmm.c | 10 +++++----- arch/s390/mm/init.c | 2 +- arch/um/kernel/mem.c | 2 +- arch/x86/kernel/cpu/microcode/core.c | 2 +- drivers/char/agp/backend.c | 4 ++-- drivers/gpu/drm/i915/i915_gem.c | 2 +- drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 4 ++-- drivers/hv/hv_balloon.c | 2 +- drivers/md/dm-bufio.c | 2 +- drivers/md/dm-crypt.c | 2 +- drivers/md/dm-integrity.c | 2 +- drivers/md/dm-stats.c | 2 +- drivers/media/platform/mtk-vpu/mtk_vpu.c | 2 +- drivers/misc/vmw_balloon.c | 2 +- drivers/parisc/ccio-dma.c | 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- drivers/staging/android/ion/ion_system_heap.c | 2 +- drivers/xen/xen-selfballoon.c | 6 +++--- fs/ceph/super.h | 2 +- fs/file_table.c | 2 +- fs/fuse/inode.c | 2 +- fs/nfs/write.c | 2 +- fs/nfsd/nfscache.c | 2 +- fs/ntfs/malloc.h | 2 +- fs/proc/base.c | 2 +- include/linux/highmem.h | 28 +++++++++++++++++++++++++-- include/linux/mm.h | 27 +++++++++++++++++++++++++- include/linux/swap.h | 1 - kernel/fork.c | 2 +- kernel/kexec_core.c | 2 +- kernel/power/snapshot.c | 2 +- mm/highmem.c | 5 ++--- mm/huge_memory.c | 2 +- mm/kasan/quarantine.c | 2 +- mm/memblock.c | 4 ++-- mm/mm_init.c | 2 +- mm/oom_kill.c | 2 +- mm/page_alloc.c | 20 ++++++++++--------- mm/shmem.c | 9 +++++---- mm/slab.c | 2 +- mm/swap.c | 2 +- mm/util.c | 2 +- mm/vmalloc.c | 4 ++-- mm/workingset.c | 2 +- mm/zswap.c | 4 ++-- net/dccp/proto.c | 2 +- net/decnet/dn_route.c | 2 +- net/ipv4/tcp_metrics.c | 2 +- net/netfilter/nf_conntrack_core.c | 2 +- net/netfilter/xt_hashlimit.c | 2 +- net/sctp/protocol.c | 2 +- security/integrity/ima/ima_kexec.c | 2 +- 53 files changed, 131 insertions(+), 81 deletions(-) (limited to 'include') diff --git a/arch/csky/mm/init.c b/arch/csky/mm/init.c index dc07c078f9b8..66e597053488 100644 --- a/arch/csky/mm/init.c +++ b/arch/csky/mm/init.c @@ -71,7 +71,7 @@ void free_initrd_mem(unsigned long start, unsigned long end) ClearPageReserved(virt_to_page(start)); init_page_count(virt_to_page(start)); free_page(start); - totalram_pages++; + totalram_pages_inc(); } } #endif @@ -88,7 +88,7 @@ void free_initmem(void) ClearPageReserved(virt_to_page(addr)); init_page_count(virt_to_page(addr)); free_page(addr); - totalram_pages++; + totalram_pages_inc(); addr += PAGE_SIZE; } diff --git a/arch/powerpc/platforms/pseries/cmm.c b/arch/powerpc/platforms/pseries/cmm.c index 25427a48feae..e8d63a6a9002 100644 --- a/arch/powerpc/platforms/pseries/cmm.c +++ b/arch/powerpc/platforms/pseries/cmm.c @@ -208,7 +208,7 @@ static long cmm_alloc_pages(long nr) pa->page[pa->index++] = addr; loaned_pages++; - totalram_pages--; + totalram_pages_dec(); spin_unlock(&cmm_lock); nr--; } @@ -247,7 +247,7 @@ static long cmm_free_pages(long nr) free_page(addr); loaned_pages--; nr--; - totalram_pages++; + totalram_pages_inc(); } spin_unlock(&cmm_lock); cmm_dbg("End request with %ld pages unfulfilled\n", nr); @@ -291,7 +291,7 @@ static void cmm_get_mpp(void) int rc; struct hvcall_mpp_data mpp_data; signed long active_pages_target, page_loan_request, target; - signed long total_pages = totalram_pages + loaned_pages; + signed long total_pages = totalram_pages() + loaned_pages; signed long min_mem_pages = (min_mem_mb * 1024 * 1024) / PAGE_SIZE; rc = h_get_mpp(&mpp_data); @@ -322,7 +322,7 @@ static void cmm_get_mpp(void) cmm_dbg("delta = %ld, loaned = %lu, target = %lu, oom = %lu, totalram = %lu\n", page_loan_request, loaned_pages, loaned_pages_target, - oom_freed_pages, totalram_pages); + oom_freed_pages, totalram_pages()); } static struct notifier_block cmm_oom_nb = { @@ -581,7 +581,7 @@ static int cmm_mem_going_offline(void *arg) free_page(pa_curr->page[idx]); freed++; loaned_pages--; - totalram_pages++; + totalram_pages_inc(); pa_curr->page[idx] = pa_last->page[--pa_last->index]; if (pa_last->index == 0) { if (pa_curr == pa_last) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 76d0708438e9..50388190b393 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -59,7 +59,7 @@ static void __init setup_zero_pages(void) order = 7; /* Limit number of empty zero pages for small memory sizes */ - while (order > 2 && (totalram_pages >> 10) < (1UL << order)) + while (order > 2 && (totalram_pages() >> 10) < (1UL << order)) order--; empty_zero_page = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order); diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 2da209687a22..8d21a83dd289 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -51,7 +51,7 @@ void __init mem_init(void) /* this will put all low memory onto the freelists */ memblock_free_all(); - max_low_pfn = totalram_pages; + max_low_pfn = totalram_pages(); max_pfn = max_low_pfn; mem_init_print_info(NULL); kmalloc_ok = 1; diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 168fa272cc3e..97f9ada9ceda 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -434,7 +434,7 @@ static ssize_t microcode_write(struct file *file, const char __user *buf, size_t len, loff_t *ppos) { ssize_t ret = -EINVAL; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); if ((len >> PAGE_SHIFT) > nr_pages) { pr_err("too much data (max %ld pages)\n", nr_pages); diff --git a/drivers/char/agp/backend.c b/drivers/char/agp/backend.c index 38ffb281df97..004a3ce8ba72 100644 --- a/drivers/char/agp/backend.c +++ b/drivers/char/agp/backend.c @@ -115,9 +115,9 @@ static int agp_find_max(void) long memory, index, result; #if PAGE_SHIFT < 20 - memory = totalram_pages >> (20 - PAGE_SHIFT); + memory = totalram_pages() >> (20 - PAGE_SHIFT); #else - memory = totalram_pages << (PAGE_SHIFT - 20); + memory = totalram_pages() << (PAGE_SHIFT - 20); #endif index = 1; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index d36a9755ad91..a9de07bb72c8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2559,7 +2559,7 @@ static int i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj) * If there's no chance of allocating enough pages for the whole * object, bail early. */ - if (page_count > totalram_pages) + if (page_count > totalram_pages()) return -ENOMEM; st = kmalloc(sizeof(*st), GFP_KERNEL); diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c index 69fe86b30fbb..a9ed0ecc94e2 100644 --- a/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c +++ b/drivers/gpu/drm/i915/selftests/i915_gem_gtt.c @@ -170,7 +170,7 @@ static int igt_ppgtt_alloc(void *arg) * This should ensure that we do not run into the oomkiller during * the test and take down the machine wilfully. */ - limit = totalram_pages << PAGE_SHIFT; + limit = totalram_pages() << PAGE_SHIFT; limit = min(ppgtt->vm.total, limit); /* Check we can allocate the entire range */ @@ -1244,7 +1244,7 @@ static int exercise_mock(struct drm_i915_private *i915, u64 hole_start, u64 hole_end, unsigned long end_time)) { - const u64 limit = totalram_pages << PAGE_SHIFT; + const u64 limit = totalram_pages() << PAGE_SHIFT; struct i915_gem_context *ctx; struct i915_hw_ppgtt *ppgtt; IGT_TIMEOUT(end_time); diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index f3e7da981610..5301fef16c31 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -1090,7 +1090,7 @@ static void process_info(struct hv_dynmem_device *dm, struct dm_info_msg *msg) static unsigned long compute_balloon_floor(void) { unsigned long min_pages; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); #define MB2PAGES(mb) ((mb) << (20 - PAGE_SHIFT)) /* Simple continuous piecewiese linear function: * max MiB -> min MiB gradient diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index dc385b70e4c3..8b0b628e5d1c 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1887,7 +1887,7 @@ static int __init dm_bufio_init(void) dm_bufio_allocated_vmalloc = 0; dm_bufio_current_allocated = 0; - mem = (__u64)mult_frac(totalram_pages - totalhigh_pages, + mem = (__u64)mult_frac(totalram_pages() - totalhigh_pages(), DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT; if (mem > ULONG_MAX) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index a7195eb5b8d8..a8c32de29e3f 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -2158,7 +2158,7 @@ static int crypt_wipe_key(struct crypt_config *cc) static void crypt_calculate_pages_per_client(void) { - unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100; + unsigned long pages = (totalram_pages() - totalhigh_pages()) * DM_CRYPT_MEMORY_PERCENT / 100; if (!dm_crypt_clients_n) return; diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index d4ad0bfee251..62baa3214cc7 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -2843,7 +2843,7 @@ static int create_journal(struct dm_integrity_c *ic, char **error) journal_pages = roundup((__u64)ic->journal_sections * ic->journal_section_sectors, PAGE_SIZE >> SECTOR_SHIFT) >> (PAGE_SHIFT - SECTOR_SHIFT); journal_desc_size = journal_pages * sizeof(struct page_list); - if (journal_pages >= totalram_pages - totalhigh_pages || journal_desc_size > ULONG_MAX) { + if (journal_pages >= totalram_pages() - totalhigh_pages() || journal_desc_size > ULONG_MAX) { *error = "Journal doesn't fit into memory"; r = -ENOMEM; goto bad; diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c index 21de30b4e2a1..45b92a3d9d8e 100644 --- a/drivers/md/dm-stats.c +++ b/drivers/md/dm-stats.c @@ -85,7 +85,7 @@ static bool __check_shared_memory(size_t alloc_size) a = shared_memory_amount + alloc_size; if (a < shared_memory_amount) return false; - if (a >> PAGE_SHIFT > totalram_pages / DM_STATS_MEMORY_FACTOR) + if (a >> PAGE_SHIFT > totalram_pages() / DM_STATS_MEMORY_FACTOR) return false; #ifdef CONFIG_MMU if (a > (VMALLOC_END - VMALLOC_START) / DM_STATS_VMALLOC_FACTOR) diff --git a/drivers/media/platform/mtk-vpu/mtk_vpu.c b/drivers/media/platform/mtk-vpu/mtk_vpu.c index 616f78b24a79..b6602490a247 100644 --- a/drivers/media/platform/mtk-vpu/mtk_vpu.c +++ b/drivers/media/platform/mtk-vpu/mtk_vpu.c @@ -855,7 +855,7 @@ static int mtk_vpu_probe(struct platform_device *pdev) /* Set PTCM to 96K and DTCM to 32K */ vpu_cfg_writel(vpu, 0x2, VPU_TCM_CFG); - vpu->enable_4GB = !!(totalram_pages > (SZ_2G >> PAGE_SHIFT)); + vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT)); dev_info(dev, "4GB mode %u\n", vpu->enable_4GB); if (vpu->enable_4GB) { diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index 9b0b3fa4f836..e6126a4b95d3 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -570,7 +570,7 @@ static int vmballoon_send_get_target(struct vmballoon *b) unsigned long status; unsigned long limit; - limit = totalram_pages; + limit = totalram_pages(); /* Ensure limit fits in 32-bits */ if (limit != (u32)limit) diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c index 701a7d6a74d5..358e380eb7fa 100644 --- a/drivers/parisc/ccio-dma.c +++ b/drivers/parisc/ccio-dma.c @@ -1251,7 +1251,7 @@ ccio_ioc_init(struct ioc *ioc) ** Hot-Plug/Removal of PCI cards. (aka PCI OLARD). */ - iova_space_size = (u32) (totalram_pages / count_parisc_driver(&ccio_driver)); + iova_space_size = (u32) (totalram_pages() / count_parisc_driver(&ccio_driver)); /* limit IOVA space size to 1MB-1GB */ @@ -1290,7 +1290,7 @@ ccio_ioc_init(struct ioc *ioc) DBG_INIT("%s() hpa 0x%p mem %luMB IOV %dMB (%d bits)\n", __func__, ioc->ioc_regs, - (unsigned long) totalram_pages >> (20 - PAGE_SHIFT), + (unsigned long) totalram_pages() >> (20 - PAGE_SHIFT), iova_space_size>>20, iov_order + PAGE_SHIFT); diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index c1e599a429af..e0655949480a 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -1414,7 +1414,7 @@ sba_ioc_init(struct parisc_device *sba, struct ioc *ioc, int ioc_num) ** for DMA hints - ergo only 30 bits max. */ - iova_space_size = (u32) (totalram_pages/global_ioc_cnt); + iova_space_size = (u32) (totalram_pages()/global_ioc_cnt); /* limit IOVA space size to 1MB-1GB */ if (iova_space_size < (1 << (20 - PAGE_SHIFT))) { @@ -1439,7 +1439,7 @@ sba_ioc_init(struct parisc_device *sba, struct ioc *ioc, int ioc_num) DBG_INIT("%s() hpa 0x%lx mem %ldMB IOV %dMB (%d bits)\n", __func__, ioc->ioc_hpa, - (unsigned long) totalram_pages >> (20 - PAGE_SHIFT), + (unsigned long) totalram_pages() >> (20 - PAGE_SHIFT), iova_space_size>>20, iov_order + PAGE_SHIFT); diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c index 548bb02c0ca6..6cb0eebdff89 100644 --- a/drivers/staging/android/ion/ion_system_heap.c +++ b/drivers/staging/android/ion/ion_system_heap.c @@ -110,7 +110,7 @@ static int ion_system_heap_allocate(struct ion_heap *heap, unsigned long size_remaining = PAGE_ALIGN(size); unsigned int max_order = orders[0]; - if (size / PAGE_SIZE > totalram_pages / 2) + if (size / PAGE_SIZE > totalram_pages() / 2) return -ENOMEM; INIT_LIST_HEAD(&pages); diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c index 5165aa82bf7d..246f6122c9ee 100644 --- a/drivers/xen/xen-selfballoon.c +++ b/drivers/xen/xen-selfballoon.c @@ -189,7 +189,7 @@ static void selfballoon_process(struct work_struct *work) bool reset_timer = false; if (xen_selfballooning_enabled) { - cur_pages = totalram_pages; + cur_pages = totalram_pages(); tgt_pages = cur_pages; /* default is no change */ goal_pages = vm_memory_committed() + totalreserve_pages + @@ -227,7 +227,7 @@ static void selfballoon_process(struct work_struct *work) if (tgt_pages < floor_pages) tgt_pages = floor_pages; balloon_set_new_target(tgt_pages + - balloon_stats.current_pages - totalram_pages); + balloon_stats.current_pages - totalram_pages()); reset_timer = true; } #ifdef CONFIG_FRONTSWAP @@ -569,7 +569,7 @@ int xen_selfballoon_init(bool use_selfballooning, bool use_frontswap_selfshrink) * much more reliably and response faster in some cases. */ if (!selfballoon_reserved_mb) { - reserve_pages = totalram_pages / 10; + reserve_pages = totalram_pages() / 10; selfballoon_reserved_mb = PAGES2MB(reserve_pages); } schedule_delayed_work(&selfballoon_worker, selfballoon_interval * HZ); diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 79a265ba9200..dfb64a5211b6 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -810,7 +810,7 @@ static inline int default_congestion_kb(void) * This allows larger machines to have larger/more transfers. * Limit the default to 256M */ - congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); + congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10); if (congestion_kb > 256*1024) congestion_kb = 256*1024; diff --git a/fs/file_table.c b/fs/file_table.c index b6e9587f05c7..5679e7fcb6b0 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -380,7 +380,7 @@ void __init files_init(void) void __init files_maxfiles_init(void) { unsigned long n; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); unsigned long memreserve = (nr_pages - nr_free_pages()) * 3/2; memreserve = min(memreserve, nr_pages - 1); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 568abed20eb2..76baaa6be393 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -824,7 +824,7 @@ static const struct super_operations fuse_super_operations = { static void sanitize_global_limit(unsigned *limit) { if (*limit == 0) - *limit = ((totalram_pages << PAGE_SHIFT) >> 13) / + *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / sizeof(struct fuse_req); if (*limit >= 1 << 16) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 586726a590d8..4f15665f0ad1 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -2121,7 +2121,7 @@ int __init nfs_init_writepagecache(void) * This allows larger machines to have larger/more transfers. * Limit the default to 256M */ - nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10); + nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10); if (nfs_congestion_kb > 256*1024) nfs_congestion_kb = 256*1024; diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c index e2fe0e9ce0df..da52b594362a 100644 --- a/fs/nfsd/nfscache.c +++ b/fs/nfsd/nfscache.c @@ -99,7 +99,7 @@ static unsigned int nfsd_cache_size_limit(void) { unsigned int limit; - unsigned long low_pages = totalram_pages - totalhigh_pages; + unsigned long low_pages = totalram_pages() - totalhigh_pages(); limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10); return min_t(unsigned int, limit, 256*1024); diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h index ab172e5f51d9..5becc8acc8f4 100644 --- a/fs/ntfs/malloc.h +++ b/fs/ntfs/malloc.h @@ -47,7 +47,7 @@ static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask) return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM); /* return (void *)__get_free_page(gfp_mask); */ } - if (likely((size >> PAGE_SHIFT) < totalram_pages)) + if (likely((size >> PAGE_SHIFT) < totalram_pages())) return __vmalloc(size, gfp_mask, PAGE_KERNEL); return NULL; } diff --git a/fs/proc/base.c b/fs/proc/base.c index ce3465479447..d7fd1ca807d2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -530,7 +530,7 @@ static const struct file_operations proc_lstats_operations = { static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - unsigned long totalpages = totalram_pages + total_swap_pages; + unsigned long totalpages = totalram_pages() + total_swap_pages; unsigned long points = 0; points = oom_badness(task, NULL, NULL, totalpages) * diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 0690679832d4..ea5cdbd8c2c3 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -36,7 +36,31 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); -extern unsigned long totalhigh_pages; +extern atomic_long_t _totalhigh_pages; +static inline unsigned long totalhigh_pages(void) +{ + return (unsigned long)atomic_long_read(&_totalhigh_pages); +} + +static inline void totalhigh_pages_inc(void) +{ + atomic_long_inc(&_totalhigh_pages); +} + +static inline void totalhigh_pages_dec(void) +{ + atomic_long_dec(&_totalhigh_pages); +} + +static inline void totalhigh_pages_add(long count) +{ + atomic_long_add(count, &_totalhigh_pages); +} + +static inline void totalhigh_pages_set(long val) +{ + atomic_long_set(&_totalhigh_pages, val); +} void kmap_flush_unused(void); @@ -51,7 +75,7 @@ static inline struct page *kmap_to_page(void *addr) return virt_to_page(addr); } -#define totalhigh_pages 0UL +static inline unsigned long totalhigh_pages(void) { return 0UL; } #ifndef ARCH_HAS_KMAP static inline void *kmap(struct page *page) diff --git a/include/linux/mm.h b/include/linux/mm.h index b4d01969e700..1d2be4c2d34a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -48,7 +48,32 @@ static inline void set_max_mapnr(unsigned long limit) static inline void set_max_mapnr(unsigned long limit) { } #endif -extern unsigned long totalram_pages; +extern atomic_long_t _totalram_pages; +static inline unsigned long totalram_pages(void) +{ + return (unsigned long)atomic_long_read(&_totalram_pages); +} + +static inline void totalram_pages_inc(void) +{ + atomic_long_inc(&_totalram_pages); +} + +static inline void totalram_pages_dec(void) +{ + atomic_long_dec(&_totalram_pages); +} + +static inline void totalram_pages_add(long count) +{ + atomic_long_add(count, &_totalram_pages); +} + +static inline void totalram_pages_set(long val) +{ + atomic_long_set(&_totalram_pages, val); +} + extern void * high_memory; extern int page_cluster; diff --git a/include/linux/swap.h b/include/linux/swap.h index a8f6d5d89524..77459d695010 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -310,7 +310,6 @@ void workingset_update_node(struct xa_node *node); } while (0) /* linux/mm/page_alloc.c */ -extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; extern unsigned long nr_free_buffer_pages(void); extern unsigned long nr_free_pagecache_pages(void); diff --git a/kernel/fork.c b/kernel/fork.c index 8617a326e9f5..c979605fe806 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -744,7 +744,7 @@ void __init __weak arch_task_cache_init(void) { } static void set_max_threads(unsigned int max_threads_suggested) { u64 threads; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); /* * The number of threads shall be limited such that the thread diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 7e967ca98d92..d7140447be75 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -152,7 +152,7 @@ int sanity_check_segment_list(struct kimage *image) int i; unsigned long nr_segments = image->nr_segments; unsigned long total_pages = 0; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); /* * Verify we have good destination addresses. The caller is diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index b0308a2c6000..640b2034edd6 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -105,7 +105,7 @@ unsigned long image_size; void __init hibernate_image_size_init(void) { - image_size = ((totalram_pages * 2) / 5) * PAGE_SIZE; + image_size = ((totalram_pages() * 2) / 5) * PAGE_SIZE; } /* diff --git a/mm/highmem.c b/mm/highmem.c index 59db3223a5d6..107b10f9878e 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -105,9 +105,8 @@ static inline wait_queue_head_t *get_pkmap_wait_queue_head(unsigned int color) } #endif -unsigned long totalhigh_pages __read_mostly; -EXPORT_SYMBOL(totalhigh_pages); - +atomic_long_t _totalhigh_pages __read_mostly; +EXPORT_SYMBOL(_totalhigh_pages); EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e84a10b0d310..da6682bb69aa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -420,7 +420,7 @@ static int __init hugepage_init(void) * where the extra memory used could hurt more than TLB overhead * is likely to save. The admin can still enable it through /sys. */ - if (totalram_pages < (512 << (20 - PAGE_SHIFT))) { + if (totalram_pages() < (512 << (20 - PAGE_SHIFT))) { transparent_hugepage_flags = 0; return 0; } diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 57334ef2d7ef..978bc4a3eb51 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -237,7 +237,7 @@ void quarantine_reduce(void) * Update quarantine size in case of hotplug. Allocate a fraction of * the installed memory to quarantine minus per-cpu queue limits. */ - total_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) / + total_size = (totalram_pages() << PAGE_SHIFT) / QUARANTINE_FRACTION; percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus(); new_quarantine_size = (total_size < percpu_quarantines) ? diff --git a/mm/memblock.c b/mm/memblock.c index 0068f87af1e8..a53d8697612c 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1576,7 +1576,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size) for (; cursor < end; cursor++) { memblock_free_pages(pfn_to_page(cursor), cursor, 0); - totalram_pages++; + totalram_pages_inc(); } } @@ -1978,7 +1978,7 @@ unsigned long __init memblock_free_all(void) reset_all_zones_managed_pages(); pages = free_low_memory_core_early(); - totalram_pages += pages; + totalram_pages_add(pages); return pages; } diff --git a/mm/mm_init.c b/mm/mm_init.c index 6838a530789b..33917105a3a2 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -146,7 +146,7 @@ static void __meminit mm_compute_batch(void) s32 batch = max_t(s32, nr*2, 32); /* batch size set to 0.4% of (total memory/#cpus), or max int32 */ - memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff); + memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff); vm_committed_as_batch = max_t(s32, memsized_batch, batch); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6589f60d5018..21d487749e1d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -269,7 +269,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) } /* Default to all available memory */ - oc->totalpages = totalram_pages + total_swap_pages; + oc->totalpages = totalram_pages() + total_swap_pages; if (!IS_ENABLED(CONFIG_NUMA)) return CONSTRAINT_NONE; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4b5c4ff68f18..eb2027892ef9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -16,6 +16,7 @@ #include #include +#include #include #include #include @@ -124,7 +125,8 @@ EXPORT_SYMBOL(node_states); /* Protect totalram_pages and zone->managed_pages */ static DEFINE_SPINLOCK(managed_page_count_lock); -unsigned long totalram_pages __read_mostly; +atomic_long_t _totalram_pages __read_mostly; +EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; @@ -4747,11 +4749,11 @@ EXPORT_SYMBOL_GPL(si_mem_available); void si_meminfo(struct sysinfo *val) { - val->totalram = totalram_pages; + val->totalram = totalram_pages(); val->sharedram = global_node_page_state(NR_SHMEM); val->freeram = global_zone_page_state(NR_FREE_PAGES); val->bufferram = nr_blockdev_pages(); - val->totalhigh = totalhigh_pages; + val->totalhigh = totalhigh_pages(); val->freehigh = nr_free_highpages(); val->mem_unit = PAGE_SIZE; } @@ -7077,10 +7079,10 @@ void adjust_managed_page_count(struct page *page, long count) { spin_lock(&managed_page_count_lock); atomic_long_add(count, &page_zone(page)->managed_pages); - totalram_pages += count; + totalram_pages_add(count); #ifdef CONFIG_HIGHMEM if (PageHighMem(page)) - totalhigh_pages += count; + totalhigh_pages_add(count); #endif spin_unlock(&managed_page_count_lock); } @@ -7123,9 +7125,9 @@ EXPORT_SYMBOL(free_reserved_area); void free_highmem_page(struct page *page) { __free_reserved_page(page); - totalram_pages++; + totalram_pages_inc(); atomic_long_inc(&page_zone(page)->managed_pages); - totalhigh_pages++; + totalhigh_pages_inc(); } #endif @@ -7174,10 +7176,10 @@ void __init mem_init_print_info(const char *str) physpages << (PAGE_SHIFT - 10), codesize >> 10, datasize >> 10, rosize >> 10, (init_data_size + init_code_size) >> 10, bss_size >> 10, - (physpages - totalram_pages - totalcma_pages) << (PAGE_SHIFT - 10), + (physpages - totalram_pages() - totalcma_pages) << (PAGE_SHIFT - 10), totalcma_pages << (PAGE_SHIFT - 10), #ifdef CONFIG_HIGHMEM - totalhigh_pages << (PAGE_SHIFT - 10), + totalhigh_pages() << (PAGE_SHIFT - 10), #endif str ? ", " : "", str ? str : ""); } diff --git a/mm/shmem.c b/mm/shmem.c index b1f0f54470fb..6ece1e2fe76e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -109,13 +109,14 @@ struct shmem_falloc { #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { - return totalram_pages / 2; + return totalram_pages() / 2; } static unsigned long shmem_default_max_inodes(void) { - unsigned long nr_pages = totalram_pages; - return min(nr_pages - totalhigh_pages, nr_pages / 2); + unsigned long nr_pages = totalram_pages(); + + return min(nr_pages - totalhigh_pages(), nr_pages / 2); } #endif @@ -3302,7 +3303,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, size = memparse(value,&rest); if (*rest == '%') { size <<= PAGE_SHIFT; - size *= totalram_pages; + size *= totalram_pages(); do_div(size, 100); rest++; } diff --git a/mm/slab.c b/mm/slab.c index 01991060714c..73fe23e649c9 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1235,7 +1235,7 @@ void __init kmem_cache_init(void) * page orders on machines with more than 32MB of memory if * not overridden on the command line. */ - if (!slab_max_order_set && totalram_pages > (32 << 20) >> PAGE_SHIFT) + if (!slab_max_order_set && totalram_pages() > (32 << 20) >> PAGE_SHIFT) slab_max_order = SLAB_MAX_ORDER_HI; /* Bootstrap is tricky, because several objects are allocated diff --git a/mm/swap.c b/mm/swap.c index 5d786019eab9..4d8a1f1afaab 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1022,7 +1022,7 @@ EXPORT_SYMBOL(pagevec_lookup_range_nr_tag); */ void __init swap_setup(void) { - unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); + unsigned long megs = totalram_pages() >> (20 - PAGE_SHIFT); /* Use a smaller cluster for small-memory machines */ if (megs < 16) diff --git a/mm/util.c b/mm/util.c index 8bf08b5b5760..4df23d64aac7 100644 --- a/mm/util.c +++ b/mm/util.c @@ -593,7 +593,7 @@ unsigned long vm_commit_limit(void) if (sysctl_overcommit_kbytes) allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10); else - allowed = ((totalram_pages - hugetlb_total_pages()) + allowed = ((totalram_pages() - hugetlb_total_pages()) * sysctl_overcommit_ratio / 100); allowed += total_swap_pages; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 97d4b25d0373..871e41c55e23 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1634,7 +1634,7 @@ void *vmap(struct page **pages, unsigned int count, might_sleep(); - if (count > totalram_pages) + if (count > totalram_pages()) return NULL; size = (unsigned long)count << PAGE_SHIFT; @@ -1739,7 +1739,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, unsigned long real_size = size; size = PAGE_ALIGN(size); - if (!size || (size >> PAGE_SHIFT) > totalram_pages) + if (!size || (size >> PAGE_SHIFT) > totalram_pages()) goto fail; area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED | diff --git a/mm/workingset.c b/mm/workingset.c index d46f8c92aa2f..dcb994f2acc2 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -549,7 +549,7 @@ static int __init workingset_init(void) * double the initial memory by using totalram_pages as-is. */ timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT; - max_order = fls_long(totalram_pages - 1); + max_order = fls_long(totalram_pages() - 1); if (max_order > timestamp_bits) bucket_order = max_order - timestamp_bits; pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", diff --git a/mm/zswap.c b/mm/zswap.c index cd91fd9d96b8..a4e4d36ec085 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -219,8 +219,8 @@ static const struct zpool_ops zswap_zpool_ops = { static bool zswap_is_full(void) { - return totalram_pages * zswap_max_pool_percent / 100 < - DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); + return totalram_pages() * zswap_max_pool_percent / 100 < + DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); } static void zswap_update_total_size(void) diff --git a/net/dccp/proto.c b/net/dccp/proto.c index ff727ff61b5b..0e2f71ab8367 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1131,7 +1131,7 @@ EXPORT_SYMBOL_GPL(dccp_debug); static int __init dccp_init(void) { unsigned long goal; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int ehash_order, bhash_order, i; int rc; diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index 1c002c0fb712..950613ee7881 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1866,7 +1866,7 @@ void __init dn_route_init(void) dn_route_timer.expires = jiffies + decnet_dst_gc_interval * HZ; add_timer(&dn_route_timer); - goal = totalram_pages >> (26 - PAGE_SHIFT); + goal = totalram_pages() >> (26 - PAGE_SHIFT); for(order = 0; (1UL << order) < goal; order++) /* NOTHING */; diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index 03b51cdcc731..b467a7cabf40 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -1000,7 +1000,7 @@ static int __net_init tcp_net_metrics_init(struct net *net) slots = tcpmhash_entries; if (!slots) { - if (totalram_pages >= 128 * 1024) + if (totalram_pages() >= 128 * 1024) slots = 16 * 1024; else slots = 8 * 1024; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 5eb990830348..741b533148ba 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2248,7 +2248,7 @@ static __always_inline unsigned int total_extension_size(void) int nf_conntrack_init_start(void) { - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int max_factor = 8; int ret = -ENOMEM; int i; diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 88b520ba2abc..8d86e39d6280 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -274,7 +274,7 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg, struct xt_hashlimit_htable *hinfo; const struct seq_operations *ops; unsigned int size, i; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int ret; if (cfg->size) { diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index a5b24182b3cc..d5878ae55840 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -1368,7 +1368,7 @@ static __init int sctp_init(void) int status = -EINVAL; unsigned long goal; unsigned long limit; - unsigned long nr_pages = totalram_pages; + unsigned long nr_pages = totalram_pages(); int max_share; int order; int num_entries; diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 16bd18747cfa..d6f32807b347 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -106,7 +106,7 @@ void ima_add_kexec_buffer(struct kimage *image) kexec_segment_size = ALIGN(ima_get_binary_runtime_size() + PAGE_SIZE / 2, PAGE_SIZE); if ((kexec_segment_size == ULONG_MAX) || - ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages / 2)) { + ((kexec_segment_size >> PAGE_SHIFT) > totalram_pages() / 2)) { pr_err("Binary measurement list too large.\n"); return; } -- cgit v1.2.3-59-g8ed1b From 476567e8735a0d06225f3873a86dfa0efd95f3a5 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:32 -0800 Subject: mm: remove managed_page_count_lock spinlock Now that totalram_pages and managed_pages are atomic varibles, no need of managed_page_count spinlock. The lock had really a weak consistency guarantee. It hasn't been used for anything but the update but no reader actually cares about all the values being updated to be in sync. Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Reviewed-by: Konstantin Khlebnikov Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: David Hildenbrand Reviewed-by: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 6 ------ mm/page_alloc.c | 5 ----- 2 files changed, 11 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23e34e21178..bc0990c1f1c3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -428,12 +428,6 @@ struct zone { * Write access to present_pages at runtime should be protected by * mem_hotplug_begin/end(). Any reader who can't tolerant drift of * present_pages should get_online_mems() to get a stable value. - * - * Read access to managed_pages should be safe because it's unsigned - * long. Write access to zone->managed_pages and totalram_pages are - * protected by managed_page_count_lock at runtime. Idealy only - * adjust_managed_page_count() should be used instead of directly - * touching zone->managed_pages and totalram_pages. */ atomic_long_t managed_pages; unsigned long spanned_pages; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eb2027892ef9..6f3d2c7af84b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -122,9 +122,6 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { }; EXPORT_SYMBOL(node_states); -/* Protect totalram_pages and zone->managed_pages */ -static DEFINE_SPINLOCK(managed_page_count_lock); - atomic_long_t _totalram_pages __read_mostly; EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; @@ -7077,14 +7074,12 @@ early_param("movablecore", cmdline_parse_movablecore); void adjust_managed_page_count(struct page *page, long count) { - spin_lock(&managed_page_count_lock); atomic_long_add(count, &page_zone(page)->managed_pages); totalram_pages_add(count); #ifdef CONFIG_HIGHMEM if (PageHighMem(page)) totalhigh_pages_add(count); #endif - spin_unlock(&managed_page_count_lock); } EXPORT_SYMBOL(adjust_managed_page_count); -- cgit v1.2.3-59-g8ed1b From 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:34:36 -0800 Subject: vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail zone_reclaim() as full") changed the return value of node_reclaim(). The original return value 0 means NODE_RECLAIM_SOME after this commit. While the return value of node_reclaim() when CONFIG_NUMA is n is not changed. This will leads to call zone_watermark_ok() again. This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN. Since node_reclaim() is only called in page_alloc.c, move it to mm/internal.h. Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: Matthew Wilcox Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 6 ------ mm/internal.h | 10 ++++++++++ 2 files changed, 10 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index 77459d695010..f9e576a2c188 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -359,14 +359,8 @@ extern unsigned long vm_total_pages; extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); #else #define node_reclaim_mode 0 -static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, - unsigned int order) -{ - return 0; -} #endif extern int page_evictable(struct page *page); diff --git a/mm/internal.h b/mm/internal.h index 291eb2b6d1d8..6a57811ae47d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -444,6 +444,16 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn, #define NODE_RECLAIM_SOME 0 #define NODE_RECLAIM_SUCCESS 1 +#ifdef CONFIG_NUMA +extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); +#else +static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, + unsigned int order) +{ + return NODE_RECLAIM_NOSCAN; +} +#endif + extern int hwpoison_filter(struct page *p); extern u32 hwpoison_filter_dev_major; -- cgit v1.2.3-59-g8ed1b From 66f71da9dd38af17dc17209cdde7987d4679a699 Mon Sep 17 00:00:00 2001 From: Aaron Lu Date: Fri, 28 Dec 2018 00:34:39 -0800 Subject: mm/swap: use nr_node_ids for avail_lists in swap_info_struct Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu Reviewed-by: Andrew Morton Acked-by: Michal Hocko Cc: Vasily Averin Cc: Huang Ying Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 11 ++++++++++- mm/swapfile.c | 3 ++- 2 files changed, 12 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index f9e576a2c188..622025ac1461 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -235,7 +235,6 @@ struct swap_info_struct { unsigned long flags; /* SWP_USED etc: see above */ signed short prio; /* swap priority of this type */ struct plist_node list; /* entry in swap_active_head */ - struct plist_node avail_lists[MAX_NUMNODES];/* entry in swap_avail_heads */ signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ @@ -276,6 +275,16 @@ struct swap_info_struct { */ struct work_struct discard_work; /* discard worker */ struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct plist_node avail_lists[0]; /* + * entries in swap_avail_heads, one + * entry per node. + * Must be last as the number of the + * array is nr_node_ids, which is not + * a fixed value so have to allocate + * dynamically. + * And it has to be an array so that + * plist_for_each_* can work. + */ }; #ifdef CONFIG_64BIT diff --git a/mm/swapfile.c b/mm/swapfile.c index 8688ae65ef58..6e06821623f6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2812,8 +2812,9 @@ static struct swap_info_struct *alloc_swap_info(void) struct swap_info_struct *p; unsigned int type; int i; + int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); - p = kvzalloc(sizeof(*p), GFP_KERNEL); + p = kvzalloc(size, GFP_KERNEL); if (!p) return ERR_PTR(-ENOMEM); -- cgit v1.2.3-59-g8ed1b From a95c90f1e2c253b280385ecf3d4ebfe476926b28 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:34:57 -0800 Subject: mm, devm_memremap_pages: fix shutdown handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...") Reviewed-by: "Jérôme Glisse" Reported-by: Logan Gunthorpe Reviewed-by: Logan Gunthorpe Reviewed-by: Christoph Hellwig Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/dax/pmem.c | 14 +++----------- drivers/nvdimm/pmem.c | 13 +++++-------- include/linux/memremap.h | 2 ++ kernel/memremap.c | 30 ++++++++++++++---------------- tools/testing/nvdimm/test/iomap.c | 15 ++++++++++++++- 5 files changed, 38 insertions(+), 36 deletions(-) (limited to 'include') diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 99e2aace8078..2c1f459c0c63 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *data) percpu_ref_exit(ref); } -static void dax_pmem_percpu_kill(void *data) +static void dax_pmem_percpu_kill(struct percpu_ref *ref) { - struct percpu_ref *ref = data; struct dax_pmem *dax_pmem = to_dax_pmem(ref); dev_dbg(dax_pmem->dev, "trace\n"); @@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device *dev) } dax_pmem->pgmap.ref = &dax_pmem->ref; + dax_pmem->pgmap.kill = dax_pmem_percpu_kill; addr = devm_memremap_pages(dev, &dax_pmem->pgmap); - if (IS_ERR(addr)) { - devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); - percpu_ref_exit(&dax_pmem->ref); + if (IS_ERR(addr)) return PTR_ERR(addr); - } - - rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill, - &dax_pmem->ref); - if (rc) - return rc; /* adjust the dax_region resource to the start of data */ memcpy(&res, &dax_pmem->pgmap.res, sizeof(res)); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 0e39e3d1846f..d28418b05a04 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -309,8 +309,11 @@ static void pmem_release_queue(void *q) blk_cleanup_queue(q); } -static void pmem_freeze_queue(void *q) +static void pmem_freeze_queue(struct percpu_ref *ref) { + struct request_queue *q; + + q = container_of(ref, typeof(*q), q_usage_counter); blk_freeze_queue_start(q); } @@ -402,6 +405,7 @@ static int pmem_attach_disk(struct device *dev, pmem->pfn_flags = PFN_DEV; pmem->pgmap.ref = &q->q_usage_counter; + pmem->pgmap.kill = pmem_freeze_queue; if (is_nd_pfn(dev)) { if (setup_pagemap_fsdax(dev, &pmem->pgmap)) return -ENOMEM; @@ -427,13 +431,6 @@ static int pmem_attach_disk(struct device *dev, memcpy(&bb_res, &nsio->res, sizeof(bb_res)); } - /* - * At release time the queue must be frozen before - * devm_memremap_pages is unwound - */ - if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) - return -ENOMEM; - if (IS_ERR(addr)) return PTR_ERR(addr); pmem->virt_addr = addr; diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 0ac69ddf5fc4..55db66b3716f 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref * @ref: reference count that pins the devm_memremap_pages() mapping + * @kill: callback to transition @ref to the dead state * @dev: host device of the mapping for debug * @data: private data pointer for page_free() * @type: memory type: see MEMORY_* in memory_hotplug.h @@ -122,6 +123,7 @@ struct dev_pagemap { bool altmap_valid; struct resource res; struct percpu_ref *ref; + void (*kill)(struct percpu_ref *ref); struct device *dev; void *data; enum memory_type type; diff --git a/kernel/memremap.c b/kernel/memremap.c index 99d14940acfa..5e45f0c327a5 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -88,14 +88,10 @@ static void devm_memremap_pages_release(void *data) resource_size_t align_start, align_size; unsigned long pfn; + pgmap->kill(pgmap->ref); for_each_device_pfn(pfn, pgmap) put_page(pfn_to_page(pfn)); - if (percpu_ref_tryget_live(pgmap->ref)) { - dev_WARN(dev, "%s: page mapping is still live!\n", __func__); - percpu_ref_put(pgmap->ref); - } - /* pages are dead and unused, undo the arch mapping */ align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) @@ -116,7 +112,7 @@ static void devm_memremap_pages_release(void *data) /** * devm_memremap_pages - remap and provide memmap backing for the given resource * @dev: hosting device for @res - * @pgmap: pointer to a struct dev_pgmap + * @pgmap: pointer to a struct dev_pagemap * * Notes: * 1/ At a minimum the res, ref and type members of @pgmap must be initialized @@ -125,11 +121,8 @@ static void devm_memremap_pages_release(void *data) * 2/ The altmap field may optionally be initialized, in which case altmap_valid * must be set to true * - * 3/ pgmap.ref must be 'live' on entry and 'dead' before devm_memunmap_pages() - * time (or devm release event). The expected order of events is that ref has - * been through percpu_ref_kill() before devm_memremap_pages_release(). The - * wait for the completion of all references being dropped and - * percpu_ref_exit() must occur after devm_memremap_pages_release(). + * 3/ pgmap->ref must be 'live' on entry and will be killed at + * devm_memremap_pages_release() time, or if this routine fails. * * 4/ res is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. not a device mmio range, but @@ -145,6 +138,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) pgprot_t pgprot = PAGE_KERNEL; int error, nid, is_ram; + if (!pgmap->ref || !pgmap->kill) + return ERR_PTR(-EINVAL); + align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) - align_start; @@ -170,12 +166,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) if (is_ram != REGION_DISJOINT) { WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__, is_ram == REGION_MIXED ? "mixed" : "ram", res); - return ERR_PTR(-ENXIO); + error = -ENXIO; + goto err_array; } - if (!pgmap->ref) - return ERR_PTR(-EINVAL); - pgmap->dev = dev; error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start), @@ -217,7 +211,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) align_size >> PAGE_SHIFT, pgmap); percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap)); - devm_add_action(dev, devm_memremap_pages_release, pgmap); + error = devm_add_action_or_reset(dev, devm_memremap_pages_release, + pgmap); + if (error) + return ERR_PTR(error); return __va(res->start); @@ -228,6 +225,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) err_pfn_remap: pgmap_array_delete(res); err_array: + pgmap->kill(pgmap->ref); return ERR_PTR(error); } EXPORT_SYMBOL_GPL(devm_memremap_pages); diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index ed18a0cbc0c8..c6635fee27d8 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -104,13 +104,26 @@ void *__wrap_devm_memremap(struct device *dev, resource_size_t offset, } EXPORT_SYMBOL(__wrap_devm_memremap); +static void nfit_test_kill(void *_pgmap) +{ + struct dev_pagemap *pgmap = _pgmap; + + pgmap->kill(pgmap->ref); +} + void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) { resource_size_t offset = pgmap->res.start; struct nfit_test_resource *nfit_res = get_nfit_res(offset); - if (nfit_res) + if (nfit_res) { + int rc; + + rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap); + if (rc) + return ERR_PTR(rc); return nfit_res->buf + offset - nfit_res->res.start; + } return devm_memremap_pages(dev, pgmap); } EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages); -- cgit v1.2.3-59-g8ed1b From 58ef15b765af0d2cbe6799ec564f1dc485010ab8 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:35:07 -0800 Subject: mm, hmm: use devm semantics for hmm_devmem_{add, remove} MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reviewed-by: Christoph Hellwig Reviewed-by: Jérôme Glisse Cc: "Jérôme Glisse" Cc: Logan Gunthorpe Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hmm.h | 4 +- mm/hmm.c | 127 ++++++++++------------------------------------------ 2 files changed, 25 insertions(+), 106 deletions(-) (limited to 'include') diff --git a/include/linux/hmm.h b/include/linux/hmm.h index c6fb869a81c0..ed89fbc525d2 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -512,8 +512,7 @@ struct hmm_devmem { * enough and allocate struct page for it. * * The device driver can wrap the hmm_devmem struct inside a private device - * driver struct. The device driver must call hmm_devmem_remove() before the - * device goes away and before freeing the hmm_devmem struct memory. + * driver struct. */ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct device *device, @@ -521,7 +520,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, struct device *device, struct resource *res); -void hmm_devmem_remove(struct hmm_devmem *devmem); /* * hmm_devmem_page_set_drvdata - set per-page driver data field diff --git a/mm/hmm.c b/mm/hmm.c index 90c34f3d1243..8510881e7b44 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -987,7 +987,6 @@ static void hmm_devmem_ref_exit(void *data) devmem = container_of(ref, struct hmm_devmem, ref); percpu_ref_exit(ref); - devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data); } static void hmm_devmem_ref_kill(void *data) @@ -998,7 +997,6 @@ static void hmm_devmem_ref_kill(void *data) devmem = container_of(ref, struct hmm_devmem, ref); percpu_ref_kill(ref); wait_for_completion(&devmem->completion); - devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data); } static int hmm_devmem_fault(struct vm_area_struct *vma, @@ -1036,7 +1034,7 @@ static void hmm_devmem_radix_release(struct resource *resource) mutex_unlock(&hmm_devmem_lock); } -static void hmm_devmem_release(struct device *dev, void *data) +static void hmm_devmem_release(void *data) { struct hmm_devmem *devmem = data; struct resource *resource = devmem->resource; @@ -1044,11 +1042,6 @@ static void hmm_devmem_release(struct device *dev, void *data) struct zone *zone; struct page *page; - if (percpu_ref_tryget_live(&devmem->ref)) { - dev_WARN(dev, "%s: page mapping is still live!\n", __func__); - percpu_ref_put(&devmem->ref); - } - /* pages are dead and unused, undo the arch mapping */ start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT; npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT; @@ -1174,19 +1167,6 @@ error: return ret; } -static int hmm_devmem_match(struct device *dev, void *data, void *match_data) -{ - struct hmm_devmem *devmem = data; - - return devmem->resource == match_data; -} - -static void hmm_devmem_pages_remove(struct hmm_devmem *devmem) -{ - devres_release(devmem->device, &hmm_devmem_release, - &hmm_devmem_match, devmem->resource); -} - /* * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory * @@ -1214,8 +1194,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, dev_pagemap_get_ops(); - devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem), - GFP_KERNEL, dev_to_node(device)); + devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL); if (!devmem) return ERR_PTR(-ENOMEM); @@ -1229,11 +1208,11 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release, 0, GFP_KERNEL); if (ret) - goto error_percpu_ref; + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, &devmem->ref); if (ret) - goto error_devm_add_action; + return ERR_PTR(ret); size = ALIGN(size, PA_SECTION_SIZE); addr = min((unsigned long)iomem_resource.end, @@ -1253,16 +1232,12 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, devmem->resource = devm_request_mem_region(device, addr, size, dev_name(device)); - if (!devmem->resource) { - ret = -ENOMEM; - goto error_no_resource; - } + if (!devmem->resource) + return ERR_PTR(-ENOMEM); break; } - if (!devmem->resource) { - ret = -ERANGE; - goto error_no_resource; - } + if (!devmem->resource) + return ERR_PTR(-ERANGE); devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY; devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; @@ -1271,28 +1246,13 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, ret = hmm_devmem_pages_create(devmem); if (ret) - goto error_pages; - - devres_add(device, devmem); + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref); - if (ret) { - hmm_devmem_remove(devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); + if (ret) return ERR_PTR(ret); - } return devmem; - -error_pages: - devm_release_mem_region(device, devmem->resource->start, - resource_size(devmem->resource)); -error_no_resource: -error_devm_add_action: - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); -error_percpu_ref: - devres_free(devmem); - return ERR_PTR(ret); } EXPORT_SYMBOL(hmm_devmem_add); @@ -1308,8 +1268,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, dev_pagemap_get_ops(); - devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem), - GFP_KERNEL, dev_to_node(device)); + devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL); if (!devmem) return ERR_PTR(-ENOMEM); @@ -1323,12 +1282,12 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release, 0, GFP_KERNEL); if (ret) - goto error_percpu_ref; + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, + &devmem->ref); if (ret) - goto error_devm_add_action; - + return ERR_PTR(ret); devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + @@ -1336,59 +1295,21 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, ret = hmm_devmem_pages_create(devmem); if (ret) - goto error_devm_add_action; + return ERR_PTR(ret); - devres_add(device, devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); + if (ret) + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref); - if (ret) { - hmm_devmem_remove(devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_kill, + &devmem->ref); + if (ret) return ERR_PTR(ret); - } return devmem; - -error_devm_add_action: - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); -error_percpu_ref: - devres_free(devmem); - return ERR_PTR(ret); } EXPORT_SYMBOL(hmm_devmem_add_resource); -/* - * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE) - * - * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory - * - * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf - * of the device driver. It will free struct page and remove the resource that - * reserved the physical address range for this device memory. - */ -void hmm_devmem_remove(struct hmm_devmem *devmem) -{ - resource_size_t start, size; - struct device *device; - bool cdm = false; - - if (!devmem) - return; - - device = devmem->device; - start = devmem->resource->start; - size = resource_size(devmem->resource); - - cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY; - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); - hmm_devmem_pages_remove(devmem); - - if (!cdm) - devm_release_mem_region(device, start, size); -} -EXPORT_SYMBOL(hmm_devmem_remove); - /* * A device driver that wants to handle multiple devices memory through a * single fake device can use hmm_device to do so. This is purely a helper -- cgit v1.2.3-59-g8ed1b From 4d72868c8f7c293fc8408a54db4e0a12dc031152 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Fri, 28 Dec 2018 00:35:29 -0800 Subject: memblock: replace usage of __memblock_free_early() with memblock_free() __memblock_free_early() is only used by the convenience wrappers, so essentially we wrap a call to memblock_free() twice. Replace calls of __memblock_free_early() with calls to memblock_free() and drop the former. Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport Reviewed-by: Andrew Morton Cc: Wentao Wang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 5 ++--- mm/memblock.c | 22 ++++++++-------------- 2 files changed, 10 insertions(+), 17 deletions(-) (limited to 'include') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index aee299a6aa76..5f74ba623dbd 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -154,7 +154,6 @@ void __next_mem_range_rev(u64 *idx, int nid, enum memblock_flags flags, void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, phys_addr_t *out_end); -void __memblock_free_early(phys_addr_t base, phys_addr_t size); void __memblock_free_late(phys_addr_t base, phys_addr_t size); /** @@ -414,13 +413,13 @@ static inline void * __init memblock_alloc_node_nopanic(phys_addr_t size, static inline void __init memblock_free_early(phys_addr_t base, phys_addr_t size) { - __memblock_free_early(base, size); + memblock_free(base, size); } static inline void __init memblock_free_early_nid(phys_addr_t base, phys_addr_t size, int nid) { - __memblock_free_early(base, size); + memblock_free(base, size); } static inline void __init memblock_free_late(phys_addr_t base, phys_addr_t size) diff --git a/mm/memblock.c b/mm/memblock.c index 207058b6891b..f57d7620668b 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -800,7 +800,14 @@ int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size) return memblock_remove_range(&memblock.memory, base, size); } - +/** + * memblock_free - free boot memory block + * @base: phys starting address of the boot memory block + * @size: size of the boot memory block in bytes + * + * Free boot memory block previously allocated by memblock_alloc_xx() API. + * The freeing memory will not be released to the buddy allocator. + */ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size) { phys_addr_t end = base + size - 1; @@ -1536,19 +1543,6 @@ void * __init memblock_alloc_try_nid( return NULL; } -/** - * __memblock_free_early - free boot memory block - * @base: phys starting address of the boot memory block - * @size: size of the boot memory block in bytes - * - * Free boot memory block previously allocated by memblock_alloc_xx() API. - * The freeing memory will not be released to the buddy allocator. - */ -void __init __memblock_free_early(phys_addr_t base, phys_addr_t size) -{ - memblock_free(base, size); -} - /** * __memblock_free_late - free bootmem block pages directly to buddy allocator * @base: phys starting address of the boot memory block -- cgit v1.2.3-59-g8ed1b From f29d8e9c0191a2a02500945db505e5c89159c3f4 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 28 Dec 2018 00:35:36 -0800 Subject: mm/memory_hotplug: drop "online" parameter from add_memory_resource() Userspace should always be in charge of how to online memory and if memory should be onlined automatically in the kernel. Let's drop the parameter to overwrite this - XEN passes memhp_auto_online, just like add_memory(), so we can directly use that instead internally. Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Michal Hocko Reviewed-by: Oscar Salvador Acked-by: Juergen Gross Cc: Boris Ostrovsky Cc: Stefano Stabellini Cc: Dan Williams Cc: Pavel Tatashin Cc: David Hildenbrand Cc: Joonsoo Kim Cc: Arun KS Cc: Mathieu Malaterre Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/xen/balloon.c | 2 +- include/linux/memory_hotplug.h | 2 +- mm/memory_hotplug.c | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 221b7333d067..ceb5048de9a7 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -352,7 +352,7 @@ static enum bp_state reserve_additional_memory(void) mutex_unlock(&balloon_mutex); /* add_memory_resource() requires the device_hotplug lock */ lock_device_hotplug(); - rc = add_memory_resource(nid, resource, memhp_auto_online); + rc = add_memory_resource(nid, resource); unlock_device_hotplug(); mutex_lock(&balloon_mutex); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index ffd9cd10fcf3..7383a7a76d69 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -326,7 +326,7 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, void *arg, int (*func)(struct memory_block *, void *)); extern int __add_memory(int nid, u64 start, u64 size); extern int add_memory(int nid, u64 start, u64 size); -extern int add_memory_resource(int nid, struct resource *resource, bool online); +extern int add_memory_resource(int nid, struct resource *resource); extern int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 935eb332bbb4..6258e0e923cc 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1078,7 +1078,7 @@ static int online_memory_block(struct memory_block *mem, void *arg) * * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ -int __ref add_memory_resource(int nid, struct resource *res, bool online) +int __ref add_memory_resource(int nid, struct resource *res) { u64 start, size; bool new_node = false; @@ -1133,7 +1133,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online) mem_hotplug_done(); /* online pages if requested */ - if (online) + if (memhp_auto_online) walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL, online_memory_block); @@ -1157,7 +1157,7 @@ int __ref __add_memory(int nid, u64 start, u64 size) if (IS_ERR(res)) return PTR_ERR(res); - ret = add_memory_resource(nid, res, memhp_auto_online); + ret = add_memory_resource(nid, res); if (ret < 0) release_memory_resource(res); return ret; -- cgit v1.2.3-59-g8ed1b From a921444382b49cc7fdeca3fba3e278bc09484a27 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:44 -0800 Subject: mm: move zone watermark accesses behind an accessor This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 9 +++++---- mm/compaction.c | 2 +- mm/page_alloc.c | 12 ++++++------ 3 files changed, 12 insertions(+), 11 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bc0990c1f1c3..dcf1b66a96ab 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,9 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->watermark[WMARK_HIGH]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) +#define wmark_pages(z, i) (z->_watermark[i]) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -362,7 +363,7 @@ struct zone { /* Read-mostly fields */ /* zone watermarks, access with *_wmark_pages(zone) macros */ - unsigned long watermark[NR_WMARK]; + unsigned long _watermark[NR_WMARK]; unsigned long nr_reserved_highatomic; diff --git a/mm/compaction.c b/mm/compaction.c index 7c607479de4a..ef29490b0f46 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1431,7 +1431,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order, if (is_via_compact_memory(order)) return COMPACT_CONTINUE; - watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); /* * If watermarks for high-order allocation are already met, there * should be no need for compaction at all. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 251b8a0c9c5d..2046e333ea8f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3376,7 +3376,7 @@ retry: } } - mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); if (!zone_watermark_fast(zone, order, mark, ac_classzone_idx(ac), alloc_flags)) { int ret; @@ -4793,7 +4793,7 @@ long si_mem_available(void) pages[lru] = global_node_page_state(NR_LRU_BASE + lru); for_each_zone(zone) - wmark_low += zone->watermark[WMARK_LOW]; + wmark_low += low_wmark_pages(zone); /* * Estimate the amount of memory available for userspace allocations, @@ -7431,13 +7431,13 @@ static void __setup_per_zone_wmarks(void) min_pages = zone_managed_pages(zone) / 1024; min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL); - zone->watermark[WMARK_MIN] = min_pages; + zone->_watermark[WMARK_MIN] = min_pages; } else { /* * If it's a lowmem zone, reserve a number of pages * proportionate to the zone's size. */ - zone->watermark[WMARK_MIN] = tmp; + zone->_watermark[WMARK_MIN] = tmp; } /* @@ -7449,8 +7449,8 @@ static void __setup_per_zone_wmarks(void) mult_frac(zone_managed_pages(zone), watermark_scale_factor, 10000)); - zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; - zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; + zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; spin_unlock_irqrestore(&zone->lock, flags); } -- cgit v1.2.3-59-g8ed1b From 1c30844d2dfe272d58c8fc000960b835d13aa2ac Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:52 -0800 Subject: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 21 +++++++ include/linux/mm.h | 1 + include/linux/mmzone.h | 11 ++-- kernel/sysctl.c | 8 +++ mm/page_alloc.c | 43 +++++++++++++- mm/vmscan.c | 133 +++++++++++++++++++++++++++++++++++++++++--- 6 files changed, 202 insertions(+), 15 deletions(-) (limited to 'include') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7d73882e2c27..187ce4f599a2 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure +- watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -856,6 +857,26 @@ ten times more freeable objects than there are. ============================================================= +watermark_boost_factor: + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor parameter, +the unit is in fractions of 10,000. The default value of 15,000 means +that up to 150% of the high watermark will be reclaimed in the event of +a pageblock being mixed due to fragmentation. The level of reclaim is +determined by the number of fragmentation events that occurred in the +recent past. If this value is smaller than a pageblock then a pageblocks +worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor +of 0 will disable the feature. + +============================================================= + watermark_scale_factor: This factor controls the aggressiveness of kswapd. It defines the diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d2be4c2d34a..031b2ce983f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone); /* page_alloc.c */ extern int min_free_kbytes; +extern int watermark_boost_factor; extern int watermark_scale_factor; /* nommu.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dcf1b66a96ab..5b4bfb90fb94 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,10 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) -#define wmark_pages(z, i) (z->_watermark[i]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -364,6 +364,7 @@ struct zone { /* zone watermarks, access with *_wmark_pages(zone) macros */ unsigned long _watermark[NR_WMARK]; + unsigned long watermark_boost; unsigned long nr_reserved_highatomic; @@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone) struct ctl_table; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 5fc724e4e454..1825f712e73b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = { .proc_handler = min_free_kbytes_sysctl_handler, .extra1 = &zero, }, + { + .procname = "watermark_boost_factor", + .data = &watermark_boost_factor, + .maxlen = sizeof(watermark_boost_factor), + .mode = 0644, + .proc_handler = watermark_boost_factor_sysctl_handler, + .extra1 = &zero, + }, { .procname = "watermark_scale_factor", .data = &watermark_scale_factor, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 32b3e121a388..80373eca453d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; +int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __meminitdata; @@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } +static inline void boost_watermark(struct zone *zone) +{ + unsigned long max_boost; + + if (!watermark_boost_factor) + return; + + max_boost = mult_frac(zone->_watermark[WMARK_HIGH], + watermark_boost_factor, 10000); + max_boost = max(pageblock_nr_pages, max_boost); + + zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, + max_boost); +} + /* * This function implements actual steal behaviour. If order is large enough, * we can steal whole pageblock. If not, we first move freepages in this @@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt) * itself, so pages freed in the future will be put on the correct free list. */ static void steal_suitable_fallback(struct zone *zone, struct page *page, - int start_type, bool whole_block) + unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); struct free_area *area; @@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; } + /* + * Boost watermarks to increase reclaim pressure to reduce the + * likelihood of future fallbacks. Wake kswapd now as the node + * may be balanced overall and kswapd will not wake naturally. + */ + boost_watermark(zone); + if (alloc_flags & ALLOC_KSWAPD) + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2443,7 +2468,8 @@ do_steal: page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, start_migratetype, can_steal); + steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, + can_steal); trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); @@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->watermark_boost = 0; spin_unlock_irqrestore(&zone->lock, flags); } @@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write, return 0; } +int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 24ab1f7394ab..bd8971a29204 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -88,6 +88,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* e.g. boosted watermark reclaim leaves slabs alone */ + unsigned int may_shrinkslab:1; + /* * Cgroups are not reclaimed below their configured memory.low, * unless we threaten to OOM. If any cgroups are skipped due to @@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - shrink_slab(sc->gfp_mask, pgdat->node_id, + if (sc->may_shrinkslab) { + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .may_shrinkslab = 1, }; /* @@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, .may_unmap = 1, .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, + .may_shrinkslab = 1, }; unsigned long lru_pages; @@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = may_swap, + .may_shrinkslab = 1, }; /* @@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the time time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced. + */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; @@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. + */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. + */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + sc.may_shrinkslab = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. Wake up kcompactd -- cgit v1.2.3-59-g8ed1b From c999fbd3dcc6535b1e298b016665ec23ac2b0a9a Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:35:55 -0800 Subject: mm/mmzone.c: make "migratetype_names" const char * Those strings are immutable in fact. Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 +- mm/page_alloc.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5b4bfb90fb94..e0c3bc2edbbd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -65,7 +65,7 @@ enum migratetype { }; /* In mm/page_alloc.c; keep in sync also with show_migration_types() there */ -extern char * const migratetype_names[MIGRATE_TYPES]; +extern const char * const migratetype_names[MIGRATE_TYPES]; #ifdef CONFIG_CMA # define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 80373eca453d..4115d7f20223 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -236,7 +236,7 @@ static char * const zone_names[MAX_NR_ZONES] = { #endif }; -char * const migratetype_names[MIGRATE_TYPES] = { +const char * const migratetype_names[MIGRATE_TYPES] = { "Unmovable", "Movable", "Reclaimable", -- cgit v1.2.3-59-g8ed1b From 9a2f45ff320287d49a3cd90ce68cb58a6da6f5e1 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:35:59 -0800 Subject: mm/debug.c: make "migrate_reason_names[]" const char * Those strings are immutable as well. Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 2 +- mm/debug.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/migrate.h b/include/linux/migrate.h index f2b4abbca55e..617615fa11ce 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -29,7 +29,7 @@ enum migrate_reason { }; /* In mm/debug.c; also keep sync with include/trace/events/migrate.h */ -extern char *migrate_reason_names[MR_TYPES]; +extern const char *migrate_reason_names[MR_TYPES]; static inline struct page *new_page_nodemask(struct page *page, int preferred_nid, nodemask_t *nodemask) diff --git a/mm/debug.c b/mm/debug.c index 72daa4b087ba..0abb987dad9b 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -17,7 +17,7 @@ #include "internal.h" -char *migrate_reason_names[MR_TYPES] = { +const char *migrate_reason_names[MR_TYPES] = { "compaction", "memory_failure", "memory_hotplug", -- cgit v1.2.3-59-g8ed1b From e5cb113f2dbc8125f31005faebab161a2a84ebe6 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:36:03 -0800 Subject: mm: make free_reserved_area() return "const char *" and propagate through down the call stack. Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/include/asm/processor.h | 2 +- arch/x86/mm/init.c | 2 +- include/linux/mm.h | 2 +- mm/page_alloc.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 071b2a6fff85..33051436c864 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -967,7 +967,7 @@ static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves) } extern unsigned long arch_align_stack(unsigned long sp); -extern void free_init_pages(char *what, unsigned long begin, unsigned long end); +void free_init_pages(const char *what, unsigned long begin, unsigned long end); extern void free_kernel_image_pages(void *begin, void *end); void default_idle(void); diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 427a955a2cf2..f905a2371080 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -742,7 +742,7 @@ int devmem_is_allowed(unsigned long pagenr) return 1; } -void free_init_pages(char *what, unsigned long begin, unsigned long end) +void free_init_pages(const char *what, unsigned long begin, unsigned long end) { unsigned long begin_aligned, end_aligned; diff --git a/include/linux/mm.h b/include/linux/mm.h index 031b2ce983f9..9963f77f1101 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2108,7 +2108,7 @@ extern void free_initmem(void); * Return pages freed into the buddy system. */ extern unsigned long free_reserved_area(void *start, void *end, - int poison, char *s); + int poison, const char *s); #ifdef CONFIG_HIGHMEM /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4115d7f20223..e97ebaf5ba26 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7193,7 +7193,7 @@ void adjust_managed_page_count(struct page *page, long count) } EXPORT_SYMBOL(adjust_managed_page_count); -unsigned long free_reserved_area(void *start, void *end, int poison, char *s) +unsigned long free_reserved_area(void *start, void *end, int poison, const char *s) { void *pos; unsigned long pages = 0; -- cgit v1.2.3-59-g8ed1b From ef8444ea01d7442652f8e1b8a8b94278cb57eafd Mon Sep 17 00:00:00 2001 From: yuzhoujian Date: Fri, 28 Dec 2018 00:36:07 -0800 Subject: mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian Acked-by: Michal Hocko Cc: Andrea Arcangeli Cc: David Rientjes Cc: "Kirill A . Shutemov" Cc: Roman Gushchin Cc: Tetsuo Handa Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 10 ++++++++++ kernel/cgroup/cpuset.c | 4 ++-- mm/oom_kill.c | 29 ++++++++++++++++++++--------- mm/page_alloc.c | 4 ++-- 4 files changed, 34 insertions(+), 13 deletions(-) (limited to 'include') diff --git a/include/linux/oom.h b/include/linux/oom.h index 69864a547663..d07992009265 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -15,6 +15,13 @@ struct notifier_block; struct mem_cgroup; struct task_struct; +enum oom_constraint { + CONSTRAINT_NONE, + CONSTRAINT_CPUSET, + CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, +}; + /* * Details of the page allocation that triggered the oom killer that are used to * determine what should be killed. @@ -42,6 +49,9 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + + /* Used to print the constraint info. */ + enum oom_constraint constraint; }; extern struct mutex oom_lock; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 266f10cb7222..9510a5b32eaf 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2666,9 +2666,9 @@ void cpuset_print_current_mems_allowed(void) rcu_read_lock(); cgrp = task_cs(current)->css.cgroup; - pr_info("%s cpuset=", current->comm); + pr_cont(",cpuset="); pr_cont_cgroup_name(cgrp); - pr_cont(" mems_allowed=%*pbl\n", + pr_cont(",mems_allowed=%*pbl", nodemask_pr_args(¤t->mems_allowed)); rcu_read_unlock(); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 21d487749e1d..d90253f1ff93 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -245,11 +245,11 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, return points > 0 ? points : 1; } -enum oom_constraint { - CONSTRAINT_NONE, - CONSTRAINT_CPUSET, - CONSTRAINT_MEMORY_POLICY, - CONSTRAINT_MEMCG, +static const char * const oom_constraint_text[] = { + [CONSTRAINT_NONE] = "CONSTRAINT_NONE", + [CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET", + [CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY", + [CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG", }; /* @@ -428,16 +428,25 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) rcu_read_unlock(); } +static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim) +{ + /* one line summary of the oom killer context. */ + pr_info("oom-kill:constraint=%s,nodemask=%*pbl", + oom_constraint_text[oc->constraint], + nodemask_pr_args(oc->nodemask)); + cpuset_print_current_mems_allowed(); + pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid, + from_kuid(&init_user_ns, task_uid(victim))); +} + static void dump_header(struct oom_control *oc, struct task_struct *p) { - pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n", - current->comm, oc->gfp_mask, &oc->gfp_mask, - nodemask_pr_args(oc->nodemask), oc->order, + pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n", + current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order, current->signal->oom_score_adj); if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order) pr_warn("COMPACTION is disabled!!!\n"); - cpuset_print_current_mems_allowed(); dump_stack(); if (is_memcg_oom(oc)) mem_cgroup_print_oom_info(oc->memcg, p); @@ -448,6 +457,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p) } if (sysctl_oom_dump_tasks) dump_tasks(oc->memcg, oc->nodemask); + if (p) + dump_oom_summary(oc, p); } /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e97ebaf5ba26..a48db99da7b5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3517,13 +3517,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; - pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl\n", + pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl", current->comm, &vaf, gfp_mask, &gfp_mask, nodemask_pr_args(nodemask)); va_end(args); cpuset_print_current_mems_allowed(); - + pr_cont("\n"); dump_stack(); warn_alloc_show_mem(gfp_mask, nodemask); } -- cgit v1.2.3-59-g8ed1b From f0c867d9588d9efc10d6a55009c9560336673369 Mon Sep 17 00:00:00 2001 From: yuzhoujian Date: Fri, 28 Dec 2018 00:36:10 -0800 Subject: mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid= - memcg oom context information: oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid= [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: David Rientjes Cc: "Kirill A . Shutemov" Cc: Andrea Arcangeli Cc: Tetsuo Handa Cc: Roman Gushchin Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 11 +++++++++-- mm/memcontrol.c | 33 ++++++++++++++++++++------------- mm/oom_kill.c | 3 ++- 3 files changed, 31 insertions(+), 16 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7ab2120155a4..83ae11cbd12c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -526,9 +526,11 @@ void mem_cgroup_handle_over_high(void); unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg); -void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, +void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p); +void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg); + static inline void mem_cgroup_enter_user_fault(void) { WARN_ON(current->in_user_fault); @@ -970,7 +972,12 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) } static inline void -mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) +mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) +{ +} + +static inline void +mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6e1469b80cb7..4afd5971f2d4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1293,32 +1293,39 @@ static const char *const memcg1_stat_names[] = { #define K(x) ((x) << (PAGE_SHIFT-10)) /** - * mem_cgroup_print_oom_info: Print OOM information relevant to memory controller. + * mem_cgroup_print_oom_context: Print OOM information relevant to + * memory controller. * @memcg: The memory cgroup that went over limit * @p: Task that is going to be killed * * NOTE: @memcg and @p's mem_cgroup can be different when hierarchy is * enabled */ -void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) +void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) { - struct mem_cgroup *iter; - unsigned int i; - rcu_read_lock(); + if (memcg) { + pr_cont(",oom_memcg="); + pr_cont_cgroup_path(memcg->css.cgroup); + } else + pr_cont(",global_oom"); if (p) { - pr_info("Task in "); + pr_cont(",task_memcg="); pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id)); - pr_cont(" killed as a result of limit of "); - } else { - pr_info("Memory limit reached of cgroup "); } - - pr_cont_cgroup_path(memcg->css.cgroup); - pr_cont("\n"); - rcu_read_unlock(); +} + +/** + * mem_cgroup_print_oom_meminfo: Print OOM memory information relevant to + * memory controller. + * @memcg: The memory cgroup that went over limit + */ +void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + unsigned int i; pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n", K((u64)page_counter_read(&memcg->memory)), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d90253f1ff93..5442cb12e4ed 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -435,6 +435,7 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim) oom_constraint_text[oc->constraint], nodemask_pr_args(oc->nodemask)); cpuset_print_current_mems_allowed(); + mem_cgroup_print_oom_context(oc->memcg, victim); pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid, from_kuid(&init_user_ns, task_uid(victim))); } @@ -449,7 +450,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p) dump_stack(); if (is_memcg_oom(oc)) - mem_cgroup_print_oom_info(oc->memcg, p); + mem_cgroup_print_oom_meminfo(oc->memcg); else { show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask); if (is_dump_unreclaim_slabs()) -- cgit v1.2.3-59-g8ed1b From 9a1ea439b16b92002e0a6fceebc5d1794906e297 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Fri, 28 Dec 2018 00:36:14 -0800 Subject: mm: put_and_wait_on_page_locked() while page is migrated Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit") Baoquan He reported "Memory hotplug softlock issue" 14 November 2018: https://marc.info/?l=linux-mm&m=154217936431300&w=2 We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking. But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function(). Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic. __migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this? shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices. It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration. [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins Reported-by: Baoquan He Tested-by: Baoquan He Reviewed-by: Andrea Arcangeli Acked-by: Michal Hocko Acked-by: Linus Torvalds Acked-by: Vlastimil Babka Cc: Matthew Wilcox Cc: Baoquan He Cc: David Hildenbrand Cc: Mel Gorman Cc: David Herrmann Cc: Tim Chen Cc: Kan Liang Cc: Andi Kleen Cc: Davidlohr Bueso Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 2 ++ mm/filemap.c | 87 +++++++++++++++++++++++++++++++++++++++++-------- mm/huge_memory.c | 6 ++-- mm/migrate.c | 12 +++---- mm/vmscan.c | 10 ++---- 5 files changed, 84 insertions(+), 33 deletions(-) (limited to 'include') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 226f96f0dee0..e2d7039af6a3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page) return wait_on_page_bit_killable(compound_head(page), PG_locked); } +extern void put_and_wait_on_page_locked(struct page *page); + /* * Wait for a page to complete writeback */ diff --git a/mm/filemap.c b/mm/filemap.c index 81adec8ee02c..d2df272152f5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, if (wait_page->bit_nr != key->bit_nr) return 0; - /* Stop walking if it's locked */ + /* + * Stop walking if it's locked. + * Is this safe if put_and_wait_on_page_locked() is in use? + * Yes: the waker must hold a reference to this page, and if PG_locked + * has now already been set by another task, that task must also hold + * a reference to the *same usage* of this page; so there is no need + * to walk on to wake even the put_and_wait_on_page_locked() callers. + */ if (test_bit(key->bit_nr, &key->page->flags)) return -1; @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit) wake_up_page_bit(page, bit); } +/* + * A choice of three behaviors for wait_on_page_bit_common(): + */ +enum behavior { + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like + * __lock_page() waiting on then setting PG_locked. + */ + SHARED, /* Hold ref to page and check the bit when woken, like + * wait_on_page_writeback() waiting on PG_writeback. + */ + DROP, /* Drop ref to page before wait, no check when woken, + * like put_and_wait_on_page_locked() on PG_locked. + */ +}; + static inline int wait_on_page_bit_common(wait_queue_head_t *q, - struct page *page, int bit_nr, int state, bool lock) + struct page *page, int bit_nr, int state, enum behavior behavior) { struct wait_page_queue wait_page; wait_queue_entry_t *wait = &wait_page.wait; + bool bit_is_set; bool thrashing = false; + bool delayacct = false; unsigned long pflags; int ret = 0; if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) { - if (!PageSwapBacked(page)) + if (!PageSwapBacked(page)) { delayacct_thrashing_start(); + delayacct = true; + } psi_memstall_enter(&pflags); thrashing = true; } init_wait(wait); - wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0; + wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0; wait->func = wake_page_function; wait_page.page = page; wait_page.bit_nr = bit_nr; @@ -1084,14 +1110,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, spin_unlock_irq(&q->lock); - if (likely(test_bit(bit_nr, &page->flags))) { + bit_is_set = test_bit(bit_nr, &page->flags); + if (behavior == DROP) + put_page(page); + + if (likely(bit_is_set)) io_schedule(); - } - if (lock) { + if (behavior == EXCLUSIVE) { if (!test_and_set_bit_lock(bit_nr, &page->flags)) break; - } else { + } else if (behavior == SHARED) { if (!test_bit(bit_nr, &page->flags)) break; } @@ -1100,12 +1129,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, ret = -EINTR; break; } + + if (behavior == DROP) { + /* + * We can no longer safely access page->flags: + * even if CONFIG_MEMORY_HOTREMOVE is not enabled, + * there is a risk of waiting forever on a page reused + * for something that keeps it locked indefinitely. + * But best check for -EINTR above before breaking. + */ + break; + } } finish_wait(q, wait); if (thrashing) { - if (!PageSwapBacked(page)) + if (delayacct) delayacct_thrashing_end(); psi_memstall_leave(&pflags); } @@ -1124,17 +1164,36 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, void wait_on_page_bit(struct page *page, int bit_nr) { wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false); + wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED); } EXPORT_SYMBOL(wait_on_page_bit); int wait_on_page_bit_killable(struct page *page, int bit_nr) { wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false); + return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED); } EXPORT_SYMBOL(wait_on_page_bit_killable); +/** + * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked + * @page: The page to wait for. + * + * The caller should hold a reference on @page. They expect the page to + * become unlocked relatively soon, but do not wish to hold up migration + * (for example) by holding the reference while waiting for the page to + * come unlocked. After this function returns, the caller should not + * dereference @page. + */ +void put_and_wait_on_page_locked(struct page *page) +{ + wait_queue_head_t *q; + + page = compound_head(page); + q = page_waitqueue(page); + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP); +} + /** * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue * @page: Page defining the wait queue of interest @@ -1264,7 +1323,8 @@ void __lock_page(struct page *__page) { struct page *page = compound_head(__page); wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true); + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, + EXCLUSIVE); } EXPORT_SYMBOL(__lock_page); @@ -1272,7 +1332,8 @@ int __lock_page_killable(struct page *__page) { struct page *page = compound_head(__page); wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true); + return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, + EXCLUSIVE); } EXPORT_SYMBOL_GPL(__lock_page_killable); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index da6682bb69aa..0c0e18409fde 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1490,8 +1490,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (!get_page_unless_zero(page)) goto out_unlock; spin_unlock(vmf->ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); goto out; } @@ -1527,8 +1526,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (!get_page_unless_zero(page)) goto out_unlock; spin_unlock(vmf->ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index f7e4bfdc13b7..acda06f99754 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, /* * Once page cache replacement of page migration started, page_count - * *must* be zero. And, we don't want to call wait_on_page_locked() - * against a page without get_page(). - * So, we use get_page_unless_zero(), here. Even failed, page fault - * will occur again. + * is zero; but we must not call put_and_wait_on_page_locked() without + * a ref. Use get_page_unless_zero(), and just fault again if it fails. */ if (!get_page_unless_zero(page)) goto out; pte_unmap_unlock(ptep, ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); return; out: pte_unmap_unlock(ptep, ptl); @@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd) if (!get_page_unless_zero(page)) goto unlock; spin_unlock(ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); return; unlock: spin_unlock(ptl); diff --git a/mm/vmscan.c b/mm/vmscan.c index bd8971a29204..a714c4f800e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,14 +1460,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, count_memcg_page_event(page, PGLAZYFREED); } else if (!mapping || !__remove_mapping(mapping, page, true)) goto keep_locked; - /* - * At this point, we have no other references and there is - * no way to pick any more up (removed from LRU, removed - * from pagecache). Can use non-atomic bitops now (and - * we obviously don't have to worry about waking up a process - * waiting on the page lock, because there are no references. - */ - __ClearPageLocked(page); + + unlock_page(page); free_it: nr_reclaimed++; -- cgit v1.2.3-59-g8ed1b From 23b68cfaae0ea40a9509fad37b756a6916dec54e Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:36:18 -0800 Subject: mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init() When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of each node's highest zone is initialized before defer stage. static_init_pgcnt is used to store the number of pages like this: pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, pgdat->node_spanned_pages); because we don't want to overflow zone's range. But this is not necessary, since defer_init() is called like this: memmap_init_zone() for pfn in [start_pfn, end_pfn) defer_init(pfn, end_pfn) In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would stop before calling defer_init(). BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct, since nr_initialised is zone based instead of node based. Even node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone would have pages less than PAGES_PER_SECTION. Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Alexander Duyck Cc: Pavel Tatashin Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 -- mm/page_alloc.c | 13 ++++++------- 2 files changed, 6 insertions(+), 9 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e0c3bc2edbbd..a6e300732ec7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -692,8 +692,6 @@ typedef struct pglist_data { * is the first PFN that needs to be initialised. */ unsigned long first_deferred_pfn; - /* Number of non-deferred pages */ - unsigned long static_init_pgcnt; #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a48db99da7b5..a1e5f0de76bd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -326,8 +326,13 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) /* Always populate low zones for address-constrained allocations */ if (end_pfn < pgdat_end_pfn(NODE_DATA(nid))) return false; + + /* + * We start only with one section of pages, more pages are added as + * needed until the rest of deferred pages are initialized. + */ nr_initialised++; - if ((nr_initialised > NODE_DATA(nid)->static_init_pgcnt) && + if ((nr_initialised > PAGES_PER_SECTION) && (pfn & (PAGES_PER_SECTION - 1)) == 0) { NODE_DATA(nid)->first_deferred_pfn = pfn; return true; @@ -6585,12 +6590,6 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT static inline void pgdat_set_deferred_range(pg_data_t *pgdat) { - /* - * We start only with one section of pages, more pages are added as - * needed until the rest of deferred pages are initialized. - */ - pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, - pgdat->node_spanned_pages); pgdat->first_deferred_pfn = ULONG_MAX; } #else -- cgit v1.2.3-59-g8ed1b From 2c2a5af6fed20cf74401c9d64319c76c5ff81309 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Fri, 28 Dec 2018 00:36:22 -0800 Subject: mm, memory_hotplug: add nid parameter to arch_remove_memory Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Reviewed-by: Pavel Tatashin Cc: Michal Hocko Cc: Dan Williams Cc: Jerome Glisse Cc: Jonathan Cameron Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/ia64/mm/init.c | 2 +- arch/powerpc/mm/mem.c | 3 ++- arch/s390/mm/init.c | 2 +- arch/sh/mm/init.c | 2 +- arch/x86/mm/init_32.c | 2 +- arch/x86/mm/init_64.c | 3 ++- include/linux/memory_hotplug.h | 4 ++-- kernel/memremap.c | 5 ++++- mm/memory_hotplug.c | 2 +- 9 files changed, 15 insertions(+), 10 deletions(-) (limited to 'include') diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index d5e12ff1d73c..904fe55e10fc 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -661,7 +661,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 20394e52fe27..33cc6f676fa6 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -139,7 +139,8 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap * } #ifdef CONFIG_MEMORY_HOTREMOVE -int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int __meminit arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 50388190b393..3e82f66d5c61 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -242,7 +242,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { /* * There is no hardware or firmware interface which could trigger a diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index c8c13c777162..a8e5c0e00fca 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -443,7 +443,7 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = PFN_DOWN(start); unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 49ecf5ecf6d3..85c94f9a87f8 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -860,7 +860,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, } #ifdef CONFIG_MEMORY_HOTREMOVE -int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 484c1b92f078..bccff68e3267 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1141,7 +1141,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end) remove_pagetable(start, end, true, NULL); } -int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) +int __ref arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7383a7a76d69..9e4d9b9b93ea 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -107,8 +107,8 @@ static inline bool movable_node_is_enabled(void) } #ifdef CONFIG_MEMORY_HOTREMOVE -extern int arch_remove_memory(u64 start, u64 size, - struct vmem_altmap *altmap); +extern int arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap); extern int __remove_pages(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); #endif /* CONFIG_MEMORY_HOTREMOVE */ diff --git a/kernel/memremap.c b/kernel/memremap.c index 3eef989ef035..0d5603d76c37 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -87,6 +87,7 @@ static void devm_memremap_pages_release(void *data) struct resource *res = &pgmap->res; resource_size_t align_start, align_size; unsigned long pfn; + int nid; pgmap->kill(pgmap->ref); for_each_device_pfn(pfn, pgmap) @@ -97,13 +98,15 @@ static void devm_memremap_pages_release(void *data) align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) - align_start; + nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT)); + mem_hotplug_begin(); if (pgmap->type == MEMORY_DEVICE_PRIVATE) { pfn = align_start >> PAGE_SHIFT; __remove_pages(page_zone(pfn_to_page(pfn)), pfn, align_size >> PAGE_SHIFT, NULL); } else { - arch_remove_memory(align_start, align_size, + arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ? &pgmap->altmap : NULL); kasan_remove_zero_shadow(__va(align_start), align_size); } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6258e0e923cc..0718cf7427b2 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1841,7 +1841,7 @@ void __ref __remove_memory(int nid, u64 start, u64 size) memblock_free(start, size); memblock_remove(start, size); - arch_remove_memory(start, size, NULL); + arch_remove_memory(nid, start, size, NULL); try_offline_node(nid); -- cgit v1.2.3-59-g8ed1b From fed84c78527009d4f799a3ed9a566502fa026d82 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Fri, 28 Dec 2018 00:36:29 -0800 Subject: mm/memblock.c: skip kmemleak for kasan_init() Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us Signed-off-by: Qian Cai Acked-by: Catalin Marinas Cc: Michal Hocko Cc: Mike Rapoport Cc: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/kasan_init.c | 2 +- include/linux/memblock.h | 1 + mm/memblock.c | 19 +++++++++++-------- 3 files changed, 13 insertions(+), 9 deletions(-) (limited to 'include') diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index 3e142add890b..4b55b15707a3 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -39,7 +39,7 @@ static phys_addr_t __init kasan_alloc_zeroed_page(int node) { void *p = memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), - MEMBLOCK_ALLOC_ACCESSIBLE, node); + MEMBLOCK_ALLOC_KASAN, node); return __pa(p); } diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 5f74ba623dbd..64c41cf45590 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -319,6 +319,7 @@ static inline int memblock_get_region_node(const struct memblock_region *r) /* Flags for memblock allocation APIs */ #define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0) #define MEMBLOCK_ALLOC_ACCESSIBLE 0 +#define MEMBLOCK_ALLOC_KASAN 1 /* We are using top down, so it is safe to use 0 here */ #define MEMBLOCK_LOW_LIMIT 0 diff --git a/mm/memblock.c b/mm/memblock.c index f57d7620668b..022d4cbb3618 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -262,7 +262,8 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size, phys_addr_t kernel_end, ret; /* pump up @end */ - if (end == MEMBLOCK_ALLOC_ACCESSIBLE) + if (end == MEMBLOCK_ALLOC_ACCESSIBLE || + end == MEMBLOCK_ALLOC_KASAN) end = memblock.current_limit; /* avoid allocating the first page */ @@ -1419,13 +1420,15 @@ again: done: ptr = phys_to_virt(alloc); - /* - * The min_count is set to 0 so that bootmem allocated blocks - * are never reported as leaks. This is because many of these blocks - * are only referred via the physical address which is not - * looked up by kmemleak. - */ - kmemleak_alloc(ptr, size, 0, 0); + /* Skip kmemleak for kasan_init() due to high volume. */ + if (max_addr != MEMBLOCK_ALLOC_KASAN) + /* + * The min_count is set to 0 so that bootmem allocated + * blocks are never reported as leaks. This is because many + * of these blocks are only referred via the physical + * address which is not looked up by kmemleak. + */ + kmemleak_alloc(ptr, size, 0, 0); return ptr; } -- cgit v1.2.3-59-g8ed1b From 9e247bab0668a5893b3efa131cec5b5859467834 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Fri, 28 Dec 2018 00:36:58 -0800 Subject: mm: remove pte_lock_deinit() Pagetable page doesn't touch page->mapping or have any used field that overlaps with it. No need to clear mapping in dtor. In fact, doing so might mask problems that otherwise would be detected by bad_page(). Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com Signed-off-by: Yu Zhao Reviewed-by: Matthew Wilcox Acked-by: Michal Hocko Cc: Hugh Dickins Cc: "Kirill A . Shutemov" Cc: Dan Williams Cc: Pavel Tatashin Cc: Souptick Joarder Cc: Logan Gunthorpe Cc: Keith Busch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9963f77f1101..3c39b9dc7a90 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1954,13 +1954,6 @@ static inline bool ptlock_init(struct page *page) return true; } -/* Reset page->mapping so free_pages_check won't complain. */ -static inline void pte_lock_deinit(struct page *page) -{ - page->mapping = NULL; - ptlock_free(page); -} - #else /* !USE_SPLIT_PTE_PTLOCKS */ /* * We use mm->page_table_lock to guard all pagetable pages of the mm. @@ -1971,7 +1964,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) } static inline void ptlock_cache_init(void) {} static inline bool ptlock_init(struct page *page) { return true; } -static inline void pte_lock_deinit(struct page *page) {} +static inline void ptlock_free(struct page *page) {} #endif /* USE_SPLIT_PTE_PTLOCKS */ static inline void pgtable_init(void) @@ -1991,7 +1984,7 @@ static inline bool pgtable_page_ctor(struct page *page) static inline void pgtable_page_dtor(struct page *page) { - pte_lock_deinit(page); + ptlock_free(page); __ClearPageTable(page); dec_zone_page_state(page, NR_PAGETABLE); } -- cgit v1.2.3-59-g8ed1b From 83af658898cb292a32d8b6cd9b51266d7cfc4b6a Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:02 -0800 Subject: mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Cc: Dave Hansen Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 3 +-- mm/sparse.c | 9 +-------- 2 files changed, 2 insertions(+), 10 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a6e300732ec7..fc4b5cdb6c2d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -637,8 +637,7 @@ typedef struct pglist_data { #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) /* * Must be held any time you expect node_start_pfn, node_present_pages - * or node_spanned_pages stay constant. Holding this will also - * guarantee that any pfn_valid() stays that way. + * or node_spanned_pages stay constant. * * pgdat_resize_lock() and pgdat_resize_unlock() are provided to * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG diff --git a/mm/sparse.c b/mm/sparse.c index 691544a2814c..7323a03fbc39 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -685,7 +685,6 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, struct mem_section *ms; struct page *memmap; unsigned long *usemap; - unsigned long flags; int ret; /* @@ -705,8 +704,6 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, return -ENOMEM; } - pgdat_resize_lock(pgdat, &flags); - ms = __pfn_to_section(start_pfn); if (ms->section_mem_map & SECTION_MARKED_PRESENT) { ret = -EEXIST; @@ -723,7 +720,6 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, sparse_init_one_section(ms, section_nr, memmap, usemap); out: - pgdat_resize_unlock(pgdat, &flags); if (ret < 0) { kfree(usemap); __kfree_section_memmap(memmap, altmap); @@ -794,10 +790,8 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, unsigned long map_offset, struct vmem_altmap *altmap) { struct page *memmap = NULL; - unsigned long *usemap = NULL, flags; - struct pglist_data *pgdat = zone->zone_pgdat; + unsigned long *usemap = NULL; - pgdat_resize_lock(pgdat, &flags); if (ms->section_mem_map) { usemap = ms->pageblock_flags; memmap = sparse_decode_mem_map(ms->section_mem_map, @@ -805,7 +799,6 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, ms->section_mem_map = 0; ms->pageblock_flags = NULL; } - pgdat_resize_unlock(pgdat, &flags); clear_hwpoisoned_pages(memmap + map_offset, PAGES_PER_SECTION - map_offset); -- cgit v1.2.3-59-g8ed1b From 4e0d2e7ef14d9e1c900dac909db45263822b824f Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:06 -0800 Subject: mm, sparse: pass nid instead of pgdat to sparse_add_one_section() Since the information needed in sparse_add_one_section() is node id to allocate proper memory, it is not necessary to pass its pgdat. This patch changes the prototype of sparse_add_one_section() to pass node id directly. This is intended to reduce misleading that sparse_add_one_section() would touch pgdat. Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Cc: Dave Hansen Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 4 ++-- mm/memory_hotplug.c | 2 +- mm/sparse.c | 8 ++++---- 3 files changed, 7 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 9e4d9b9b93ea..8ed6e09a5c0c 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -333,8 +333,8 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); -extern int sparse_add_one_section(struct pglist_data *pgdat, - unsigned long start_pfn, struct vmem_altmap *altmap); +extern int sparse_add_one_section(int nid, unsigned long start_pfn, + struct vmem_altmap *altmap); extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, unsigned long map_offset, struct vmem_altmap *altmap); extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0718cf7427b2..5f15f9c04c4a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -253,7 +253,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, if (pfn_valid(phys_start_pfn)) return -EEXIST; - ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn, altmap); + ret = sparse_add_one_section(nid, phys_start_pfn, altmap); if (ret < 0) return ret; diff --git a/mm/sparse.c b/mm/sparse.c index 7323a03fbc39..7ea5dc6c6b19 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -678,8 +678,8 @@ static void free_map_bootmem(struct page *memmap) * set. If this is <=0, then that means that the passed-in * map was not consumed and must be freed. */ -int __meminit sparse_add_one_section(struct pglist_data *pgdat, - unsigned long start_pfn, struct vmem_altmap *altmap) +int __meminit sparse_add_one_section(int nid, unsigned long start_pfn, + struct vmem_altmap *altmap) { unsigned long section_nr = pfn_to_section_nr(start_pfn); struct mem_section *ms; @@ -691,11 +691,11 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, * no locking for this, because it does its own * plus, it does a kmalloc */ - ret = sparse_index_init(section_nr, pgdat->node_id); + ret = sparse_index_init(section_nr, nid); if (ret < 0 && ret != -EEXIST) return ret; ret = 0; - memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap); + memmap = kmalloc_section_memmap(section_nr, nid, altmap); if (!memmap) return -ENOMEM; usemap = __kmalloc_section_usemap(); -- cgit v1.2.3-59-g8ed1b From fa004ab7365ffa1e17e6b267d64798afccb94946 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:10 -0800 Subject: mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection During online_pages phase, pgdat->nr_zones will be updated in case this zone is empty. Currently the online_pages phase is protected by the global locks (device_device_hotplug_lock and mem_hotplug_lock), which ensures there is no contention during the update of nr_zones. These global locks introduces scalability issues (especially the second one), which slow down code relying on get_online_mems(). This is also a preparation for not having to rely on get_online_mems() but instead some more fine grained locks. The patch moves init_currently_empty_zone under both zone_span_writelock and pgdat_resize_lock because both the pgdat state is changed (nr_zones) and the zone's start_pfn. Also this patch changes the documentation of node_size_lock to include the protection of nr_zones. Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: Oscar Salvador Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 4 ++-- mm/memory_hotplug.c | 5 ++--- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc4b5cdb6c2d..cc4a507d7ca4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -636,8 +636,8 @@ typedef struct pglist_data { #endif #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) /* - * Must be held any time you expect node_start_pfn, node_present_pages - * or node_spanned_pages stay constant. + * Must be held any time you expect node_start_pfn, + * node_present_pages, node_spanned_pages or nr_zones to stay constant. * * pgdat_resize_lock() and pgdat_resize_unlock() are provided to * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 5f15f9c04c4a..c2b34ec602ee 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -743,14 +743,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, int nid = pgdat->node_id; unsigned long flags; - if (zone_is_empty(zone)) - init_currently_empty_zone(zone, start_pfn, nr_pages); - clear_zone_contiguous(zone); /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */ pgdat_resize_lock(pgdat, &flags); zone_span_writelock(zone); + if (zone_is_empty(zone)) + init_currently_empty_zone(zone, start_pfn, nr_pages); resize_zone_range(zone, start_pfn, nr_pages); zone_span_writeunlock(zone); resize_pgdat_range(pgdat, start_pfn, nr_pages); -- cgit v1.2.3-59-g8ed1b From 144552ff8995dd34d049a203d636b259ab751137 Mon Sep 17 00:00:00 2001 From: Anthony Yznaga Date: Fri, 28 Dec 2018 00:37:31 -0800 Subject: /proc/kpagecount: return 0 for special pages that are never mapped Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga Reviewed-by: Andrew Morton Acked-by: Matthew Wilcox Reviewed-by: Naoya Horiguchi Cc: Vlastimil Babka Cc: David Rientjes Cc: Alexey Dobriyan Cc: Kirill A. Shutemov Cc: Mike Rapoport Cc: Michal Hocko Cc: Alexander Duyck Cc: Johannes Weiner Cc: Miles Chen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/page.c | 2 +- include/linux/page-flags.h | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/fs/proc/page.c b/fs/proc/page.c index 6c517b11acf8..40b05e0d4274 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -46,7 +46,7 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, ppage = pfn_to_page(pfn); else ppage = NULL; - if (!ppage || PageSlab(ppage)) + if (!ppage || PageSlab(ppage) || page_has_type(ppage)) pcount = 0; else pcount = page_mapcount(ppage); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 50ce1bddaf56..39b4494e29f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -669,6 +669,7 @@ PAGEFLAG_FALSE(DoubleMap) #define PAGE_TYPE_BASE 0xf0000000 /* Reserve 0x0000007f to catch underflows of page_mapcount */ +#define PAGE_MAPCOUNT_RESERVE -128 #define PG_buddy 0x00000080 #define PG_balloon 0x00000100 #define PG_kmemcg 0x00000200 @@ -677,6 +678,11 @@ PAGEFLAG_FALSE(DoubleMap) #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) +static inline int page_has_type(struct page *page) +{ + return (int)page->page_type < PAGE_MAPCOUNT_RESERVE; +} + #define PAGE_TYPE_OPS(uname, lname) \ static __always_inline int Page##uname(struct page *page) \ { \ -- cgit v1.2.3-59-g8ed1b From 8e2d43405b22e98cf5f3730c1829ec1fdbe17ae7 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 28 Dec 2018 00:37:53 -0800 Subject: lib/ioremap: ensure break-before-make is used for huge p4d mappings Whilst no architectures actually enable support for huge p4d mappings in the vmap area, the code that is implemented should be using break-before-make, as we do for pud and pmd huge entries. Link: http://lkml.kernel.org/r/1544120495-17438-6-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon Reviewed-by: Toshi Kani Cc: Chintan Pandya Cc: Toshi Kani Cc: Thomas Gleixner Cc: Michal Hocko Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Sean Christopherson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/mmu.c | 5 +++++ arch/x86/mm/pgtable.c | 8 ++++++++ include/asm-generic/pgtable.h | 5 +++++ lib/ioremap.c | 27 +++++++++++++++++++++------ 4 files changed, 39 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 13b80361d9f5..b6f5aa52ac67 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1043,6 +1043,11 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr) return 1; } +int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) +{ + return 0; /* Don't attempt a block mapping */ +} + #ifdef CONFIG_MEMORY_HOTPLUG int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index e95a7d6ac8f8..b0284eab14dc 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -794,6 +794,14 @@ int pmd_clear_huge(pmd_t *pmd) return 0; } +/* + * Until we support 512GB pages, skip them in the vmap area. + */ +int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) +{ + return 0; +} + #ifdef CONFIG_X86_64 /** * pud_free_pmd_page - Clear pud entry and free pmd page. diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index a9cac82e9a7a..05e61e6c843f 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -1057,6 +1057,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot); int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); int pud_clear_huge(pud_t *pud); int pmd_clear_huge(pmd_t *pmd); +int p4d_free_pud_page(p4d_t *p4d, unsigned long addr); int pud_free_pmd_page(pud_t *pud, unsigned long addr); int pmd_free_pte_page(pmd_t *pmd, unsigned long addr); #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ @@ -1084,6 +1085,10 @@ static inline int pmd_clear_huge(pmd_t *pmd) { return 0; } +static inline int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) +{ + return 0; +} static inline int pud_free_pmd_page(pud_t *pud, unsigned long addr) { return 0; diff --git a/lib/ioremap.c b/lib/ioremap.c index 10d7c5485c39..063213685563 100644 --- a/lib/ioremap.c +++ b/lib/ioremap.c @@ -156,6 +156,25 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, return 0; } +static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, + unsigned long end, phys_addr_t phys_addr, + pgprot_t prot) +{ + if (!ioremap_p4d_enabled()) + return 0; + + if ((end - addr) != P4D_SIZE) + return 0; + + if (!IS_ALIGNED(phys_addr, P4D_SIZE)) + return 0; + + if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr)) + return 0; + + return p4d_set_huge(p4d, phys_addr, prot); +} + static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot) { @@ -168,12 +187,8 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, do { next = p4d_addr_end(addr, end); - if (ioremap_p4d_enabled() && - ((next - addr) == P4D_SIZE) && - IS_ALIGNED(phys_addr, P4D_SIZE)) { - if (p4d_set_huge(p4d, phys_addr, prot)) - continue; - } + if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) + continue; if (ioremap_pud_range(p4d, addr, next, phys_addr, prot)) return -ENOMEM; -- cgit v1.2.3-59-g8ed1b From 5d6527a784f7a6d247961e046e830de8d71b47d1 Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Fri, 28 Dec 2018 00:38:05 -0800 Subject: mm/mmu_notifier: use structure for invalidate_range_start/end callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mmu notifier contextual informations", v2. This patchset adds contextual information, why an invalidation is happening, to mmu notifier callback. This is necessary for user of mmu notifier that wish to maintains their own data structure without having to add new fields to struct vm_area_struct (vma). For instance device can have they own page table that mirror the process address space. When a vma is unmap (munmap() syscall) the device driver can free the device page table for the range. Today we do not have any information on why a mmu notifier call back is happening and thus device driver have to assume that it is always an munmap(). This is inefficient at it means that it needs to re-allocate device page table on next page fault and rebuild the whole device driver data structure for the range. Other use case beside munmap() also exist, for instance it is pointless for device driver to invalidate the device page table when the invalidation is for the soft dirtyness tracking. Or device driver can optimize away mprotect() that change the page table permission access for the range. This patchset enables all this optimizations for device drivers. I do not include any of those in this series but another patchset I am posting will leverage this. The patchset is pretty simple from a code point of view. The first two patches consolidate all mmu notifier arguments into a struct so that it is easier to add/change arguments. The last patch adds the contextual information (munmap, protection, soft dirty, clear, ...). This patch (of 3): To avoid having to change many callback definition everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end callback. No functional changes with this patch. [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc] Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse Acked-by: Jan Kara Acked-by: Jason Gunthorpe [infiniband] Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krcmar Cc: Michal Hocko Cc: Christian Koenig Cc: Felix Kuehling Cc: Ralph Campbell Cc: John Hubbard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 47 ++++++++++++++------------------- drivers/gpu/drm/i915/i915_gem_userptr.c | 14 +++++----- drivers/gpu/drm/radeon/radeon_mn.c | 16 +++++------ drivers/infiniband/core/umem_odp.c | 20 ++++++-------- drivers/infiniband/hw/hfi1/mmu_rb.c | 13 ++++----- drivers/misc/mic/scif/scif_dma.c | 11 +++----- drivers/misc/sgi-gru/grutlbpurge.c | 14 +++++----- drivers/xen/gntdev.c | 12 ++++----- include/linux/mmu_notifier.h | 14 ++++++---- mm/hmm.c | 23 +++++++--------- mm/mmu_notifier.c | 21 +++++++++++++-- virt/kvm/kvm_main.c | 14 ++++------ 12 files changed, 103 insertions(+), 116 deletions(-) (limited to 'include') diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index e55508b39496..3e6823fdd939 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -238,44 +238,40 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node, * amdgpu_mn_invalidate_range_start_gfx - callback to notify about mm change * * @mn: our notifier - * @mm: the mm this callback is about - * @start: start of updated range - * @end: end of updated range + * @range: mmu notifier context * * Block for operations on BOs to finish and mark pages as accessed and * potentially dirty. */ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct amdgpu_mn *amn = container_of(mn, struct amdgpu_mn, mn); struct interval_tree_node *it; + unsigned long end; /* notification is exclusive, but interval is inclusive */ - end -= 1; + end = range->end - 1; /* TODO we should be able to split locking for interval tree and * amdgpu_mn_invalidate_node */ - if (amdgpu_mn_read_lock(amn, blockable)) + if (amdgpu_mn_read_lock(amn, range->blockable)) return -EAGAIN; - it = interval_tree_iter_first(&amn->objects, start, end); + it = interval_tree_iter_first(&amn->objects, range->start, end); while (it) { struct amdgpu_mn_node *node; - if (!blockable) { + if (!range->blockable) { amdgpu_mn_read_unlock(amn); return -EAGAIN; } node = container_of(it, struct amdgpu_mn_node, it); - it = interval_tree_iter_next(it, start, end); + it = interval_tree_iter_next(it, range->start, end); - amdgpu_mn_invalidate_node(node, start, end); + amdgpu_mn_invalidate_node(node, range->start, end); } return 0; @@ -294,39 +290,38 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, * are restorted in amdgpu_mn_invalidate_range_end_hsa. */ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct amdgpu_mn *amn = container_of(mn, struct amdgpu_mn, mn); struct interval_tree_node *it; + unsigned long end; /* notification is exclusive, but interval is inclusive */ - end -= 1; + end = range->end - 1; - if (amdgpu_mn_read_lock(amn, blockable)) + if (amdgpu_mn_read_lock(amn, range->blockable)) return -EAGAIN; - it = interval_tree_iter_first(&amn->objects, start, end); + it = interval_tree_iter_first(&amn->objects, range->start, end); while (it) { struct amdgpu_mn_node *node; struct amdgpu_bo *bo; - if (!blockable) { + if (!range->blockable) { amdgpu_mn_read_unlock(amn); return -EAGAIN; } node = container_of(it, struct amdgpu_mn_node, it); - it = interval_tree_iter_next(it, start, end); + it = interval_tree_iter_next(it, range->start, end); list_for_each_entry(bo, &node->bos, mn_list) { struct kgd_mem *mem = bo->kfd_bo; if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, - start, end)) - amdgpu_amdkfd_evict_userptr(mem, mm); + range->start, + end)) + amdgpu_amdkfd_evict_userptr(mem, range->mm); } } @@ -344,9 +339,7 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, * Release the lock again to allow new command submissions. */ static void amdgpu_mn_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { struct amdgpu_mn *amn = container_of(mn, struct amdgpu_mn, mn); diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c index 2c9b284036d1..3df77020aada 100644 --- a/drivers/gpu/drm/i915/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c @@ -113,27 +113,25 @@ static void del_object(struct i915_mmu_object *mo) } static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn); struct i915_mmu_object *mo; struct interval_tree_node *it; LIST_HEAD(cancelled); + unsigned long end; if (RB_EMPTY_ROOT(&mn->objects.rb_root)) return 0; /* interval ranges are inclusive, but invalidate range is exclusive */ - end--; + end = range->end - 1; spin_lock(&mn->lock); - it = interval_tree_iter_first(&mn->objects, start, end); + it = interval_tree_iter_first(&mn->objects, range->start, end); while (it) { - if (!blockable) { + if (!range->blockable) { spin_unlock(&mn->lock); return -EAGAIN; } @@ -151,7 +149,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, queue_work(mn->wq, &mo->work); list_add(&mo->link, &cancelled); - it = interval_tree_iter_next(it, start, end); + it = interval_tree_iter_next(it, range->start, end); } list_for_each_entry(mo, &cancelled, link) del_object(mo); diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c index f8b35df44c60..b3019505065a 100644 --- a/drivers/gpu/drm/radeon/radeon_mn.c +++ b/drivers/gpu/drm/radeon/radeon_mn.c @@ -119,40 +119,38 @@ static void radeon_mn_release(struct mmu_notifier *mn, * unmap them by move them into system domain again. */ static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn); struct ttm_operation_ctx ctx = { false, false }; struct interval_tree_node *it; + unsigned long end; int ret = 0; /* notification is exclusive, but interval is inclusive */ - end -= 1; + end = range->end - 1; /* TODO we should be able to split locking for interval tree and * the tear down. */ - if (blockable) + if (range->blockable) mutex_lock(&rmn->lock); else if (!mutex_trylock(&rmn->lock)) return -EAGAIN; - it = interval_tree_iter_first(&rmn->objects, start, end); + it = interval_tree_iter_first(&rmn->objects, range->start, end); while (it) { struct radeon_mn_node *node; struct radeon_bo *bo; long r; - if (!blockable) { + if (!range->blockable) { ret = -EAGAIN; goto out_unlock; } node = container_of(it, struct radeon_mn_node, it); - it = interval_tree_iter_next(it, start, end); + it = interval_tree_iter_next(it, range->start, end); list_for_each_entry(bo, &node->bos, mn_list) { diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 9608681224e6..a4ec43093cb3 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -146,15 +146,12 @@ static int invalidate_range_start_trampoline(struct ib_umem_odp *item, } static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct ib_ucontext_per_mm *per_mm = container_of(mn, struct ib_ucontext_per_mm, mn); - if (blockable) + if (range->blockable) down_read(&per_mm->umem_rwsem); else if (!down_read_trylock(&per_mm->umem_rwsem)) return -EAGAIN; @@ -169,9 +166,10 @@ static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; } - return rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, start, end, + return rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, + range->end, invalidate_range_start_trampoline, - blockable, NULL); + range->blockable, NULL); } static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start, @@ -182,9 +180,7 @@ static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start, } static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { struct ib_ucontext_per_mm *per_mm = container_of(mn, struct ib_ucontext_per_mm, mn); @@ -192,8 +188,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn, if (unlikely(!per_mm->active)) return; - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, start, - end, + rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start, + range->end, invalidate_range_end_trampoline, true, NULL); up_read(&per_mm->umem_rwsem); } diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c index 475b769e120c..14d2a90964c3 100644 --- a/drivers/infiniband/hw/hfi1/mmu_rb.c +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c @@ -68,8 +68,7 @@ struct mmu_rb_handler { static unsigned long mmu_node_start(struct mmu_rb_node *); static unsigned long mmu_node_last(struct mmu_rb_node *); static int mmu_notifier_range_start(struct mmu_notifier *, - struct mm_struct *, - unsigned long, unsigned long, bool); + const struct mmu_notifier_range *); static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *, unsigned long, unsigned long); static void do_remove(struct mmu_rb_handler *handler, @@ -284,10 +283,7 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler, } static int mmu_notifier_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct mmu_rb_handler *handler = container_of(mn, struct mmu_rb_handler, mn); @@ -297,10 +293,11 @@ static int mmu_notifier_range_start(struct mmu_notifier *mn, bool added = false; spin_lock_irqsave(&handler->lock, flags); - for (node = __mmu_int_rb_iter_first(root, start, end - 1); + for (node = __mmu_int_rb_iter_first(root, range->start, range->end-1); node; node = ptr) { /* Guard against node removal. */ - ptr = __mmu_int_rb_iter_next(node, start, end - 1); + ptr = __mmu_int_rb_iter_next(node, range->start, + range->end - 1); trace_hfi1_mmu_mem_invalidate(node->addr, node->len); if (handler->ops->invalidate(handler->ops_arg, node)) { __mmu_int_rb_remove(node, root); diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c index 18b8ed57c4ac..e0d97044d0e9 100644 --- a/drivers/misc/mic/scif/scif_dma.c +++ b/drivers/misc/mic/scif/scif_dma.c @@ -201,23 +201,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn, } static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct scif_mmu_notif *mmn; mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier); - scif_rma_destroy_tcw(mmn, start, end - start); + scif_rma_destroy_tcw(mmn, range->start, range->end - range->start); return 0; } static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { /* * Nothing to do here, everything needed was done in diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c index 03b49d52092e..ca2032afe035 100644 --- a/drivers/misc/sgi-gru/grutlbpurge.c +++ b/drivers/misc/sgi-gru/grutlbpurge.c @@ -220,9 +220,7 @@ void gru_flush_all_tlb(struct gru_state *gru) * MMUOPS notifier callout functions */ static int gru_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct, ms_notifier); @@ -230,15 +228,14 @@ static int gru_invalidate_range_start(struct mmu_notifier *mn, STAT(mmu_invalidate_range); atomic_inc(&gms->ms_range_active); gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms, - start, end, atomic_read(&gms->ms_range_active)); - gru_flush_tlb_range(gms, start, end - start); + range->start, range->end, atomic_read(&gms->ms_range_active)); + gru_flush_tlb_range(gms, range->start, range->end - range->start); return 0; } static void gru_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct, ms_notifier); @@ -247,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn, (void)atomic_dec_and_test(&gms->ms_range_active); wake_up_all(&gms->ms_wait_queue); - gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end); + gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", + gms, range->start, range->end); } static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm) diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c index b0b02a501167..5efc5eee9544 100644 --- a/drivers/xen/gntdev.c +++ b/drivers/xen/gntdev.c @@ -520,26 +520,26 @@ static int unmap_if_in_range(struct gntdev_grant_map *map, } static int mn_invl_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn); struct gntdev_grant_map *map; int ret = 0; - if (blockable) + if (range->blockable) mutex_lock(&priv->lock); else if (!mutex_trylock(&priv->lock)) return -EAGAIN; list_for_each_entry(map, &priv->maps, next) { - ret = unmap_if_in_range(map, start, end, blockable); + ret = unmap_if_in_range(map, range->start, range->end, + range->blockable); if (ret) goto out_unlock; } list_for_each_entry(map, &priv->freeable_maps, next) { - ret = unmap_if_in_range(map, start, end, blockable); + ret = unmap_if_in_range(map, range->start, range->end, + range->blockable); if (ret) goto out_unlock; } diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 913c3c13e36e..3d377805b29c 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -25,6 +25,13 @@ struct mmu_notifier_mm { spinlock_t lock; }; +struct mmu_notifier_range { + struct mm_struct *mm; + unsigned long start; + unsigned long end; + bool blockable; +}; + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -146,12 +153,9 @@ struct mmu_notifier_ops { * */ int (*invalidate_range_start)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable); + const struct mmu_notifier_range *range); void (*invalidate_range_end)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); /* * invalidate_range() is either called between diff --git a/mm/hmm.c b/mm/hmm.c index 361f3706962f..789587731217 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -189,35 +189,30 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm) } static int hmm_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct hmm_update update; - struct hmm *hmm = mm->hmm; + struct hmm *hmm = range->mm->hmm; VM_BUG_ON(!hmm); - update.start = start; - update.end = end; + update.start = range->start; + update.end = range->end; update.event = HMM_UPDATE_INVALIDATE; - update.blockable = blockable; + update.blockable = range->blockable; return hmm_invalidate_range(hmm, true, &update); } static void hmm_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { struct hmm_update update; - struct hmm *hmm = mm->hmm; + struct hmm *hmm = range->mm->hmm; VM_BUG_ON(!hmm); - update.start = start; - update.end = end; + update.start = range->start; + update.end = range->end; update.event = HMM_UPDATE_INVALIDATE; update.blockable = true; hmm_invalidate_range(hmm, false, &update); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 755466cd289a..74a7dc3d11c8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -171,14 +171,20 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end, bool blockable) { + struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int ret = 0; int id; + range->blockable = blockable; + range->start = start; + range->end = end; + range->mm = mm; + id = srcu_read_lock(&srcu); hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { if (mn->ops->invalidate_range_start) { - int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable); + int _ret = mn->ops->invalidate_range_start(mn, range); if (_ret) { pr_info("%pS callback failed with %d in %sblockable context.\n", mn->ops->invalidate_range_start, _ret, @@ -198,9 +204,20 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, unsigned long end, bool only_end) { + struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int id; + /* + * The end call back will never be call if the start refused to go + * through because of blockable was false so here assume that we + * can block. + */ + range->blockable = true; + range->start = start; + range->end = end; + range->mm = mm; + id = srcu_read_lock(&srcu); hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { /* @@ -219,7 +236,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, if (!only_end && mn->ops->invalidate_range) mn->ops->invalidate_range(mn, mm, start, end); if (mn->ops->invalidate_range_end) - mn->ops->invalidate_range_end(mn, mm, start, end); + mn->ops->invalidate_range_end(mn, range); } srcu_read_unlock(&srcu, id); } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cf7cc0554094..666d0155662d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -363,10 +363,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, } static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end, - bool blockable) + const struct mmu_notifier_range *range) { struct kvm *kvm = mmu_notifier_to_kvm(mn); int need_tlb_flush = 0, idx; @@ -380,7 +377,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * count is also read inside the mmu_lock critical section. */ kvm->mmu_notifier_count++; - need_tlb_flush = kvm_unmap_hva_range(kvm, start, end); + need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end); need_tlb_flush |= kvm->tlbs_dirty; /* we've to flush the tlb before the pages can be freed */ if (need_tlb_flush) @@ -388,7 +385,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, spin_unlock(&kvm->mmu_lock); - ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable); + ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start, + range->end, range->blockable); srcu_read_unlock(&kvm->srcu, idx); @@ -396,9 +394,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, } static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, - unsigned long end) + const struct mmu_notifier_range *range) { struct kvm *kvm = mmu_notifier_to_kvm(mn); -- cgit v1.2.3-59-g8ed1b From ac46d4f3c43241ffa23d5bf36153a0830c0e02cc Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Fri, 28 Dec 2018 00:38:09 -0800 Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Acked-by: Christian König Acked-by: Jan Kara Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krcmar Cc: Michal Hocko Cc: Felix Kuehling Cc: Ralph Campbell Cc: John Hubbard From: Jérôme Glisse Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dax.c | 8 ++-- fs/proc/task_mmu.c | 7 +++- include/linux/mm.h | 6 ++- include/linux/mmu_notifier.h | 87 +++++++++++++++++++++++++------------- kernel/events/uprobes.c | 10 ++--- mm/huge_memory.c | 54 +++++++++++------------- mm/hugetlb.c | 52 +++++++++++------------ mm/khugepaged.c | 10 ++--- mm/ksm.c | 21 ++++------ mm/madvise.c | 21 +++++----- mm/memory.c | 99 ++++++++++++++++++++++---------------------- mm/migrate.c | 28 ++++++------- mm/mmu_notifier.c | 37 ++++------------- mm/mprotect.c | 15 +++---- mm/mremap.c | 10 ++--- mm/oom_kill.c | 17 ++++---- mm/rmap.c | 30 ++++++++------ 17 files changed, 262 insertions(+), 250 deletions(-) (limited to 'include') diff --git a/fs/dax.c b/fs/dax.c index 48132eca3761..262e14f29933 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -779,7 +779,8 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index, i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) { - unsigned long address, start, end; + struct mmu_notifier_range range; + unsigned long address; cond_resched(); @@ -793,7 +794,8 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index, * call mmu_notifier_invalidate_range_start() on our behalf * before taking any lock. */ - if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl)) + if (follow_pte_pmd(vma->vm_mm, address, &range, + &ptep, &pmdp, &ptl)) continue; /* @@ -835,7 +837,7 @@ unlock_pte: pte_unmap_unlock(ptep, ptl); } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); } i_mmap_unlock_read(mapping); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 47c3764c469b..b3ddceb003bc 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1096,6 +1096,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, return -ESRCH; mm = get_task_mm(task); if (mm) { + struct mmu_notifier_range range; struct clear_refs_private cp = { .type = type, }; @@ -1139,11 +1140,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, downgrade_write(&mm->mmap_sem); break; } - mmu_notifier_invalidate_range_start(mm, 0, -1); + + mmu_notifier_range_init(&range, mm, 0, -1UL); + mmu_notifier_invalidate_range_start(&range); } walk_page_range(0, mm->highest_vm_end, &clear_refs_walk); if (type == CLEAR_REFS_SOFT_DIRTY) - mmu_notifier_invalidate_range_end(mm, 0, -1); + mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, 0, -1); up_read(&mm->mmap_sem); out_mm: diff --git a/include/linux/mm.h b/include/linux/mm.h index 3c39b9dc7a90..ea1f12d15365 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1451,6 +1451,8 @@ struct mm_walk { void *private; }; +struct mmu_notifier_range; + int walk_page_range(unsigned long addr, unsigned long end, struct mm_walk *walk); int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk); @@ -1459,8 +1461,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); int follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 3d377805b29c..4050ec1c3b45 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -220,11 +220,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable); -extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end, +extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); +extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r, bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -268,33 +265,37 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, __mmu_notifier_change_pte(mm, address, pte); } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end, true); + if (mm_has_notifiers(range->mm)) { + range->blockable = true; + __mmu_notifier_invalidate_range_start(range); + } } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - return __mmu_notifier_invalidate_range_start(mm, start, end, false); + if (mm_has_notifiers(range->mm)) { + range->blockable = false; + return __mmu_notifier_invalidate_range_start(range); + } return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, false); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, false); } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, true); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -315,6 +316,17 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) __mmu_notifier_mm_destroy(mm); } + +static inline void mmu_notifier_range_init(struct mmu_notifier_range *range, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + range->mm = mm; + range->start = start; + range->end = end; +} + #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young; \ @@ -427,6 +439,23 @@ extern void mmu_notifier_call_srcu(struct rcu_head *rcu, #else /* CONFIG_MMU_NOTIFIER */ +struct mmu_notifier_range { + unsigned long start; + unsigned long end; +}; + +static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range, + unsigned long start, + unsigned long end) +{ + range->start = start; + range->end = end; +} + +#define mmu_notifier_range_init(range, mm, start, end) \ + _mmu_notifier_range_init(range, start, end) + + static inline int mm_has_notifiers(struct mm_struct *mm) { return 0; @@ -454,24 +483,24 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, { } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline +void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index abbd8da9ac21..8aef47ee7bfa 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -171,11 +171,11 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, .address = addr, }; int err; - /* For mmu_notifiers */ - const unsigned long mmun_start = addr; - const unsigned long mmun_end = addr + PAGE_SIZE; + struct mmu_notifier_range range; struct mem_cgroup *memcg; + mmu_notifier_range_init(&range, mm, addr, addr + PAGE_SIZE); + VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page); err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL, &memcg, @@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, /* For try_to_free_swap() and munlock_vma_page() below */ lock_page(old_page); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_start(&range); err = -EAGAIN; if (!page_vma_mapped_walk(&pvmw)) { mem_cgroup_cancel_charge(new_page, memcg, false); @@ -220,7 +220,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, err = 0; unlock: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); unlock_page(old_page); return err; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0c0e18409fde..05136ad0f325 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1134,8 +1134,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, int i; vm_fault_t ret = 0; struct page **pages; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; pages = kmalloc_array(HPAGE_PMD_NR, sizeof(struct page *), GFP_KERNEL); @@ -1173,9 +1172,9 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, cond_resched(); } - mmun_start = haddr; - mmun_end = haddr + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) @@ -1220,8 +1219,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, * No need to double call mmu_notifier->invalidate_range() callback as * the above pmdp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, - mmun_end); + mmu_notifier_invalidate_range_only_end(&range); ret |= VM_FAULT_WRITE; put_page(page); @@ -1231,7 +1229,7 @@ out: out_free_pages: spin_unlock(vmf->ptl); - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); for (i = 0; i < HPAGE_PMD_NR; i++) { memcg = (void *)page_private(pages[i]); set_page_private(pages[i], 0); @@ -1248,8 +1246,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) struct page *page = NULL, *new_page; struct mem_cgroup *memcg; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; gfp_t huge_gfp; /* for allocation and charge */ vm_fault_t ret = 0; @@ -1338,9 +1335,9 @@ alloc: vma, HPAGE_PMD_NR); __SetPageUptodate(new_page); - mmun_start = haddr; - mmun_end = haddr + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); spin_lock(vmf->ptl); if (page) @@ -1375,8 +1372,7 @@ out_mn: * No need to double call mmu_notifier->invalidate_range() callback as * the above pmdp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start, - mmun_end); + mmu_notifier_invalidate_range_only_end(&range); out: return ret; out_unlock: @@ -2015,14 +2011,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address) { spinlock_t *ptl; - struct mm_struct *mm = vma->vm_mm; - unsigned long haddr = address & HPAGE_PUD_MASK; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PUD_SIZE); - ptl = pud_lock(mm, pud); + mmu_notifier_range_init(&range, vma->vm_mm, address & HPAGE_PUD_MASK, + (address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE); + mmu_notifier_invalidate_range_start(&range); + ptl = pud_lock(vma->vm_mm, pud); if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud))) goto out; - __split_huge_pud_locked(vma, pud, haddr); + __split_huge_pud_locked(vma, pud, range.start); out: spin_unlock(ptl); @@ -2030,8 +2027,7 @@ out: * No need to double call mmu_notifier->invalidate_range() callback as * the above pudp_huge_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + - HPAGE_PUD_SIZE); + mmu_notifier_invalidate_range_only_end(&range); } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ @@ -2233,11 +2229,12 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, bool freeze, struct page *page) { spinlock_t *ptl; - struct mm_struct *mm = vma->vm_mm; - unsigned long haddr = address & HPAGE_PMD_MASK; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); - ptl = pmd_lock(mm, pmd); + mmu_notifier_range_init(&range, vma->vm_mm, address & HPAGE_PMD_MASK, + (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + ptl = pmd_lock(vma->vm_mm, pmd); /* * If caller asks to setup a migration entries, we need a page to check @@ -2253,7 +2250,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) goto out; - __split_huge_pmd_locked(vma, pmd, haddr, freeze); + __split_huge_pmd_locked(vma, pmd, range.start, freeze); out: spin_unlock(ptl); /* @@ -2269,8 +2266,7 @@ out: * any further changes to individual pte will notify. So no need * to call mmu_notifier->invalidate_range() */ - mmu_notifier_invalidate_range_only_end(mm, haddr, haddr + - HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_only_end(&range); } void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a80832487981..12000ba5c868 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3240,16 +3240,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, int cow; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; int ret = 0; cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; - mmun_start = vma->vm_start; - mmun_end = vma->vm_end; - if (cow) - mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end); + if (cow) { + mmu_notifier_range_init(&range, src, vma->vm_start, + vma->vm_end); + mmu_notifier_invalidate_range_start(&range); + } for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { spinlock_t *src_ptl, *dst_ptl; @@ -3325,7 +3325,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, } if (cow) - mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return ret; } @@ -3342,8 +3342,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, struct page *page; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); - unsigned long mmun_start = start; /* For mmu_notifiers */ - unsigned long mmun_end = end; /* For mmu_notifiers */ + struct mmu_notifier_range range; WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON(start & ~huge_page_mask(h)); @@ -3359,8 +3358,9 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, /* * If sharing possible, alert mmu notifiers of worst case. */ - adjust_range_if_pmd_sharing_possible(vma, &mmun_start, &mmun_end); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, start, end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); + mmu_notifier_invalidate_range_start(&range); address = start; for (; address < end; address += sz) { ptep = huge_pte_offset(mm, address, sz); @@ -3428,7 +3428,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, if (ref_page) break; } - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); tlb_end_vma(tlb, vma); } @@ -3546,9 +3546,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, struct page *old_page, *new_page; int outside_reserve = 0; vm_fault_t ret = 0; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ unsigned long haddr = address & huge_page_mask(h); + struct mmu_notifier_range range; pte = huge_ptep_get(ptep); old_page = pte_page(pte); @@ -3627,9 +3626,8 @@ retry_avoidcopy: __SetPageUptodate(new_page); set_page_huge_active(new_page); - mmun_start = haddr; - mmun_end = mmun_start + huge_page_size(h); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, haddr, haddr + huge_page_size(h)); + mmu_notifier_invalidate_range_start(&range); /* * Retake the page table lock to check for racing updates @@ -3642,7 +3640,7 @@ retry_avoidcopy: /* Break COW */ huge_ptep_clear_flush(vma, haddr, ptep); - mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range(mm, range.start, range.end); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, new_page, 1)); page_remove_rmap(old_page, true); @@ -3651,7 +3649,7 @@ retry_avoidcopy: new_page = old_page; } spin_unlock(ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out_release_all: restore_reserve_on_error(h, vma, haddr, new_page); put_page(new_page); @@ -4340,21 +4338,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, pte_t pte; struct hstate *h = hstate_vma(vma); unsigned long pages = 0; - unsigned long f_start = start; - unsigned long f_end = end; bool shared_pmd = false; + struct mmu_notifier_range range; /* * In the case of shared PMDs, the area to flush could be beyond - * start/end. Set f_start/f_end to cover the maximum possible + * start/end. Set range.start/range.end to cover the maximum possible * range if PMD sharing is possible. */ - adjust_range_if_pmd_sharing_possible(vma, &f_start, &f_end); + mmu_notifier_range_init(&range, mm, start, end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); BUG_ON(address >= end); - flush_cache_range(vma, f_start, f_end); + flush_cache_range(vma, range.start, range.end); - mmu_notifier_invalidate_range_start(mm, f_start, f_end); + mmu_notifier_invalidate_range_start(&range); i_mmap_lock_write(vma->vm_file->f_mapping); for (; address < end; address += huge_page_size(h)) { spinlock_t *ptl; @@ -4405,7 +4403,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * did unshare a page of pmds, flush the range corresponding to the pud. */ if (shared_pmd) - flush_hugetlb_tlb_range(vma, f_start, f_end); + flush_hugetlb_tlb_range(vma, range.start, range.end); else flush_hugetlb_tlb_range(vma, start, end); /* @@ -4415,7 +4413,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ i_mmap_unlock_write(vma->vm_file->f_mapping); - mmu_notifier_invalidate_range_end(mm, f_start, f_end); + mmu_notifier_invalidate_range_end(&range); return pages << h->order; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 43ce2f4d2551..4f017339ddb2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -944,8 +944,7 @@ static void collapse_huge_page(struct mm_struct *mm, int isolated = 0, result = 0; struct mem_cgroup *memcg; struct vm_area_struct *vma; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; gfp_t gfp; VM_BUG_ON(address & ~HPAGE_PMD_MASK); @@ -1017,9 +1016,8 @@ static void collapse_huge_page(struct mm_struct *mm, pte = pte_offset_map(pmd, address); pte_ptl = pte_lockptr(mm, pmd); - mmun_start = address; - mmun_end = address + HPAGE_PMD_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, address, address + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ /* * After this gup_fast can't run anymore. This also removes @@ -1029,7 +1027,7 @@ static void collapse_huge_page(struct mm_struct *mm, */ _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); spin_lock(pte_ptl); isolated = __collapse_huge_page_isolate(vma, address, pte); diff --git a/mm/ksm.c b/mm/ksm.c index 1a088306ef81..38c0360482fa 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1042,8 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, }; int swapped; int err = -EFAULT; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; pvmw.address = page_address_in_vma(page, vma); if (pvmw.address == -EFAULT) @@ -1051,9 +1050,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, BUG_ON(PageTransCompound(page)); - mmun_start = pvmw.address; - mmun_end = pvmw.address + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, pvmw.address, + pvmw.address + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); if (!page_vma_mapped_walk(&pvmw)) goto out_mn; @@ -1105,7 +1104,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, out_unlock: page_vma_mapped_walk_done(&pvmw); out_mn: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out: return err; } @@ -1129,8 +1128,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, spinlock_t *ptl; unsigned long addr; int err = -EFAULT; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; addr = page_address_in_vma(page, vma); if (addr == -EFAULT) @@ -1140,9 +1138,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, if (!pmd) goto out; - mmun_start = addr; - mmun_end = addr + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, addr, addr + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte_same(*ptep, orig_pte)) { @@ -1188,7 +1185,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pte_unmap_unlock(ptep, ptl); err = 0; out_mn: - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); out: return err; } diff --git a/mm/madvise.c b/mm/madvise.c index 6cb1ca93e290..21a7881a2db4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -458,29 +458,30 @@ static void madvise_free_page_range(struct mmu_gather *tlb, static int madvise_free_single_vma(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr) { - unsigned long start, end; struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; /* MADV_FREE works for only anon vma at the moment */ if (!vma_is_anonymous(vma)) return -EINVAL; - start = max(vma->vm_start, start_addr); - if (start >= vma->vm_end) + range.start = max(vma->vm_start, start_addr); + if (range.start >= vma->vm_end) return -EINVAL; - end = min(vma->vm_end, end_addr); - if (end <= vma->vm_start) + range.end = min(vma->vm_end, end_addr); + if (range.end <= vma->vm_start) return -EINVAL; + mmu_notifier_range_init(&range, mm, range.start, range.end); lru_add_drain(); - tlb_gather_mmu(&tlb, mm, start, end); + tlb_gather_mmu(&tlb, mm, range.start, range.end); update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, start, end); - madvise_free_page_range(&tlb, vma, start, end); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_invalidate_range_start(&range); + madvise_free_page_range(&tlb, vma, range.start, range.end); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, range.start, range.end); return 0; } diff --git a/mm/memory.c b/mm/memory.c index 4ad2d293ddc2..b7a8bfe5f5ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -973,8 +973,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ + struct mmu_notifier_range range; bool is_cow; int ret; @@ -1008,11 +1007,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, * is_cow_mapping() returns true. */ is_cow = is_cow_mapping(vma->vm_flags); - mmun_start = addr; - mmun_end = end; - if (is_cow) - mmu_notifier_invalidate_range_start(src_mm, mmun_start, - mmun_end); + + if (is_cow) { + mmu_notifier_range_init(&range, src_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); + } ret = 0; dst_pgd = pgd_offset(dst_mm, addr); @@ -1029,7 +1028,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, } while (dst_pgd++, src_pgd++, addr = next, addr != end); if (is_cow) - mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return ret; } @@ -1332,12 +1331,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; - mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); + mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr); + mmu_notifier_invalidate_range_start(&range); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); - mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); + mmu_notifier_invalidate_range_end(&range); } /** @@ -1351,18 +1351,18 @@ void unmap_vmas(struct mmu_gather *tlb, void zap_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long size) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; - unsigned long end = start + size; lru_add_drain(); - tlb_gather_mmu(&tlb, mm, start, end); - update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, start, end); - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) - unmap_single_vma(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_range_init(&range, vma->vm_mm, start, start + size); + tlb_gather_mmu(&tlb, vma->vm_mm, start, range.end); + update_hiwater_rss(vma->vm_mm); + mmu_notifier_invalidate_range_start(&range); + for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next) + unmap_single_vma(&tlb, vma, start, range.end, NULL); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, start, range.end); } /** @@ -1377,17 +1377,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; struct mmu_gather tlb; - unsigned long end = address + size; lru_add_drain(); - tlb_gather_mmu(&tlb, mm, address, end); - update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, address, end); - unmap_single_vma(&tlb, vma, address, end, details); - mmu_notifier_invalidate_range_end(mm, address, end); - tlb_finish_mmu(&tlb, address, end); + mmu_notifier_range_init(&range, vma->vm_mm, address, address + size); + tlb_gather_mmu(&tlb, vma->vm_mm, address, range.end); + update_hiwater_rss(vma->vm_mm); + mmu_notifier_invalidate_range_start(&range); + unmap_single_vma(&tlb, vma, address, range.end, details); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, address, range.end); } /** @@ -2247,9 +2247,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) struct page *new_page = NULL; pte_t entry; int page_copied = 0; - const unsigned long mmun_start = vmf->address & PAGE_MASK; - const unsigned long mmun_end = mmun_start + PAGE_SIZE; struct mem_cgroup *memcg; + struct mmu_notifier_range range; if (unlikely(anon_vma_prepare(vma))) goto oom; @@ -2272,7 +2271,9 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) __SetPageUptodate(new_page); - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, mm, vmf->address & PAGE_MASK, + (vmf->address & PAGE_MASK) + PAGE_SIZE); + mmu_notifier_invalidate_range_start(&range); /* * Re-check the pte - we dropped the lock @@ -2349,7 +2350,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * No need to double call mmu_notifier->invalidate_range() callback as * the above ptep_clear_flush_notify() did already call it. */ - mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_only_end(&range); if (old_page) { /* * Don't let another task, with possibly unlocked vma, @@ -4030,7 +4031,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) #endif /* __PAGETABLE_PMD_FOLDED */ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, + struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { pgd_t *pgd; @@ -4058,10 +4059,10 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, if (!pmdpp) goto out; - if (start && end) { - *start = address & PMD_MASK; - *end = *start + PMD_SIZE; - mmu_notifier_invalidate_range_start(mm, *start, *end); + if (range) { + mmu_notifier_range_init(range, mm, address & PMD_MASK, + (address & PMD_MASK) + PMD_SIZE); + mmu_notifier_invalidate_range_start(range); } *ptlp = pmd_lock(mm, pmd); if (pmd_huge(*pmd)) { @@ -4069,17 +4070,17 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, return 0; } spin_unlock(*ptlp); - if (start && end) - mmu_notifier_invalidate_range_end(mm, *start, *end); + if (range) + mmu_notifier_invalidate_range_end(range); } if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) goto out; - if (start && end) { - *start = address & PAGE_MASK; - *end = *start + PAGE_SIZE; - mmu_notifier_invalidate_range_start(mm, *start, *end); + if (range) { + range->start = address & PAGE_MASK; + range->end = range->start + PAGE_SIZE; + mmu_notifier_invalidate_range_start(range); } ptep = pte_offset_map_lock(mm, pmd, address, ptlp); if (!pte_present(*ptep)) @@ -4088,8 +4089,8 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, return 0; unlock: pte_unmap_unlock(ptep, *ptlp); - if (start && end) - mmu_notifier_invalidate_range_end(mm, *start, *end); + if (range) + mmu_notifier_invalidate_range_end(range); out: return -EINVAL; } @@ -4101,20 +4102,20 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, NULL, NULL, + !(res = __follow_pte_pmd(mm, address, NULL, ptepp, NULL, ptlp))); return res; } int follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) { int res; /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, start, end, + !(res = __follow_pte_pmd(mm, address, range, ptepp, pmdpp, ptlp))); return res; } diff --git a/mm/migrate.c b/mm/migrate.c index acda06f99754..462163f5f278 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2299,6 +2299,7 @@ next: */ static void migrate_vma_collect(struct migrate_vma *migrate) { + struct mmu_notifier_range range; struct mm_walk mm_walk; mm_walk.pmd_entry = migrate_vma_collect_pmd; @@ -2310,13 +2311,11 @@ static void migrate_vma_collect(struct migrate_vma *migrate) mm_walk.mm = migrate->vma->vm_mm; mm_walk.private = migrate; - mmu_notifier_invalidate_range_start(mm_walk.mm, - migrate->start, - migrate->end); + mmu_notifier_range_init(&range, mm_walk.mm, migrate->start, + migrate->end); + mmu_notifier_invalidate_range_start(&range); walk_page_range(migrate->start, migrate->end, &mm_walk); - mmu_notifier_invalidate_range_end(mm_walk.mm, - migrate->start, - migrate->end); + mmu_notifier_invalidate_range_end(&range); migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT); } @@ -2697,9 +2696,8 @@ static void migrate_vma_pages(struct migrate_vma *migrate) { const unsigned long npages = migrate->npages; const unsigned long start = migrate->start; - struct vm_area_struct *vma = migrate->vma; - struct mm_struct *mm = vma->vm_mm; - unsigned long addr, i, mmu_start; + struct mmu_notifier_range range; + unsigned long addr, i; bool notified = false; for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) { @@ -2718,11 +2716,12 @@ static void migrate_vma_pages(struct migrate_vma *migrate) continue; } if (!notified) { - mmu_start = addr; notified = true; - mmu_notifier_invalidate_range_start(mm, - mmu_start, - migrate->end); + + mmu_notifier_range_init(&range, + migrate->vma->vm_mm, + addr, migrate->end); + mmu_notifier_invalidate_range_start(&range); } migrate_vma_insert_page(migrate, addr, newpage, &migrate->src[i], @@ -2763,8 +2762,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate) * did already call it. */ if (notified) - mmu_notifier_invalidate_range_only_end(mm, mmu_start, - migrate->end); + mmu_notifier_invalidate_range_only_end(&range); } /* diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 74a7dc3d11c8..9c884abc7850 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -167,28 +167,20 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, srcu_read_unlock(&srcu, id); } -int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable) +int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { - struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int ret = 0; int id; - range->blockable = blockable; - range->start = start; - range->end = end; - range->mm = mm; - id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { + hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) { if (mn->ops->invalidate_range_start) { int _ret = mn->ops->invalidate_range_start(mn, range); if (_ret) { pr_info("%pS callback failed with %d in %sblockable context.\n", - mn->ops->invalidate_range_start, _ret, - !blockable ? "non-" : ""); + mn->ops->invalidate_range_start, _ret, + !range->blockable ? "non-" : ""); ret = _ret; } } @@ -199,27 +191,14 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, } EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start); -void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, - unsigned long end, +void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range, bool only_end) { - struct mmu_notifier_range _range, *range = &_range; struct mmu_notifier *mn; int id; - /* - * The end call back will never be call if the start refused to go - * through because of blockable was false so here assume that we - * can block. - */ - range->blockable = true; - range->start = start; - range->end = end; - range->mm = mm; - id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) { + hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) { /* * Call invalidate_range here too to avoid the need for the * subsystem of having to register an invalidate_range_end @@ -234,7 +213,9 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, * already happen under page table lock. */ if (!only_end && mn->ops->invalidate_range) - mn->ops->invalidate_range(mn, mm, start, end); + mn->ops->invalidate_range(mn, range->mm, + range->start, + range->end); if (mn->ops->invalidate_range_end) mn->ops->invalidate_range_end(mn, range); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 6d331620b9e5..36cb358db170 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -167,11 +167,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pgprot_t newprot, int dirty_accountable, int prot_numa) { pmd_t *pmd; - struct mm_struct *mm = vma->vm_mm; unsigned long next; unsigned long pages = 0; unsigned long nr_huge_updates = 0; - unsigned long mni_start = 0; + struct mmu_notifier_range range; + + range.start = 0; pmd = pmd_offset(pud, addr); do { @@ -183,9 +184,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, goto next; /* invoke the mmu notifier if the pmd is populated */ - if (!mni_start) { - mni_start = addr; - mmu_notifier_invalidate_range_start(mm, mni_start, end); + if (!range.start) { + mmu_notifier_range_init(&range, vma->vm_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); } if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { @@ -214,8 +215,8 @@ next: cond_resched(); } while (pmd++, addr = next, addr != end); - if (mni_start) - mmu_notifier_invalidate_range_end(mm, mni_start, end); + if (range.start) + mmu_notifier_invalidate_range_end(&range); if (nr_huge_updates) count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); diff --git a/mm/mremap.c b/mm/mremap.c index 7f9f9180e401..def01d86e36f 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -197,16 +197,14 @@ unsigned long move_page_tables(struct vm_area_struct *vma, bool need_rmap_locks) { unsigned long extent, next, old_end; + struct mmu_notifier_range range; pmd_t *old_pmd, *new_pmd; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ old_end = old_addr + len; flush_cache_range(vma, old_addr, old_end); - mmun_start = old_addr; - mmun_end = old_end; - mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_range_init(&range, vma->vm_mm, old_addr, old_end); + mmu_notifier_invalidate_range_start(&range); for (; old_addr < old_end; old_addr += extent, new_addr += extent) { cond_resched(); @@ -247,7 +245,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, new_pmd, new_addr, need_rmap_locks); } - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); + mmu_notifier_invalidate_range_end(&range); return len + old_addr - old_end; /* how much done */ } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5442cb12e4ed..f0e8cd9edb1a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -528,19 +528,20 @@ bool __oom_reap_task_mm(struct mm_struct *mm) * count elevated without a good reason. */ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { - const unsigned long start = vma->vm_start; - const unsigned long end = vma->vm_end; + struct mmu_notifier_range range; struct mmu_gather tlb; - tlb_gather_mmu(&tlb, mm, start, end); - if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) { - tlb_finish_mmu(&tlb, start, end); + mmu_notifier_range_init(&range, mm, vma->vm_start, + vma->vm_end); + tlb_gather_mmu(&tlb, mm, range.start, range.end); + if (mmu_notifier_invalidate_range_start_nonblock(&range)) { + tlb_finish_mmu(&tlb, range.start, range.end); ret = false; continue; } - unmap_page_range(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); + unmap_page_range(&tlb, vma, range.start, range.end, NULL); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, range.start, range.end); } } diff --git a/mm/rmap.c b/mm/rmap.c index 85b7f9423352..c75f72f6fe0e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -889,15 +889,17 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, .address = address, .flags = PVMW_SYNC, }; - unsigned long start = address, end; + struct mmu_notifier_range range; int *cleaned = arg; /* * We have to assume the worse case ie pmd for invalidation. Note that * the page can not be free from this function. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); - mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + mmu_notifier_range_init(&range, vma->vm_mm, address, + min(vma->vm_end, address + + (PAGE_SIZE << compound_order(page)))); + mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { unsigned long cstart; @@ -949,7 +951,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, (*cleaned)++; } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); return true; } @@ -1345,7 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, pte_t pteval; struct page *subpage; bool ret = true; - unsigned long start = address, end; + struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)arg; /* munlock has nothing to gain from examining un-locked vmas */ @@ -1369,15 +1371,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * Note that the page can not be free in this function as call of * try_to_unmap() must hold a reference on the page. */ - end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + mmu_notifier_range_init(&range, vma->vm_mm, vma->vm_start, + min(vma->vm_end, vma->vm_start + + (PAGE_SIZE << compound_order(page)))); if (PageHuge(page)) { /* * If sharing is possible, start and end will be adjusted * accordingly. */ - adjust_range_if_pmd_sharing_possible(vma, &start, &end); + adjust_range_if_pmd_sharing_possible(vma, &range.start, + &range.end); } - mmu_notifier_invalidate_range_start(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION @@ -1428,9 +1433,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * we must flush them all. start/end were * already adjusted above to cover this range. */ - flush_cache_range(vma, start, end); - flush_tlb_range(vma, start, end); - mmu_notifier_invalidate_range(mm, start, end); + flush_cache_range(vma, range.start, range.end); + flush_tlb_range(vma, range.start, range.end); + mmu_notifier_invalidate_range(mm, range.start, + range.end); /* * The ref count of the PMD page was dropped @@ -1650,7 +1656,7 @@ discard: put_page(page); } - mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); + mmu_notifier_invalidate_range_end(&range); return ret; } -- cgit v1.2.3-59-g8ed1b From 0614ce9776b037b6a08a9adcbfcc382c0053b178 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:38:13 -0800 Subject: include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages() offline_pages() is already declared in this file. Just remove the duplicated one. Link: http://lkml.kernel.org/r/20181205031357.24769-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8ed6e09a5c0c..07da5c6c5ba0 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -331,7 +331,6 @@ extern int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); -extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); extern int sparse_add_one_section(int nid, unsigned long start_pfn, struct vmem_altmap *altmap); -- cgit v1.2.3-59-g8ed1b From 7635d9cbe8327e131a1d3d8517dc186c2796ce2e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:38:21 -0800 Subject: mm, thp, proc: report THP eligibility for each vma Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc//smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Dan Williams Cc: David Rientjes Cc: Jan Kara Cc: Mike Rapoport Cc: Paul Oppenheimer Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 3 +++ fs/proc/task_mmu.c | 2 ++ include/linux/huge_mm.h | 13 ++++++++++++- mm/huge_memory.c | 12 +++++++++++- mm/memory.c | 4 ++-- 5 files changed, 30 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2a4e63f5122c..cd465304bec4 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -425,6 +425,7 @@ SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB +THPeligible: 0 VmFlags: rd ex mr mw me dw the first of these lines shows the same information as is displayed for the @@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. +"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if +true, 0 otherwise. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter encoded diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index b3ddceb003bc..f0ec9edab2f3 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss); + seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma)); + if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); show_smap_vma_flags(m, vma); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4663ee96cf59..381e872bfde0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma); extern unsigned long transparent_hugepage_flags; -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) +/* + * to be used on vmas which are known to support THP. + * Use transparent_hugepage_enabled otherwise + */ +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) { if (vma->vm_flags & VM_NOHUGEPAGE) return false; @@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) return false; } +bool transparent_hugepage_enabled(struct vm_area_struct *vma); + #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<ptl); alloc: - if (transparent_hugepage_enabled(vma) && + if (__transparent_hugepage_enabled(vma) && !transparent_hugepage_debug_cow()) { huge_gfp = alloc_hugepage_direct_gfpmask(vma); new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER); diff --git a/mm/memory.c b/mm/memory.c index b7a8bfe5f5ec..2dd2f9ab57f4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3831,7 +3831,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pud = pud_alloc(mm, p4d, address); if (!vmf.pud) return VM_FAULT_OOM; - if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) { + if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -3857,7 +3857,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pmd = pmd_alloc(mm, vmf.pud, address); if (!vmf.pmd) return VM_FAULT_OOM; - if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) { + if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; -- cgit v1.2.3-59-g8ed1b From 125b860b251ad226b1384b6db06be37485127f69 Mon Sep 17 00:00:00 2001 From: Pingfan Liu Date: Fri, 28 Dec 2018 00:38:43 -0800 Subject: mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code. If someone adds extra migrate type, then he may forget to enlarge the NR_PAGEBLOCK_BITS. Hence it requires some way to fix. NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on two different .h file with reverse dependency, it is a little hard to refer to MIGRATE_TYPES in pageblock-flag.h. This patch tries to remind such relation in compiling-time. Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu Reviewed-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Michal Hocko Cc: Pavel Tatashin Cc: Oscar Salvador Cc: Mike Rapoport Cc: Joonsoo Kim Cc: Alexander Duyck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pageblock-flags.h | 3 ++- mm/page_alloc.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 9132c5cb41f1..06a66327333d 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -25,10 +25,11 @@ #include +#define PB_migratetype_bits 3 /* Bit indices that affect a whole block of pages */ enum pageblock_bits { PB_migrate, - PB_migrate_end = PB_migrate + 3 - 1, + PB_migrate_end = PB_migrate + PB_migratetype_bits - 1, /* 3 bits required for migrate types */ PB_migrate_skip,/* If set the block is skipped by compaction */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b4fcf211ca69..cd1c9d32ef9a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -431,6 +431,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags, unsigned long old_word, word; BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); + BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); bitmap = get_pageblock_bitmap(page, pfn); bitidx = pfn_to_bitidx(page, pfn); -- cgit v1.2.3-59-g8ed1b From 89cb0888ca1483ad72648844ddd1b801863a8949 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 28 Dec 2018 00:39:12 -0800 Subject: mm: migrate: provide buffer_migrate_page_norefs() Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages. [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz Signed-off-by: Jan Kara Acked-by: Mel Gorman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 4 ++++ mm/migrate.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++------- 2 files changed, 57 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index 26a8607b3c3c..1cda6648a41f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3269,8 +3269,12 @@ extern int generic_check_addressable(unsigned, u64); extern int buffer_migrate_page(struct address_space *, struct page *, struct page *, enum migrate_mode); +extern int buffer_migrate_page_norefs(struct address_space *, + struct page *, struct page *, + enum migrate_mode); #else #define buffer_migrate_page NULL +#define buffer_migrate_page_norefs NULL #endif extern int setattr_prepare(struct dentry *, struct iattr *); diff --git a/mm/migrate.c b/mm/migrate.c index 8392140fb298..8dd57601714f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -743,13 +743,9 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, return true; } -/* - * Migration function for pages with buffers. This function can only be used - * if the underlying filesystem guarantees that no other references to "page" - * exist. - */ -int buffer_migrate_page(struct address_space *mapping, - struct page *newpage, struct page *page, enum migrate_mode mode) +static int __buffer_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, enum migrate_mode mode, + bool check_refs) { struct buffer_head *bh, *head; int rc; @@ -767,6 +763,33 @@ int buffer_migrate_page(struct address_space *mapping, if (!buffer_migrate_lock_buffers(head, mode)) return -EAGAIN; + if (check_refs) { + bool busy; + bool invalidated = false; + +recheck_buffers: + busy = false; + spin_lock(&mapping->private_lock); + bh = head; + do { + if (atomic_read(&bh->b_count)) { + busy = true; + break; + } + bh = bh->b_this_page; + } while (bh != head); + spin_unlock(&mapping->private_lock); + if (busy) { + if (invalidated) { + rc = -EAGAIN; + goto unlock_buffers; + } + invalidate_bh_lrus(); + invalidated = true; + goto recheck_buffers; + } + } + rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; @@ -803,7 +826,30 @@ unlock_buffers: return rc; } + +/* + * Migration function for pages with buffers. This function can only be used + * if the underlying filesystem guarantees that no other references to "page" + * exist. For example attached buffer heads are accessed only under page lock. + */ +int buffer_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, enum migrate_mode mode) +{ + return __buffer_migrate_page(mapping, newpage, page, mode, false); +} EXPORT_SYMBOL(buffer_migrate_page); + +/* + * Same as above except that this variant is more careful and checks that there + * are also no buffer head references. This function is the right one for + * mappings where buffer heads are directly looked up and referenced (such as + * block device mappings). + */ +int buffer_migrate_page_norefs(struct address_space *mapping, + struct page *newpage, struct page *page, enum migrate_mode mode) +{ + return __buffer_migrate_page(mapping, newpage, page, mode, true); +} #endif /* -- cgit v1.2.3-59-g8ed1b From ab41ee6879981b3d3a16a1079a33fa6fd043eb3c Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 28 Dec 2018 00:39:20 -0800 Subject: mm: migrate: drop unused argument of migrate_page_move_mapping() All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara Acked-by: Mel Gorman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/aio.c | 2 +- fs/f2fs/data.c | 2 +- fs/iomap.c | 2 +- fs/ubifs/file.c | 2 +- include/linux/migrate.h | 3 +-- mm/migrate.c | 7 +++---- 6 files changed, 8 insertions(+), 10 deletions(-) (limited to 'include') diff --git a/fs/aio.c b/fs/aio.c index aac9659381d2..bc401d5bcdc2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -409,7 +409,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, BUG_ON(PageWriteback(old)); get_page(new); - rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); + rc = migrate_page_move_mapping(mapping, new, old, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); goto out_unlock; diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index b293cb3e27a2..008b74eff00d 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -2738,7 +2738,7 @@ int f2fs_migrate_page(struct address_space *mapping, */ extra_count = (atomic_written ? 1 : 0) - page_has_private(page); rc = migrate_page_move_mapping(mapping, newpage, - page, NULL, mode, extra_count); + page, mode, extra_count); if (rc != MIGRATEPAGE_SUCCESS) { if (atomic_written) mutex_unlock(&fi->inmem_lock); diff --git a/fs/iomap.c b/fs/iomap.c index ce837d962d47..2d9b93cc4930 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -563,7 +563,7 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage, { int ret; - ret = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); + ret = migrate_page_move_mapping(mapping, newpage, page, mode, 0); if (ret != MIGRATEPAGE_SUCCESS) return ret; diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index 1b78f2e09218..5d2ffb1a45fc 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1481,7 +1481,7 @@ static int ubifs_migrate_page(struct address_space *mapping, { int rc; - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); + rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 617615fa11ce..e13d9bf2f9a5 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -77,8 +77,7 @@ extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, - struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode, + struct page *newpage, struct page *page, enum migrate_mode mode, int extra_count); #else diff --git a/mm/migrate.c b/mm/migrate.c index 8dd57601714f..4389696fba0e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -399,8 +399,7 @@ static int expected_page_refs(struct page *page) * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. */ int migrate_page_move_mapping(struct address_space *mapping, - struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode, + struct page *newpage, struct page *page, enum migrate_mode mode, int extra_count) { XA_STATE(xas, &mapping->i_pages, page_index(page)); @@ -687,7 +686,7 @@ int migrate_page(struct address_space *mapping, BUG_ON(PageWriteback(page)); /* Writeback must be complete */ - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); + rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; @@ -790,7 +789,7 @@ recheck_buffers: } } - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); + rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; -- cgit v1.2.3-59-g8ed1b From af3b854492f351d1ff3b4744a83bf5ff7eed4920 Mon Sep 17 00:00:00 2001 From: Benjamin Poirier Date: Fri, 28 Dec 2018 00:39:23 -0800 Subject: mm/page_alloc.c: allow error injection Model call chain after should_failslab(). Likewise, we can now use a kprobe to override the return value of should_fail_alloc_page() and inject allocation failures into alloc_page*(). This will allow injecting allocation failures using the BCC tools even without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a fail_page_alloc= parameter, which incurs some overhead even when failures are not being injected. On the other hand, this patch adds an unconditional call to should_fail_alloc_page() from page allocation hotpath. That overhead should be rather negligible with CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though. [vbabka@suse.cz: changelog addition] Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com Signed-off-by: Benjamin Poirier Acked-by: Vlastimil Babka Cc: Arnd Bergmann Cc: Michal Hocko Cc: Pavel Tatashin Cc: Oscar Salvador Cc: Mike Rapoport Cc: Joonsoo Kim Cc: Alexander Duyck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/asm-generic/error-injection.h | 1 + mm/page_alloc.c | 10 ++++++++-- 2 files changed, 9 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h index 296c65442f00..95a159a4137f 100644 --- a/include/asm-generic/error-injection.h +++ b/include/asm-generic/error-injection.h @@ -8,6 +8,7 @@ enum { EI_ETYPE_NULL, /* Return NULL if failure */ EI_ETYPE_ERRNO, /* Return -ERRNO if failure */ EI_ETYPE_ERRNO_NULL, /* Return -ERRNO or NULL if failure */ + EI_ETYPE_TRUE, /* Return true if failure */ }; struct error_injection_entry { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 75865e1325b5..cde5dac6229a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3131,7 +3131,7 @@ static int __init setup_fail_page_alloc(char *str) } __setup("fail_page_alloc=", setup_fail_page_alloc); -static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { if (order < fail_page_alloc.min_order) return false; @@ -3181,13 +3181,19 @@ late_initcall(fail_page_alloc_debugfs); #else /* CONFIG_FAIL_PAGE_ALLOC */ -static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { return false; } #endif /* CONFIG_FAIL_PAGE_ALLOC */ +static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +{ + return __should_fail_alloc_page(gfp_mask, order); +} +ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); + /* * Return true if free base pages are above 'mark'. For high-order checks it * will return true of the order-0 watermark is reached and there is at least -- cgit v1.2.3-59-g8ed1b From 4918e7625ffa82f388ea70538f0e1df20ea35a54 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:39:27 -0800 Subject: include/linux/vmstat.h: remove unused page state adjustment macro These four macro are not used anymore. Just remove them. Link: http://lkml.kernel.org/r/20181214063211.2290-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmstat.h | 5 ----- 1 file changed, 5 deletions(-) (limited to 'include') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index f25cef84b41d..2db8d60981fe 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -239,11 +239,6 @@ extern unsigned long node_page_state(struct pglist_data *pgdat, #define node_page_state(node, item) global_node_page_state(item) #endif /* CONFIG_NUMA */ -#define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d) -#define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d)) -#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d) -#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d)) - #ifdef CONFIG_SMP void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long); void __inc_zone_page_state(struct page *, enum zone_stat_item); -- cgit v1.2.3-59-g8ed1b From 063a7d1d3623db31ca5d2309cab6030ebf93b72f Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:39:46 -0800 Subject: mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The kbuild robot reported the following on a development branch that used memremap.h in a new path: In file included from arch/m68k/include/asm/pgtable_mm.h:148:0, from arch/m68k/include/asm/pgtable.h:5, from include/linux/memremap.h:7, from drivers//dax/bus.c:3: arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset': >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct' return mm->pgd + pgd_index(address); ^~ The ->page_fault() callback is specific to HMM. Move it to 'struct hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in include/linux/hmm.h. Longer term refactoring this dependency out of HMM is recommended, but in the meantime memremap.h remains generic. Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...") Signed-off-by: Dan Williams Reviewed-by: "Jérôme Glisse" Cc: Logan Gunthorpe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hmm.h | 24 ++++++++++++++++++++++++ include/linux/memremap.h | 32 -------------------------------- kernel/memremap.c | 6 +++++- mm/hmm.c | 4 ++-- 4 files changed, 31 insertions(+), 35 deletions(-) (limited to 'include') diff --git a/include/linux/hmm.h b/include/linux/hmm.h index ed89fbc525d2..66f9ebbb1df3 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -69,6 +69,7 @@ #define LINUX_HMM_H #include +#include #if IS_ENABLED(CONFIG_HMM) @@ -486,6 +487,7 @@ struct hmm_devmem_ops { * @device: device to bind resource to * @ops: memory operations callback * @ref: per CPU refcount + * @page_fault: callback when CPU fault on an unaddressable device page * * This an helper structure for device drivers that do not wish to implement * the gory details related to hotplugging new memoy and allocating struct @@ -493,7 +495,28 @@ struct hmm_devmem_ops { * * Device drivers can directly use ZONE_DEVICE memory on their own if they * wish to do so. + * + * The page_fault() callback must migrate page back, from device memory to + * system memory, so that the CPU can access it. This might fail for various + * reasons (device issues, device have been unplugged, ...). When such error + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and + * set the CPU page table entry to "poisoned". + * + * Note that because memory cgroup charges are transferred to the device memory, + * this should never fail due to memory restrictions. However, allocation + * of a regular system page might still fail because we are out of memory. If + * that happens, the page_fault() callback must return VM_FAULT_OOM. + * + * The page_fault() callback can also try to migrate back multiple pages in one + * chunk, as an optimization. It must, however, prioritize the faulting address + * over all the others. */ +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, + unsigned long addr, + const struct page *page, + unsigned int flags, + pmd_t *pmdp); + struct hmm_devmem { struct completion completion; unsigned long pfn_first; @@ -503,6 +526,7 @@ struct hmm_devmem { struct dev_pagemap pagemap; const struct hmm_devmem_ops *ops; struct percpu_ref ref; + dev_page_fault_t page_fault; }; /* diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 55db66b3716f..f0628660d541 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -4,8 +4,6 @@ #include #include -#include - struct resource; struct device; @@ -66,47 +64,18 @@ enum memory_type { }; /* - * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two - * callbacks: - * page_fault() - * page_free() - * * Additional notes about MEMORY_DEVICE_PRIVATE may be found in * include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief * explanation in include/linux/memory_hotplug.h. * - * The page_fault() callback must migrate page back, from device memory to - * system memory, so that the CPU can access it. This might fail for various - * reasons (device issues, device have been unplugged, ...). When such error - * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and - * set the CPU page table entry to "poisoned". - * - * Note that because memory cgroup charges are transferred to the device memory, - * this should never fail due to memory restrictions. However, allocation - * of a regular system page might still fail because we are out of memory. If - * that happens, the page_fault() callback must return VM_FAULT_OOM. - * - * The page_fault() callback can also try to migrate back multiple pages in one - * chunk, as an optimization. It must, however, prioritize the faulting address - * over all the others. - * - * * The page_free() callback is called once the page refcount reaches 1 * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug. * This allows the device driver to implement its own memory management.) - * - * For MEMORY_DEVICE_PUBLIC only the page_free() callback matter. */ -typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, - unsigned long addr, - const struct page *page, - unsigned int flags, - pmd_t *pmdp); typedef void (*dev_page_free_t)(struct page *page, void *data); /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings - * @page_fault: callback when CPU fault on an unaddressable device page * @page_free: free page callback when page refcount reaches 1 * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref @@ -117,7 +86,6 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @type: memory type: see MEMORY_* in memory_hotplug.h */ struct dev_pagemap { - dev_page_fault_t page_fault; dev_page_free_t page_free; struct vmem_altmap altmap; bool altmap_valid; diff --git a/kernel/memremap.c b/kernel/memremap.c index 0d5603d76c37..a856cb5ff192 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -11,6 +11,7 @@ #include #include #include +#include static DEFINE_XARRAY(pgmap_array); #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) @@ -24,6 +25,9 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct *vma, pmd_t *pmdp) { struct page *page = device_private_entry_to_page(entry); + struct hmm_devmem *devmem; + + devmem = container_of(page->pgmap, typeof(*devmem), pagemap); /* * The page_fault() callback must migrate page back to system memory @@ -39,7 +43,7 @@ vm_fault_t device_private_entry_fault(struct vm_area_struct *vma, * There is a more in-depth description of what that callback can and * cannot do, in include/linux/memremap.h */ - return page->pgmap->page_fault(vma, addr, page, flags, pmdp); + return devmem->page_fault(vma, addr, page, flags, pmdp); } EXPORT_SYMBOL(device_private_entry_fault); #endif /* CONFIG_DEVICE_PRIVATE */ diff --git a/mm/hmm.c b/mm/hmm.c index 789587731217..a04e4b810610 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -1087,10 +1087,10 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); + devmem->page_fault = hmm_devmem_fault; devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; devmem->pagemap.res = *devmem->resource; - devmem->pagemap.page_fault = hmm_devmem_fault; devmem->pagemap.page_free = hmm_devmem_free; devmem->pagemap.altmap_valid = false; devmem->pagemap.ref = &devmem->ref; @@ -1141,10 +1141,10 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); + devmem->page_fault = hmm_devmem_fault; devmem->pagemap.type = MEMORY_DEVICE_PUBLIC; devmem->pagemap.res = *devmem->resource; - devmem->pagemap.page_fault = hmm_devmem_fault; devmem->pagemap.page_free = hmm_devmem_free; devmem->pagemap.altmap_valid = false; devmem->pagemap.ref = &devmem->ref; -- cgit v1.2.3-59-g8ed1b From 70c6066e19c15749b579dde7d5722c7d7fb05d57 Mon Sep 17 00:00:00 2001 From: Kyle Spiers Date: Fri, 28 Dec 2018 00:39:49 -0800 Subject: include/linux/gfp.h: fix typo Fix misspelled "satisfied" Link: http://lkml.kernel.org/r/20181227232354.64562-1-ksspiers@google.com Signed-off-by: Kyle Spiers Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 0705164f928c..5f5e25fd6149 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -81,7 +81,7 @@ struct vm_area_struct; * * %__GFP_HARDWALL enforces the cpuset memory allocation policy. * - * %__GFP_THISNODE forces the allocation to be satisified from the requested + * %__GFP_THISNODE forces the allocation to be satisfied from the requested * node with no fallbacks or placement policy enforcements. * * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg. -- cgit v1.2.3-59-g8ed1b