aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2018-05-29kconfig: support user-defined function and recursively expanded variableMasahiro Yamada4-4/+120
Now, we got a basic ability to test compiler capability in Kconfig. config CC_HAS_STACKPROTECTOR def_bool $(shell,($(CC) -Werror -fstack-protector -E -x c /dev/null -o /dev/null 2>/dev/null) && echo y || echo n) This works, but it is ugly to repeat this long boilerplate. We want to describe like this: config CC_HAS_STACKPROTECTOR bool default $(cc-option,-fstack-protector) It is straight-forward to add a new function, but I do not like to hard-code specialized functions like that. Hence, here is another feature, user-defined function. This works as a textual shorthand with parameterization. A user-defined function is defined by using the = operator, and can be referenced in the same way as built-in functions. A user-defined function in Make is referenced like $(call my-func,arg1,arg2), but I omitted the 'call' to make the syntax shorter. The definition of a user-defined function contains $(1), $(2), etc. in its body to reference the parameters. It is grammatically valid to pass more or fewer arguments when calling it. We already exploit this feature in our makefiles; scripts/Kbuild.include defines cc-option which takes two arguments at most, but most of the callers pass only one argument. By the way, a variable is supported as a subset of this feature since a variable is "a user-defined function with zero argument". In this context, I mean "variable" as recursively expanded variable. I will add a different flavored variable in the next commit. The code above can be written as follows: [Example Code] success = $(shell,($(1)) >/dev/null 2>&1 && echo y || echo n) cc-option = $(success,$(CC) -Werror $(1) -E -x c /dev/null -o /dev/null) config CC_HAS_STACKPROTECTOR def_bool $(cc-option,-fstack-protector) [Result] $ make -s alldefconfig && tail -n 1 .config CONFIG_CC_HAS_STACKPROTECTOR=y Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-29kconfig: begin PARAM state only when seeing a command keywordMasahiro Yamada1-1/+1
Currently, any statement line starts with a keyword with TF_COMMAND flag. So, the following three lines are dead code. alloc_string(yytext, yyleng); zconflval.string = text; return T_WORD; If a T_WORD token is returned in this context, it will cause syntax error in the parser anyway. The next commit will support the assignment statement where a line starts with an arbitrary identifier. So, I want the lexer to switch to the PARAM state only when it sees a command keyword. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-29kconfig: replace $(UNAME_RELEASE) with function callMasahiro Yamada2-4/+3
Now that 'shell' function is supported, this can be self-contained in Kconfig. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ulf Magnusson <ulfalizer@gmail.com>
2018-05-29kconfig: add 'shell' built-in functionMasahiro Yamada1-0/+41
This accepts a single command to execute. It returns the standard output from it. [Example code] config HELLO string default "$(shell,echo hello world)" config Y def_bool $(shell,echo y) [Result] $ make -s alldefconfig && tail -n 2 .config CONFIG_HELLO="hello world" CONFIG_Y=y Caveat: Like environments, functions are expanded in the lexer. You cannot pass symbols to function arguments. This is a limitation to simplify the implementation. I want to avoid the dynamic function evaluation, which would introduce much more complexity. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-29kconfig: add built-in function supportMasahiro Yamada1-12/+130
This commit adds a new concept 'function' to do more text processing in Kconfig. A function call looks like this: $(function,arg1,arg2,arg3,...) This commit adds the basic infrastructure to expand functions. Change the text expansion helpers to take arguments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-29kconfig: make default prompt of mainmenu less specificMasahiro Yamada2-2/+2
If "mainmenu" is not specified, "Linux Kernel Configuration" is used as a default prompt. Given that Kconfig is used in other projects than Linux, let's use a more generic prompt, "Main menu". Suggested-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-29kconfig: remove sym_expand_string_value()Masahiro Yamada2-54/+0
There is no more caller of sym_expand_string_value(). Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2018-05-29kconfig: remove string expansion for mainmenu after yyparse()Masahiro Yamada1-19/+5
Now that environments are expanded in the lexer, conf_parse() does not need to expand them explicitly. The hack introduced by commit 0724a7c32a54 ("kconfig: Don't leak main menus during parsing") can go away. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ulf Magnusson <ulfalizer@gmail.com>
2018-05-29kconfig: remove string expansion in file_lookup()Masahiro Yamada1-3/+1
There are two callers of file_lookup(), but there is no more reason to expand the given path. [1] zconf_initscan() This is used to open the first Kconfig. sym_expand_string_value() has never been used in a useful way here; before opening the first Kconfig file, obviously there is no symbol to expand. If you use expand_string_value() instead, environments in KBUILD_KCONFIG would be expanded, but I do not see practical benefits for that. [2] zconf_nextfile() This is used to open the next file from 'source' statement. Symbols in the path like "arch/$SRCARCH/Kconfig" needed expanding, but it was replaced with the direct environment expansion. The environment has already been expanded before the token is passed to the parser. By the way, file_lookup() was already buggy; it expanded a given path, but it used the path before expansion for look-up: if (!strcmp(name, file->name)) { Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ulf Magnusson <ulfalizer@gmail.com>
2018-05-29kconfig: reference environment variables directly and remove 'option env='Masahiro Yamada19-154/+343
To get access to environment variables, Kconfig needs to define a symbol using "option env=" syntax. It is tedious to add a symbol entry for each environment variable given that we need to define much more such as 'CC', 'AS', 'srctree' etc. to evaluate the compiler capability in Kconfig. Adding '$' for symbol references is grammatically inconsistent. Looking at the code, the symbols prefixed with 'S' are expanded by: - conf_expand_value() This is used to expand 'arch/$ARCH/defconfig' and 'defconfig_list' - sym_expand_string_value() This is used to expand strings in 'source' and 'mainmenu' All of them are fixed values independent of user configuration. So, they can be changed into the direct expansion instead of symbols. This change makes the code much cleaner. The bounce symbols 'SRCARCH', 'ARCH', 'SUBARCH', 'KERNELVERSION' are gone. sym_init() hard-coding 'UNAME_RELEASE' is also gone. 'UNAME_RELEASE' should be replaced with an environment variable. ARCH_DEFCONFIG is a normal symbol, so it should be simply referenced without '$' prefix. The new syntax is addicted by Make. The variable reference needs parentheses, like $(FOO), but you can omit them for single-letter variables, like $F. Yet, in Makefiles, people tend to use the parenthetical form for consistency / clarification. At this moment, only the environment variable is supported, but I will extend the concept of 'variable' later on. The variables are expanded in the lexer so we can simplify the token handling on the parser side. For example, the following code works. [Example code] config MY_TOOLCHAIN_LIST string default "My tools: CC=$(CC), AS=$(AS), CPP=$(CPP)" [Result] $ make -s alldefconfig && tail -n 1 .config CONFIG_MY_TOOLCHAIN_LIST="My tools: CC=gcc, AS=as, CPP=gcc -E" Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2018-05-29kbuild: remove CONFIG_CROSS_COMPILE supportMasahiro Yamada2-12/+0
Kbuild provides a couple of ways to specify CROSS_COMPILE: [1] Command line [2] Environment [3] arch/*/Makefile (only some architectures) [4] CONFIG_CROSS_COMPILE [4] is problematic for the compiler capability tests in Kconfig. CONFIG_CROSS_COMPILE allows users to change the compiler prefix from 'make menuconfig', etc. It means, the compiler options would have to be all re-calculated everytime CONFIG_CROSS_COMPILE is changed. To avoid complexity and performance issues, I'd like to evaluate the shell commands statically, i.e. only parsing Kconfig files. I guess the majority is [1] or [2]. Currently, there are only 5 defconfig files that specify CONFIG_CROSS_COMPILE. arch/arm/configs/lpc18xx_defconfig arch/hexagon/configs/comet_defconfig arch/nds32/configs/defconfig arch/openrisc/configs/or1ksim_defconfig arch/openrisc/configs/simple_smp_defconfig Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2018-05-29kbuild: remove kbuild cacheMasahiro Yamada2-90/+16
The kbuild cache was introduced to remember the result of shell commands, some of which are expensive to compute, such as $(call cc-option,...). However, this turned out not so clever as I had first expected. Actually, it is problematic. For example, "$(CC) -print-file-name" is cached. If the compiler is updated, the stale search path causes build error, which is difficult to figure out. Another problem scenario is cache files could be touched while install targets are running under the root permission. We can patch them if desired, but the build infrastructure is getting uglier and uglier. Now, we are going to move compiler flag tests to the configuration phase. If this is completed, the result of compiler tests will be naturally cached in the .config file. We will not have performance issues of incremental building since this testing only happens at Kconfig time. To start this work with a cleaner code base, remove the kbuild cache first. Revert the following commits: Commit 9a234a2e3843 ("kbuild: create directory for make cache only when necessary") Commit e17c400ae194 ("kbuild: shrink .cache.mk when it exceeds 1000 lines") Commit 4e56207130ed ("kbuild: Cache a few more calls to the compiler") Commit 3298b690b21c ("kbuild: Add a cache for generated variables") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2018-05-28kconfig: drop localization supportSam Ravnborg21-610/+258
The localization support is broken and appears unused. There is no google hits on the update-po-config target. And there is no recent (5 years) activity related to the localization. So lets just drop this as it is no longer used. Suggested-by: Ulf Magnusson <ulfalizer@gmail.com> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-05-28kconfig: refactor ncurses package checks for building mconf and nconfMasahiro Yamada5-128/+113
The mconf (or its infrastructure, lxdiaglog) depends on the ncurses. Move and rename check-lxdialog.sh to mconf-cfg.sh to make it work in the same way as for qconf and gconf. This commit fixes some more weirdnesses. The nconf also needs ncurses packages. HOSTLOADLIBES_nconf is set to the libraries needed for nconf, but the cflags is not explicitly set. Actually, nconf relies on the check-lxdialog.sh for the proper cflags: HOST_EXTRACFLAGS += $(shell $(CONFIG_SHELL) $(check-lxdialog) -ccflags) \ -DLOCALE The code above passes the ncurses flags to all objects, even for conf, qconf, gconf. Let's pass the ncurses flags only to mconf and nconf. Currently, the presence of ncurses is not checked for nconf. Let's show a prompt like the mconf case. According to Randy's report, the shell scripts still need to carry the fallback code in case the pkg-config fails to find the ncurses packages. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
2018-05-28kconfig: refactor GTK+ package checks for building gconfMasahiro Yamada2-34/+32
Refactor the package checks for gconf in the same way as for qconf. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
2018-05-28kconfig: refactor Qt package checks for building qconfMasahiro Yamada2-45/+53
Currently, the necessary package checks for building qconf is surrounded by ifeq ($(MAKECMDGOALS),xconfig) ... endif. Then, Make will restart when .tmp_qtcheck is generated. To simplify the Makefile, move the scripting to a separate file, and use filechk. The shell script is executed everytime xconfig is run, but it is not a costly script. In the old code, 'pkg-config --exists' only checked Qt5Core / QtCore, but the set of necessary packages should be checked. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
2018-05-28kbuild: do not display CHK for filechkMasahiro Yamada1-1/+0
filechk displays two short logs; CHK for creating a temporary file, and UPD for really updating the target. IMHO, the build system can be quiet when the target file has not been updated. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
2018-05-27Linux 4.17-rc7Linus Torvalds1-1/+1
2018-05-26ARM: Fix i2c-gpio GPIO descriptor tablesLinus Walleij10-11/+11
I used bad names in my clumsiness when rewriting many board files to use GPIO descriptors instead of platform data. A few had the platform_device ID set to -1 which would indeed give the device name "i2c-gpio". But several had it set to >=0 which gives the names "i2c-gpio.0", "i2c-gpio.1" ... Fix the offending instances in the ARM tree. Sorry for the mess. Fixes: b2e63555592f ("i2c: gpio: Convert to use descriptors") Cc: Wolfram Sang <wsa@the-dreams.de> Cc: Simon Guinot <simon.guinot@sequanux.org> Reported-by: Simon Guinot <simon.guinot@sequanux.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Olof Johansson <olof@lixom.net>
2018-05-26arm64: dts: hikey: Fix eMMC corruption regressionJohn Stultz1-1/+0
This patch is a partial revert of commit abd7d0972a19 ("arm64: dts: hikey: Enable HS200 mode on eMMC") which has been causing eMMC corruption on my HiKey board. Symptoms usually looked like: mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) ... mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc0: new HS200 MMC card at address 0001 ... dwmmc_k3 f723d000.dwmmc0: Unexpected command timeout, state 3 mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) print_req_error: I/O error, dev mmcblk0, sector 8810504 Aborting journal on device mmcblk0p10-8. mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) mmc_host mmc0: Bus speed (slot 0) = 24800000Hz (slot req 400000Hz, actual 400000HZ div = 31) mmc_host mmc0: Bus speed (slot 0) = 148800000Hz (slot req 150000000Hz, actual 148800000HZ div = 0) EXT4-fs error (device mmcblk0p10): ext4_journal_check_start:61: Detected aborted journal EXT4-fs (mmcblk0p10): Remounting filesystem read-only And quite often this would result in a disk that wouldn't properly boot even with older kernels. It seems the max-frequency property added by the above patch is causing the problem, so remove it. Cc: Ryan Grachek <ryan@edited.us> Cc: Wei Xu <xuwei5@hisilicon.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: YongQin Liu <yongqin.liu@linaro.org> Cc: Leo Yan <leo.yan@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Wei Xu <xuwei04@gmail.com>
2018-05-25kasan: fix memory hotplug during bootDavid Hildenbrand1-1/+1
Using module_init() is wrong. E.g. ACPI adds and onlines memory before our memory notifier gets registered. This makes sure that ACPI memory detected during boot up will not result in a kernel crash. Easily reproducible with QEMU, just specify a DIMM when starting up. Link: http://lkml.kernel.org/r/20180522100756.18478-3-david@redhat.com Fixes: 786a8959912e ("kasan: disable memory hotplug") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25kasan: free allocated shadow memory on MEM_CANCEL_ONLINEDavid Hildenbrand1-0/+1
We have to free memory again when we cancel onlining, otherwise a later onlining attempt will fail. Link: http://lkml.kernel.org/r/20180522100756.18478-2-david@redhat.com Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25checkpatch: fix macro argument precedence testJoe Perches1-1/+1
checkpatch's macro argument precedence test is broken so fix it. Link: http://lkml.kernel.org/r/5dd900e9197febc1995604bb33c23c136d8b33ce.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25init/main.c: include <linux/mem_encrypt.h>Mathieu Malaterre1-0/+1
In commit c7753208a94c ("x86, swiotlb: Add memory encryption support") a call to function `mem_encrypt_init' was added. Include prototype defined in header <linux/mem_encrypt.h> to prevent a warning reported during compilation with W=1: init/main.c:494:20: warning: no previous prototype for `mem_encrypt_init' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/20180522195533.31415-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Gargi Sharma <gs051095@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25kernel/sys.c: fix potential Spectre v1 issueGustavo A. R. Silva1-0/+5
`resource' can be controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) Fix this by sanitizing *resource* before using it to index current->signal->rlim Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25mm/memory_hotplug: fix leftover use of struct page during hotplugJonathan Cameron3-6/+9
The case of a new numa node got missed in avoiding using the node info from page_struct during hotplug. In this path we have a call to register_mem_sect_under_node (which allows us to specify it is hotplug so don't change the node), via link_mem_sections which unfortunately does not. Fix is to pass check_nid through link_mem_sections as well and disable it in the new numa node path. Note the bug only 'sometimes' manifests depending on what happens to be in the struct page structures - there are lots of them and it only needs to match one of them. The result of the bug is that (with a new memory only node) we never successfully call register_mem_sect_under_node so don't get the memory associated with the node in sysfs and meminfo for the node doesn't report it. It came up whilst testing some arm64 hotplug patches, but appears to be universal. Whilst I'm triggering it by removing then reinserting memory to a node with no other elements (thus making the node disappear then appear again), it appears it would happen on hotplugging memory where there was none before and it doesn't seem to be related the arm64 patches. These patches call __add_pages (where most of the issue was fixed by Pavel's patch). If there is a node at the time of the __add_pages call then all is well as it calls register_mem_sect_under_node from there with check_nid set to false. Without a node that function returns having not done the sysfs related stuff as there is no node to use. This is expected but it is the resulting path that fails... Exact path to the problem is as follows: mm/memory_hotplug.c: add_memory_resource() The node is not online so we enter the 'if (new_node)' twice, on the second such block there is a call to link_mem_sections which calls into drivers/node.c: link_mem_sections() which calls drivers/node.c: register_mem_sect_under_node() which calls get_nid_for_pfn and keeps trying until the output of that matches the expected node (passed all the way down from add_memory_resource) It is effectively the same fix as the one referred to in the fixes tag just in the code path for a new node where the comments point out we have to rerun the link creation because it will have failed in register_new_memory (as there was no node at the time). (actually that comment is wrong now as we don't have register_new_memory any more it got renamed to hotplug_memory_register in Pavel's patch). Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug") Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25proc: fix smaps and meminfo alignmentHugh Dickins1-5/+0
The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit numbers (commonly 0) are misaligned. Remove seq_put_decimal_ull_width()'s leftover optimization for single digits: it's wrong now that num_to_str() takes care of the width. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils Fixes: d1be35cb6f96 ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25mm: do not warn on offline nodes unless the specific node is explicitly requestedMichal Hocko1-1/+1
Oscar has noticed that we splat WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9 [...] CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G W E 4.17.0-rc5-next-20180517-1-default+ #66 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Workqueue: kacpi_hotplug acpi_hotplug_work_fn Call Trace: vmemmap_populate+0xf2/0x2ae sparse_mem_map_populate+0x28/0x35 sparse_add_one_section+0x4c/0x187 __add_pages+0xe7/0x1a0 add_pages+0x16/0x70 add_memory_resource+0xa3/0x1d0 add_memory+0xe4/0x110 acpi_memory_device_add+0x134/0x2e0 acpi_bus_attach+0xd9/0x190 acpi_bus_scan+0x37/0x70 acpi_device_hotplug+0x389/0x4e0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x146/0x340 worker_thread+0x47/0x3e0 kthread+0xf5/0x130 ret_from_fork+0x35/0x40 when adding memory to a node that is currently offline. The VM_WARN_ON is just too loud without a good reason. In this particular case we are doing alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order) so we do not insist on allocating from the given node (it is more a hint) so we can fall back to any other populated node and moreover we explicitly ask to not warn for the allocation failure. Soften the warning only to cases when somebody asks for the given node explicitly by __GFP_THISNODE. Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Oscar Salvador <osalvador@techadventures.net> Tested-by: Oscar Salvador <osalvador@techadventures.net> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25mm, memory_hotplug: make has_unmovable_pages more robustMichal Hocko1-6/+10
Oscar has reported: : Due to an unfortunate setting with movablecore, memblocks containing bootmem : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable. : So while trying to remove that memory, the system failed in do_migrate_range : and __offline_pages never returned. : : This can be reproduced by running : qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M : and movablecore=4G kernel command line : : linux kernel: BIOS-provided physical RAM map: : linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable : linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved : linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved : linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable : linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved : linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved : linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved : linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable : linux kernel: NX (Execute Disable) protection: active : linux kernel: SMBIOS 2.8 present. : linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org : linux kernel: Hypervisor detected: KVM : linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved : linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable : linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000 : : linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0 : linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1 : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff] : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff] : linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff] : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug : linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0 : linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0 : linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff] : linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff] : : zoneinfo shows that the zone movable is placed into both numa nodes: : Node 0, zone Movable : pages free 160140 : min 1823 : low 2278 : high 2733 : spanned 262144 : present 262144 : managed 245670 : Node 1, zone Movable : pages free 448427 : min 3827 : low 4783 : high 5739 : spanned 524288 : present 524288 : managed 515766 Note how only Node 0 has a hutplugable memory region which would rule it out from the early memblock allocations (most likely memmap). Node1 will surely contain memmaps on the same node and those would prevent offlining to succeed. So this is arguably a configuration issue. Although one could argue that we should be more clever and rule early allocations from the zone movable. This would be correct but probably not worth the effort considering what a hack movablecore is. Anyway, We could do better for those cases though. We rely on start_isolate_page_range resp. has_unmovable_pages to do their job. The first one isolates the whole range to be offlined so that we do not allocate from it anymore and the later makes sure we are not stumbling over non-migrateable pages. has_unmovable_pages is overly optimistic, however. It doesn't check all the pages if we are withing zone_movable because we rely that those pages will be always migrateable. As it turns out we are still not perfect there. While bootmem pages in zonemovable sound like a clear bug which should be fixed let's remove the optimization for now and warn if we encounter unmovable pages in zone_movable in the meantime. That should help for now at least. Btw. this wasn't a real problem until commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early") because we used to have a small number of retries and then failed. This turned out to be too fragile though. Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Oscar Salvador <osalvador@techadventures.net> Tested-by: Oscar Salvador <osalvador@techadventures.net> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25mm/kasan: don't vfree() nonexistent vm_areaAndrey Ryabinin1-2/+61
KASAN uses different routines to map shadow for hot added memory and memory obtained in boot process. Attempt to offline memory onlined by normal boot process leads to this: Trying to vfree() nonexistent vm area (000000005d3b34b9) WARNING: CPU: 2 PID: 13215 at mm/vmalloc.c:1525 __vunmap+0x147/0x190 Call Trace: kasan_mem_notifier+0xad/0xb9 notifier_call_chain+0x166/0x260 __blocking_notifier_call_chain+0xdb/0x140 __offline_pages+0x96a/0xb10 memory_subsys_offline+0x76/0xc0 device_offline+0xb8/0x120 store_mem_state+0xfa/0x120 kernfs_fop_write+0x1d5/0x320 __vfs_write+0xd4/0x530 vfs_write+0x105/0x340 SyS_write+0xb0/0x140 Obviously we can't call vfree() to free memory that wasn't allocated via vmalloc(). Use find_vm_area() to see if we can call vfree(). Unfortunately it's a bit tricky to properly unmap and free shadow allocated during boot, so we'll have to keep it. If memory will come online again that shadow will be reused. Matthew asked: how can you call vfree() on something that isn't a vmalloc address? vfree() is able to free any address returned by __vmalloc_node_range(). And __vmalloc_node_range() gives you any address you ask. It doesn't have to be an address in [VMALLOC_START, VMALLOC_END] range. That's also how the module_alloc()/module_memfree() works on architectures that have designated area for modules. [aryabinin@virtuozzo.com: improve comments] Link: http://lkml.kernel.org/r/dabee6ab-3a7a-51cd-3b86-5468718e0390@virtuozzo.com [akpm@linux-foundation.org: fix typos, reflow comment] Link: http://lkml.kernel.org/r/20180201163349.8700-1-aryabinin@virtuozzo.com Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Paul Menzel <pmenzel+linux-kasan-dev@molgen.mpg.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25MAINTAINERS: change hugetlbfs maintainer and update filesMike Kravetz1-1/+7
The current hugetlbfs maintainer has not been active for more than a few years. I have been been active in this area for more than two years and plan to remain active in the foreseeable future. Also, update the hugetlbfs entry to include linux-mm mail list and additional hugetlbfs related files. hugetlb.c and hugetlb.h are not 100% hugetlbfs, but a majority of their content is hugetlbfs related. Link: http://lkml.kernel.org/r/20180518225236.19079-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25ipc/shm: fix shmat() nil address after round-down when remappingDavidlohr Bueso1-2/+10
shmat()'s SHM_REMAP option forbids passing a nil address for; this is in fact the very first thing we check for. Andrea reported that for SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check, but we need to check again if the address was rounded down to nil. As of this patch, such cases will return -EINVAL. Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25Revert "ipc/shm: Fix shmat mmap nil-page protection"Davidlohr Bueso1-7/+2
Patch series "ipc/shm: shmat() fixes around nil-page". These patches fix two issues reported[1] a while back by Joe and Andrea around how shmat(2) behaves with nil-page. The first reverts a commit that it was incorrectly thought that mapping nil-page (address=0) was a no no with MAP_FIXED. This is not the case, with the exception of SHM_REMAP; which is address in the second patch. I chose two patches because it is easier to backport and it explicitly reverts bogus behaviour. Both patches ought to be in -stable and ltp testcases need updated (the added testcase around the cve can be modified to just test for SHM_RND|SHM_REMAP). [1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805 This patch (of 2): Commit 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection") worked on the idea that we should not be mapping as root addr=0 and MAP_FIXED. However, it was reported that this scenario is in fact valid, thus making the patch both bogus and breaks userspace as well. For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem initialization[1]. [1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347 Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection") Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Joe Lawrence <joe.lawrence@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25idr: fix invalid ptr dereference on item deleteMatthew Wilcox2-1/+10
If the radix tree underlying the IDR happens to be full and we attempt to remove an id which is larger than any id in the IDR, we will call __radix_tree_delete() with an uninitialised 'slot' pointer, at which point anything could happen. This was easiest to hit with a single entry at id 0 and attempting to remove a non-0 id, but it could have happened with 64 entries and attempting to remove an id >= 64. Roman said: The syzcaller test boils down to opening /dev/kvm, creating an eventfd, and calling a couple of KVM ioctls. None of this requires superuser. And the result is dereferencing an uninitialized pointer which is likely a crash. The specific path caught by syzbot is via KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are other user-triggerable paths, so cc:stable is probably justified. Matthew added: We have around 250 calls to idr_remove() in the kernel today. Many of them pass an ID which is embedded in the object they're removing, so they're safe. Picking a few likely candidates: drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl. drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar drivers/atm/nicstar.c could be taken down by a handcrafted packet Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com> Debugged-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"Changwei Ge1-10/+1
This reverts commit ba16ddfbeb9d ("ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"). In my testing, this patch introduces a problem that mkfs can't have slots more than 16 with 4k block size. And the original logic is safe actually with the situation it mentions so revert this commit. Attach test log: (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192 (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5 (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5 (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5 Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25mm: fix nr_rotate_swap leak in swapon() error caseOmar Sandoval1-1/+6
If swapon() fails after incrementing nr_rotate_swap, we don't decrement it and thus effectively leak it. Make sure we decrement it if we incremented it. Link: http://lkml.kernel.org/r/b6fe6b879f17fa68eee6cbd876f459f6e5e33495.1526491581.git.osandov@fb.com Fixes: 81a0298bdfab ("mm, swap: don't use VMA based swap readahead if HDD is used as swap") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25ibmvnic: Fix partial success login retriesThomas Falcon1-1/+6
In its current state, the driver will handle backing device login in a loop for a certain number of retries while the device returns a partial success, indicating that the driver may need to try again using a smaller number of resources. The variable it checks to continue retrying may change over the course of operations, resulting in reallocation of resources but exits without sending the login attempt. Guard against this by introducing a boolean variable that will retain the state indicating that the driver needs to reattempt login with backing device firmware. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25KVM: x86: fix #UD address of failed Hyper-V hypercallsRadim Krčmář2-16/+15
If the hypercall was called from userspace or real mode, KVM injects #UD and then advances RIP, so it looks like #UD was caused by the following instruction. This probably won't cause more than confusion, but could give an unexpected access to guest OS' instruction emulator. Also, refactor the code to count hv hypercalls that were handled by the virt userspace. Fixes: 6356ee0c9602 ("x86: Delay skip of emulated hypercall instruction") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-25selftests/net: Add missing config options for PMTU testsStefano Brivio1-0/+5
PMTU tests in pmtu.sh need support for VTI, VTI6 and dummy interfaces: add them to config file. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25mlx4_core: allocate ICM memory in page size chunksQing Huang1-7/+9
When a system is under memory presure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter slow path doing memory compact/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may get stuck for more than 120s easily even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However in order to support smaller ICM chunk size, we need to fix another issue in large size kcalloc allocations. E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers which can easily cause kcalloc to fail. The solution is to use kvzalloc to replace kcalloc which will fall back to vmalloc automatically if kmalloc fails. Signed-off-by: Qing Huang <qing.huang@oracle.com> Acked-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25sched, tracing: Fix trace_sched_pi_setprio() for deboostingSebastian Andrzej Siewior1-1/+3
Since the following commit: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") the sched_pi_setprio trace point shows the "newprio" during a deboost: |futex sched_pi_setprio: comm=futex_requeue_p pid"34 oldprio˜ newprio=3D98 |futex sched_switch: prev_comm=futex_requeue_p prev_pid"34 prev_prio=120 This patch open codes __rt_effective_prio() in the tracepoint as the 'newprio' to get the old behaviour back / the correct priority: |futex sched_pi_setprio: comm=futex_requeue_p pid"20 oldprio˜ newprio=3D120 |futex sched_switch: prev_comm=futex_requeue_p prev_pid"20 prev_prio=120 Peter suggested to open code the new priority so people using tracehook could get the deadline data out. Reported-by: Mansky Christian <man@keba.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") Link: http://lkml.kernel.org/r/20180524132647.gg6ziuogczdmjjzu@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25kthread: Allow kthread_park() on a parked kthreadPeter Zijlstra1-4/+2
The following commit: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") added a WARN() in the case where we call kthread_park() on an already parked thread, because the old code wasn't doing the right thing there and it wasn't at all clear that would happen. It turns out, this does in fact happen, so we have to deal with it. Instead of potentially returning early, also wait for the completion. This does however mean we have to use complete_all() and re-initialize the completion on re-use. Reported-by: LKP <lkp@01.org> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel test robot <lkp@intel.com> Cc: wfg@linux.intel.com Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25sched/topology: Clarify root domain(s) debug stringJuri Lelli1-1/+1
When scheduler debug is enabled, building scheduling domains outputs information about how the domains are laid out and to which root domain each CPU (or sets of CPUs) belongs, e.g.: CPU0 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 } CPU1 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 } [...] span: 0-5 (max cpu_capacity = 1024) The fact that latest line refers to CPUs 0-5 root domain doesn't however look immediately obvious to me: one might wonder why span 0-5 is reported "again". Make it more clear by adding "root domain" to it, as to end with the following: CPU0 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 } CPU1 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 } [...] root domain span: 0-5 (max cpu_capacity = 1024) Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180524152936.17611-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-24firmware: qcom: scm: Fix crash in qcom_scm_call_atomic1()Niklas Cassel1-4/+4
qcom_scm_call_atomic1() can crash with a NULL pointer dereference at qcom_scm_call_atomic1+0x30/0x48. disassembly of qcom_scm_call_atomic1(): ... <0xc08d73b0 <+12>: ldr r3, [r12] ... (no instruction explicitly modifies r12) 0xc08d73cc <+40>: smc 0 ... (no instruction explicitly modifies r12) 0xc08d73d4 <+48>: ldr r3, [r12] <- crashing instruction ... Since the first ldr is successful, and since r12 isn't explicitly modified by any instruction between the first and the second ldr, it must have been modified by the smc call, which is ok, since r12 is caller save according to the AAPCS. Add r12 to the clobber list so that the compiler knows that the callee potentially overwrites the value in r12. Clobber descriptions may not in any way overlap with an input or output operand. Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Andy Gross <andy.gross@linaro.org>
2018-05-24enic: set DMA mask to 47 bitGovindarajulu Varadarajan1-4/+4
In commit 624dbf55a359b ("driver/net: enic: Try DMA 64 first, then failover to DMA") DMA mask was changed from 40 bits to 64 bits. Hardware actually supports only 47 bits. Fixes: 624dbf55a359b ("driver/net: enic: Try DMA 64 first, then failover to DMA") Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24ppp: remove the PPPIOCDETACH ioctlEric Biggers3-29/+6
The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file before f_count has reached 0, which is fundamentally a bad idea. It does check 'f_count < 2', which excludes concurrent operations on the file since they would only be possible with a shared fd table, in which case each fdget() would take a file reference. However, it fails to account for the fact that even with 'f_count == 1' the file can still be linked into epoll instances. As reported by syzbot, this can trivially be used to cause a use-after-free. Yet, the only known user of PPPIOCDETACH is pppd versions older than ppp-2.4.2, which was released almost 15 years ago (November 2003). Also, PPPIOCDETACH apparently stopped working reliably at around the same time, when the f_count check was added to the kernel, e.g. see https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2' check makes PPPIOCDETACH only work in single-threaded applications; it always fails if called from a multithreaded application. All pppd versions released in the last 15 years just close() the file descriptor instead. Therefore, instead of hacking around this bug by exporting epoll internals to modules, and probably missing other related bugs, just remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave a stub in place that prints a one-time warning and returns EINVAL. Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24ipv4: remove warning in ip_recv_errorWillem de Bruijn1-2/+0
A precondition check in ip_recv_error triggered on an otherwise benign race. Remove the warning. The warning triggers when passing an ipv6 socket to this ipv4 error handling function. RaceFuzzer was able to trigger it due to a race in setsockopt IPV6_ADDRFORM. --- CPU0 do_ipv6_setsockopt sk->sk_socket->ops = &inet_dgram_ops; --- CPU1 sk->sk_prot->recvmsg udp_recvmsg ip_recv_error WARN_ON_ONCE(sk->sk_family == AF_INET6); --- CPU0 do_ipv6_setsockopt sk->sk_family = PF_INET; This socket option converts a v6 socket that is connected to a v4 peer to an v4 socket. It updates the socket on the fly, changing fields in sk as well as other structs. This is inherently non-atomic. It races with the lockless udp_recvmsg path. No other code makes an assumption that these fields are updated atomically. It is benign here, too, as ip_recv_error cares only about the protocol of the skbs enqueued on the error queue, for which sk_family is not a precise predictor (thanks to another isue with IPV6_ADDRFORM). Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr Fixes: 7ce875e5ecb8 ("ipv4: warn once on passing AF_INET6 socket to ip_recv_error") Reported-by: DaeRyong Jeong <threeearcat@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24net : sched: cls_api: deal with egdev path only if neededOr Gerlitz1-1/+1
When dealing with ingress rule on a netdev, if we did fine through the conventional path, there's no need to continue into the egdev route, and we can stop right there. Not doing so may cause a 2nd rule to be added by the cls api layer with the ingress being the egdev. For example, under sriov switchdev scheme, a user rule of VFR A --> VFR B will end up with two HW rules (1) VF A --> VF B and (2) uplink --> VF B Fixes: 208c0f4b5237 ('net: sched: use tc_setup_cb_call to call per-block callbacks') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24vhost: synchronize IOTLB message with dev cleanupJason Wang1-0/+3
DaeRyong Jeong reports a race between vhost_dev_cleanup() and vhost_process_iotlb_msg(): Thread interleaving: CPU0 (vhost_process_iotlb_msg) CPU1 (vhost_dev_cleanup) (In the case of both VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) ===== ===== vhost_umem_clean(dev->iotlb); if (!dev->iotlb) { ret = -EFAULT; break; } dev->iotlb = NULL; The reason is we don't synchronize between them, fixing by protecting vhost_process_iotlb_msg() with dev mutex. Reported-by: DaeRyong Jeong <threeearcat@gmail.com> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24packet: fix reserve calculationWillem de Bruijn1-1/+1
Commit b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation") ensures that packet_snd always starts writing the link layer header in reserved headroom allocated for this purpose. This is needed because packets may be shorter than hard_header_len, in which case the space up to hard_header_len may be zeroed. But that necessary padding is not accounted for in skb->len. The fix, however, is buggy. It calls skb_push, which grows skb->len when moving skb->data back. But in this case packet length should not change. Instead, call skb_reserve, which moves both skb->data and skb->tail back, without changing length. Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation") Reported-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>