wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2018-04-09	afs: Dump bad status record	David Howells	1	-0/+35
	Dump an AFS FileStatus record that is detected as invalid. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	afs: Implement @cell substitution handling	David Howells	3	-1/+89
	Implement @cell substitution handling such that if @cell is seen as a name in a dynamic root mount, then the name of the root cell for that network namespace will be substituted for @cell during lookup. The substitution of @cell for the current net namespace is set by writing the cell name to /proc/fs/afs/rootcell. The value can be obtained by reading the file. For example: # mount -t afs none /kafs -o dyn # echo grand.central.org >/proc/fs/afs/rootcell # ls /kafs/@cell archive/ cvs/ doc/ local/ project/ service/ software/ user/ www/ # cat /proc/fs/afs/rootcell grand.central.org Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	afs: Implement @sys substitution handling	David Howells	5	-2/+380
	Implement the AFS feature by which @sys at the end of a pathname component may be substituted for one of a list of values, typically naming the operating system. Up to 16 alternatives may be specified and these are tried in turn until one works. Each network namespace has[] a separate independent list. Upon creation of a new network namespace, the list of values is initialised[] to a single OpenAFS-compatible string representing arch type plus "_linux26". For example, on x86_64, the sysname is "amd64_linux26". [*] Or will, once network namespace support is finalised in kAFS. The list may be set by: # for i in foo bar linux-x86_64; do echo $i; done >/proc/fs/afs/sysname for which separate writes to the same fd are amalgamated and applied on close. The LF character may be used as a separator to specify multiple items in the same write() call. The list may be cleared by: # echo >/proc/fs/afs/sysname and read by: # cat /proc/fs/afs/sysname foo bar linux-x86_64 Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	afs: Prospectively look up extra files when doing a single lookup	David Howells	8	-63/+552
	When afs_lookup() is called, prospectively look up the next 50 uncached fids also from that same directory and cache the results, rather than just looking up the one file requested. This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency by fetching up to 50 file statuses at a time. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	afs: Don't over-increment the cell usage count when pinning it	David Howells	2	-3/+4
	AFS cells that are added or set as the workstation cell through /proc are pinned against removal by setting the AFS_CELL_FL_NO_GC flag on them and taking a ref. The ref should be only taken if the flag wasn't already set. Fix this by making it conditional. Without this an assertion failure will occur during module removal indicating that the refcount is too elevated. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	afs: Fix checker warnings	David Howells	11	-50/+35
	Fix warnings raised by checker, including: () Warnings raised by unequal comparison for the purposes of sorting, where the endianness doesn't matter: fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer () afs_set_cb_interest() is not actually used and can be removed. () afs_cell_gc_delay() should be provided with a sysctl. () afs_cell_destroy() needs to use rcu_access_pointer() to read cell->vl_addrs. () afs_init_fs_cursor() should be static. () struct afs_vnode::permit_cache needs to be marked __rcu. () afs_server_rcu() needs to use rcu_access_pointer(). () afs_destroy_server() should use rcu_access_pointer() on server->addresses as the server object is no longer accessible. () afs_find_server() casts __be16/__be32 values to int in order to directly compare them for the purpose of finding a match in a list, but is should also annotate the cast with __force to avoid checker warnings. () afs_check_permit() accesses vnode->permit_cache outside of the RCU readlock, though it doesn't then access the value; the extraneous access is deleted. False positives: () Conditional locking around the code in xdr_decode_AFSFetchStatus. This can be dealt with in a separate patch. fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block () Incorrect handling of seq-retry lock context balance: fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different lock contexts for basic block fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block Errors: () afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go round again if it successfully found the workstation cell. () Fix UUID decode in afs_deliver_cb_probe_uuid(). () afs_cache_permit() has a missing rcu_read_unlock() before one of the jumps to the someone_else_changed_it label. Move the unlock to after the label. () afs_vl_get_addrs_u() is using ntohl() rather than htonl() when encoding to XDR. (*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl() when decoding from XDR. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09	vfs: Remove the const from dir_context::actor	David Howells	1	-1/+1
	Remove the const marking from the actor function pointer in the dir_context struct. The const prevents the structure from being used as part of a kmalloc'd object as it makes the compiler require that the actor member be set at object initialisation time (or not at all), incuring something like the following error if you try and set it later: fs/afs/dir.c:556:20: error: assignment of read-only member 'actor' Marking the member const like this adds very little in the way of sanity checking as the type checking system is likely to provide sufficient - and if not, the kernel is very likely to oops repeatably in this case. Fixes: ac6614b76478 ("[readdir] constify ->actor") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-09	selinux: fix missing dput() before selinuxfs unmount	Stephen Smalley	1	-0/+1
	Commit 0619f0f5e36f ("selinux: wrap selinuxfs state") triggers a BUG when SELinux is runtime-disabled (i.e. systemd or equivalent disables SELinux before initial policy load via /sys/fs/selinux/disable based on /etc/selinux/config SELINUX=disabled). This does not manifest if SELinux is disabled via kernel command line argument or if SELinux is enabled (permissive or enforcing). Before: SELinux: Disabled at runtime. BUG: Dentry 000000006d77e5c7{i=17,n=null} still in use (1) [unmount of selinuxfs selinuxfs] After: SELinux: Disabled at runtime. Fixes: 0619f0f5e36f ("selinux: wrap selinuxfs state") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-09	Fix subtle macro variable shadowing in min_not_zero()	Linus Torvalds	1	-8/+9
	Commit 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()") rewrote our min/max macros to be very clever, but in the meantime resurrected a variable name shadow issue that we had had previously fixed in commit 589a9785ee3a ("min/max: remove sparse warnings when they're nested"). That commit talks about the sparse warnings that this shadowing causes, which we ignored as just a minor annoyance. But it turns out that the sparse warning is the least of our problems. We actually have a real bug due to the shadowing through the interaction with "min_not_zero()", which ends up doing min(__x, __y) internally, and then the new declaration of "__x" and "__y" as new variables in __cmp_once() results in a complete mess of an expression, and "min_not_zero()" doesn't work at all. For some odd reason, this only ever caused (reported) problems on s390, even though it is a generic issue and most of the (obviously successful) testing of the problematic commit had happened on other architectures. Quoting Sebastian Ott: "What happened is that the bio build by the partition detection code was attempted to be split by the block layer because the block queue had a max_sector setting of 0. blk_queue_max_hw_sectors uses min_not_zero." So re-introduce the use of __UNIQUE_ID() to make sure that the min/max macros do not have these kinds of clashes. [ That said, __UNIQUE_ID() itself has several issues that make it less than wonderful. In particular, the "uniqueness" has a fallback on the line number, which means that it's not actually unique in more complex cases if you don't build with gcc or clang (which have working unique counters that aren't tied to line numbers). That historical broken fallback also means that we have that pointless "prefix" argument that doesn't actually make much sense _except_ for the known-broken case. Oh well. ] Fixes: 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()") Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-08	getname_kernel() needs to make sure that ->name != ->iname in long case	Al Viro	1	-1/+2
	missed it in "kill struct filename.separate" several years ago. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-07	alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering	Sinan Kaya	1	-7/+7
	memory-barriers.txt has been updated with the following requirement. "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Current writeX() and iowriteX() implementations on alpha are not satisfying this requirement as the barrier is after the register write. Move mb() in writeX() and iowriteX() functions to guarantee that HW observes memory changes before performing register operations. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Matt Turner <mattst88@gmail.com>
2018-04-07	alpha: Implement CPU vulnerabilities sysfs functions.	Michael Cree	3	-1/+47
	Implement the CPU vulnerabilty show functions for meltdown, spectre_v1 and spectre_v2 on Alpha. Tests on XP1000 (EV67/667MHz) and ES45 (EV68CB/1.25GHz) show them to be vulnerable to Meltdown and Spectre V1. In the case of Meltdown I saw a 1 to 2% success rate in reading bytes on the XP1000 and 50 to 60% success rate on the ES45. (This compares to 99.97% success reported for Intel CPUs.) Report EV6 and later CPUs as vulnerable. Tests on PWS600au (EV56/600MHz) for Spectre V1 attack were unsuccessful (though I did not try particularly hard) so mark EV4 through to EV56 as not vulnerable. Signed-off-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Matt Turner <mattst88@gmail.com>
2018-04-07	alpha: rtc: stop validating rtc_time in .read_time	Alexandre Belloni	1	-1/+1
	The RTC core is always calling rtc_valid_tm after the read_time callback. It is not necessary to call it just before returning from the callback. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Matt Turner <mattst88@gmail.com>
2018-04-07	alpha: rtc: remove unused set_mmss ops	Alexandre Belloni	1	-99/+0
	The .set_mmss and .setmmss64 ops are only called when the RTC is not providing an implementation for the .set_time callback. On alpha, .set_time is provided so .set_mmss64 is never called. Remove the unused code. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Matt Turner <mattst88@gmail.com>
2018-04-07	treewide: fix up files incorrectly marked executable	Linus Torvalds	2	-0/+0
	Joe Perches noted that we have a few source files that for some inexplicable reason (read: I'm too lazy to even go look at the history) are marked executable: drivers/gpu/drm/amd/amdgpu/vce_v4_0.c drivers/net/ethernet/cadence/macb_ptp.c A simple git command line to show executable C/asm/header files is this: git ls-files -s '*.[chsS]' \| grep '^100755' and then you can fix them up with scripting by just feeding that output into: \| cut -f2 \| xargs chmod -x and commit it. Which is exactly what this commit does. Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-07	MAINTAINERS: Update LEAKING_ADDRESSES	Tobin C. Harding	1	-0/+3
	MAINTAINERS is out of date for leaking_addresses.pl. There is now a tree on kernel.org for development of this script. We have a second maintainer now, thanks Tycho. Development of this scripts was started on kernel-hardening mailing list so let's keep it there. Update maintainer details; Add mailing list, kernel.org hosted tree, and second maintainer. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: check if file name contains address	Tobin C. Harding	1	-0/+12
	Sometimes files may be created by using output from printk. As the scan traverses the directory tree we should parse each path name and check if it is leaking an address. Add check for leaking address on each path name. Suggested-by: Tycho Andersen <tycho@tycho.ws> Acked-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: explicitly name variable used in regex	Tobin C. Harding	1	-1/+1
	Currently sub routine may_leak_address() is checking regex against Perl special variable $_ which is _fortunately_ being set correctly in a loop before this sub routine is called. We already have declared a variable to hold this value '$line' we should use it. Use $line in regex match instead of implicit $_ Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: remove version number	Tobin C. Harding	1	-2/+0
	We have git now, we don't need a version number. This was originally added because leaking_addresses.pl shamelessly (and mindlessly) copied checkpatch.pl Remove version number from script. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: skip '/proc/1/syscall'	Tobin C. Harding	1	-0/+1
	The pointers listed in /proc/1/syscall are user pointers, and negative syscall args will show up like kernel addresses. For example /proc/31808/syscall: 0 0x3 0x55b107a38180 0x2000 0xffffffffffffffb0 \ 0x55b107a302d0 0x55b107a38180 0x7fffa313b8e8 0x7ff098560d11 Skip parsing /proc/1/syscall Suggested-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: skip all /proc/PID except /proc/1	Tobin C. Harding	1	-0/+12
	When the system is idle it is likely that most files under /proc/PID will be identical for various processes. Scanning _all_ the PIDs under /proc is unnecessary and implies that we are thoroughly scanning /proc. This is _not_ the case because there may be ways userspace can trigger creation of /proc files that leak addresses but were not present during a scan. For these two reasons we should exclude all PID directories under /proc except '1/' Exclude all /proc/PID except /proc/1. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: cache architecture name	Tobin C. Harding	1	-3/+5
	Currently we are repeatedly calling `uname -m`. This is causing the script to take a long time to run (more than 10 seconds to parse /proc/kallsyms). We can use Perl state variables to cache the result of the first call to `uname -m`. With this change in place the script scans the whole kernel in under a minute. Cache machine architecture in state variable. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: simplify path skipping	Tobin C. Harding	1	-61/+29
	Currently script has multiple configuration arrays. This is confusing, evident by the fact that a bunch of the entries are in the wrong place. We can simplify the code by just having a single array for absolute paths to skip and a single array for file names to skip wherever they appear in the scanned directory tree. There are also currently multiple subroutines to handle the different arrays, we can reduce these to a single subroutine also. Simplify the path skipping code. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: do not parse binary files	Tobin C. Harding	1	-0/+4
	Currently script parses binary files. Since we are scanning for readable kernel addresses there is no need to parse binary files. We can use Perl to check if file is binary and skip parsing it if so. Do not parse binary files. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: add 32-bit support	Tobin C. Harding	1	-11/+82
	Currently script only supports x86_64 and ppc64. It would be nice to be able to scan 32-bit machines also. We can add support for 32-bit architectures by modifying how we check for false positives, taking advantage of the page offset used by the kernel, and using the correct regular expression. Support for 32-bit machines is enabled by the observation that the kernel addresses on 32-bit machines are larger [in value] than the page offset. We can use this to filter false positives when scanning the kernel for leaking addresses. Programmatic determination of the running architecture is not immediately obvious (current 32-bit machines return various strings from `uname -m`). We therefore provide a flag to enable scanning of 32-bit kernels. Also we can check the kernel config file for the offset and if not found default to 0xc0000000. A command line option to parse in the page offset is also provided. We do automatically detect architecture if running on ix86. Add support for 32-bit kernels. Add a command line option for page offset. Suggested-by: Kaiwan N Billimoria <kaiwan.billimoria@gmail.com> Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: add is_arch() wrapper subroutine	Tobin C. Harding	1	-12/+14
	Currently there is duplicate code when checking the architecture type. We can remove the duplication by implementing a wrapper function is_arch(). Implement and use wrapper function is_arch(). Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: use system command to get arch	Tobin C. Harding	1	-6/+6
	Currently script uses Perl to get the machine architecture. This can be erroneous since Perl uses the architecture of the machine that Perl was compiled on not the architecture of the running machine. We should use the systems `uname` command instead. Use `uname -m` instead of Perl to get the machine architecture. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: add support for 5 page table levels	Tobin C. Harding	1	-7/+25
	Currently script only supports 4 page table levels because of the way the kernel address regular expression is crafted. We can do better than this. Using previously added support for kernel configuration options we can get the number of page table levels defined by CONFIG_PGTABLE_LEVELS. Using this value a correct regular expression can be crafted. This only supports 5 page tables on x86_64. Add support for 5 page table levels on x86_64. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: add support for kernel config file	Tobin C. Harding	1	-1/+65
	Features that rely on the ability to get kernel configuration options are ready to be implemented in script. In preparation for this we can add support for kernel config options as a separate patch to ease review. Add support for locating and parsing kernel configuration file. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: add range check for vsyscall memory	Tobin C. Harding	1	-6/+14
	Currently script checks only first and last address in the vsyscall memory range. We can do better than this. When checking for false positives against $match, we can convert $match to a hexadecimal value then check if it lies within the range of vsyscall addresses. Check whole range of vsyscall addresses when checking for false positive. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: indent dependant options	Tobin C. Harding	1	-8/+8
	A number of the command line options to script are dependant on the option --input-raw being set. If we indent these options it makes explicit this dependency. Indent options dependant on --input-raw. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: remove command examples	Tobin C. Harding	1	-11/+0
	Currently help output includes command examples. These were cute when we first started development of this script but are unnecessary. Remove command examples. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: remove mention of kptr_restrict	Tobin C. Harding	1	-3/+0
	leaking_addresses.pl can be run with kptr_restrict==0 now, we don't need the comment about setting kptr_restrict any more. Remove comment suggesting setting kptr_restrict. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-07	leaking_addresses: fix typo function not called	Tobin C. Harding	1	-1/+1
	Currently code uses a check against an undefined variable because the variable is a sub routine name and is not evaluated. Evaluate subroutine; add parenthesis to sub routine name. Signed-off-by: Tobin C. Harding <me@tobin.cc>
2018-04-06	pstore: fix crypto dependencies without compression	Tobias Regnery	1	-2/+2
	Commit 58eb5b670747 ("pstore: fix crypto dependencies") fixed up the crypto dependencies but missed the case when no compression is selected. With CONFIG_PSTORE=y, CONFIG_PSTORE_COMPRESS=n and CONFIG_CRYPTO=m we see the following link error: fs/pstore/platform.o: In function `pstore_register': (.text+0x1b1): undefined reference to `crypto_has_alg' (.text+0x205): undefined reference to `crypto_alloc_base' fs/pstore/platform.o: In function `pstore_unregister': (.text+0x3b0): undefined reference to `crypto_destroy_tfm' Fix this by checking at compile-time if CONFIG_PSTORE_COMPRESS is enabled. Fixes: 58eb5b670747 ("pstore: fix crypto dependencies") Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-04-06	make lookup_one_len() safe to use with directory locked shared	Al Viro	1	-1/+3
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-06	new helper: __lookup_slow()	Al Viro	1	-9/+18
	lookup_slow() sans locking/unlocking the directory Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-06	merge common parts of lookup_one_len{,_unlocked} into common helper	Al Viro	1	-56/+34
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-06	kvm: x86: fix a prototype warning	Peng Hao	1	-1/+1
	Make the function static to avoid a warning: no previous prototype for ‘vmx_enable_tdp’ Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-06	ARM: sa1100/simpad: switch simpad CF to use gpiod APIs	Russell King	2	-9/+14
	Switch simpad's CF implementation to use the gpiod APIs. The inverted detection is handled using gpiolib's native inversion abilities. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2018-04-06	ARM: sa1100/shannon: convert to generic CF sockets	Russell King	6	-109/+40
	Convert shannon to use the generic CF socket support. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2018-04-06	ARM: sa1100/nanoengine: convert to generic CF sockets	Russell King	5	-138/+23
	Convert nanoengine to use the generic CF socket support. Makefile fix from Arnd Bergmann <arnd@arndb.de>. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2018-04-06	fscache: Maintain a catalogue of allocated cookies	David Howells	5	-119/+279
	Maintain a catalogue of allocated cookies so that cookie collisions can be handled properly. For the moment, this just involves printing a warning and returning a NULL cookie to the caller of fscache_acquire_cookie(), but in future it might make sense to wait for the old cookie to finish being cleaned up. This requires the cookie key to be stored attached to the cookie so that we still have the key available if the netfs relinquishes the cookie. This is done by an earlier patch. The catalogue also renders redundant fscache_netfs_list (used for checking for duplicates), so that can be removed. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Anna Schumaker <anna.schumaker@netapp.com> Tested-by: Steve Dickson <steved@redhat.com>
2018-04-06	fscache: Pass object size in rather than calling back for it	David Howells	21	-166/+127
	Pass the object size in to fscache_acquire_cookie() and fscache_write_page() rather than the netfs providing a callback by which it can be received. This makes it easier to update the size of the object when a new page is written that extends the object. The current object size is also passed by fscache to the check_aux function, obviating the need to store it in the aux data. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Anna Schumaker <anna.schumaker@netapp.com> Tested-by: Steve Dickson <steved@redhat.com>
2018-04-05	mm,oom_reaper: check for MMF_OOM_SKIP before complaining	Tetsuo Handa	1	-1/+2
	I got "oom_reaper: unable to reap pid:" messages when the victim thread was blocked inside free_pgtables() (which occurred after returning from unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when exit_mmap() already set MMF_OOM_SKIP. Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB oom_reaper: unable to reap pid:7558 (a.out) a.out D13272 7558 6931 0x00100084 Call Trace: schedule+0x2d/0x80 rwsem_down_write_failed+0x2bb/0x440 call_rwsem_down_write_failed+0x13/0x20 down_write+0x49/0x60 unlink_file_vma+0x28/0x50 free_pgtables+0x36/0x100 exit_mmap+0xbb/0x180 mmput+0x50/0x110 copy_process.part.41+0xb61/0x1fe0 _do_fork+0xe6/0x560 do_syscall_64+0x74/0x230 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05	mm/ksm: fix interaction with THP	Claudio Imbrenda	1	-0/+28
	This patch fixes a corner case for KSM. When two pages belong or belonged to the same transparent hugepage, and they should be merged, KSM fails to split the page, and therefore no merging happens. This bug can be reproduced by: * making sure ksm is running (in case disabling ksmtuned) * enabling transparent hugepages * allocating a THP-aligned 1-THP-sized buffer e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21) * filling it with the same values e.g. memset(p, 42, 1<<21) * performing madvise to make it mergeable e.g. madvise(p, 1<<21, MADV_MERGEABLE) * waiting for KSM to perform a few scans The expected outcome is that the all the pages get merged (1 shared and the rest sharing); the actual outcome is that no pages get merged (1 unshared and the rest volatile) The reason of this behaviour is that we increase the reference count once for both pages we want to merge, but if they belong to the same hugepage (or compound page), the reference counter used in both cases is the one of the head of the compound page. This means that split_huge_page will find a value of the reference counter too high and will fail. This patch solves this problem by testing if the two pages to merge belong to the same hugepage when attempting to merge them. If so, the hugepage is split safely. This means that the hugepage is not split if not necessary. Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05	mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t	Stefan Agner	1	-2/+2
	This fixes a warning shown when phys_addr_t is 32-bit int when compiling with clang: mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long' to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion] r->base : ULLONG_MAX; ^~~~~~~~~~ ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX' #define ULLONG_MAX (~0ULL) ^~~~~ Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch Signed-off-by: Stefan Agner <stefan@agner.ch> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05	headers: untangle kmemleak.h from mm.h	Randy Dunlap	23	-14/+11
	Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious reason. It looks like it's only a convenience, so remove kmemleak.h from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that don't already #include it. Also remove <linux/kmemleak.h> from source files that do not use it. This is tested on i386 allmodconfig and x86_64 allmodconfig. It would be good to run it through the 0day bot for other $ARCHes. I have neither the horsepower nor the storage space for the other $ARCHes. Update: This patch has been extensively build-tested by both the 0day bot & kisskb/ozlabs build farms. Both of them reported 2 build failures for which patches are included here (in v2). [ slab.h is the second most used header file after module.h; kernel.h is right there with slab.h. There could be some minor error in the counting due to some #includes having comments after them and I didn't combine all of those. ] [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr] Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org Link: http://kisskb.ellerman.id.au/kisskb/head/13396/ Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures] Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures] Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: John Johansen <john.johansen@canonical.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05	include/linux/mmdebug.h: make VM_WARN* non-rvals	Michal Hocko	1	-4/+4
	At present the construct if (VM_WARN(...)) will compile OK with CONFIG_DEBUG_VM=y and will fail with CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is disabled. So we cannot really use it in conditionals. We considered changing things so that this construct works in both cases but that might cause unwanted code generation with CONFIG_DEBUG_VM=n. It is safer and simpler to make the build fail in both cases. [akpm@linux-foundation.org: changelog] Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05	mm/page_isolation.c: make start_isolate_page_range() fail if already isolated	Mike Kravetz	2	-5/+21
	start_isolate_page_range() is used to set the migrate type of a set of pageblocks to MIGRATE_ISOLATE while attempting to start a migration operation. It assumes that only one thread is calling it for the specified range. This routine is used by CMA, memory hotplug and gigantic huge pages. Each of these users synchronize access to the range within their subsystem. However, two subsystems (CMA and gigantic huge pages for example) could attempt operations on the same range. If this happens, one thread may 'undo' the work another thread is doing. This can result in pageblocks being incorrectly left marked as MIGRATE_ISOLATE and therefore not available for page allocation. What is ideally needed is a way to synchronize access to a set of pageblocks that are undergoing isolation and migration. The only thing we know about these pageblocks is that they are all in the same zone. A per-node mutex is too coarse as we want to allow multiple operations on different ranges within the same zone concurrently. Instead, we will use the migration type of the pageblocks themselves as a form of synchronization. start_isolate_page_range sets the migration type on a set of page- blocks going in order from the one associated with the smallest pfn to the largest pfn. The zone lock is acquired to check and set the migration type. When going through the list of pageblocks check if MIGRATE_ISOLATE is already set. If so, this indicates another thread is working on this pageblock. We know exactly which pageblocks we set, so clean up by undo those and return -EBUSY. This allows start_isolate_page_range to serve as a synchronization mechanism and will allow for more general use of callers making use of these interfaces. Update comments in alloc_contig_range to reflect this new functionality. Each CPU holds the associated zone lock to modify or examine the migration type of a pageblock. And, it will only examine/update a single pageblock per lock acquire/release cycle. Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>