|
The function SMCCC_ARCH_WORKAROUND_1 was introduced as part of SMC
V1.1 Calling Convention to mitigate CVE-2017-5715. This patch uses
the standard call SMCCC_ARCH_WORKAROUND_1 for Falkor chips instead
of Silicon provider service ID 0xC2001700.
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
[maz: reworked errata framework integration]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Commit 3c8ba0d61d04 ("kernel.h: Retain constant expression output for
max()/min()") rewrote our min/max macros to be very clever, but in the
meantime resurrected a variable name shadow issue that we had
previously fixed in commit 589a9785ee3a ("min/max: remove sparse
warnings when they're nested").
That commit talks about the sparse warnings that this shadowing causes,
which we ignored as just a minor annoyance. But it turns out that the
sparse warning is the least of our problems. We actually have a real
bug due to the shadowing through the interaction with "min_not_zero()",
which ends up doing
min(__x, __y)
internally, and then the new declaration of "__x" and "__y" as new
variables in __cmp_once() results in a complete mess of an expression,
and "min_not_zero()" doesn't work at all.
For some odd reason, this only ever caused (reported) problems on s390,
even though it is a generic issue and most of the (obviously successful)
testing of the problematic commit had happened on other architectures.
Quoting Sebastian Ott:
"What happened is that the bio build by the partition detection code
was attempted to be split by the block layer because the block queue
had a max_sector setting of 0. blk_queue_max_hw_sectors uses
min_not_zero."
So re-introduce the use of __UNIQUE_ID() to make sure that the min/max
macros do not have these kinds of clashes.
[ That said, __UNIQUE_ID() itself has several issues that make it less
than wonderful.
In particular, the "uniqueness" has a fallback on the line number,
which means that it's not actually unique in more complex cases if you
don't build with gcc or clang (which have working unique counters that
aren't tied to line numbers).
That historical broken fallback also means that we have that pointless
"prefix" argument that doesn't actually make much sense _except_ for
the known-broken case. Oh well. ]
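The clash and the unique-name fix can be sketched in user-space C. This is
a simplified illustration, not the kernel's actual definitions: every
SKETCH_* name below is hypothetical, and the sketch assumes a compiler that
provides GNU statement expressions, __typeof__ and __COUNTER__ (gcc or
clang).

```c
#include <assert.h>

/* Paste two tokens after macro-expanding them. */
#define SKETCH_PASTE_(a, b) a##b
#define SKETCH_PASTE(a, b) SKETCH_PASTE_(a, b)

/* Hypothetical stand-in for __UNIQUE_ID(): a fresh identifier per use. */
#define SKETCH_UNIQUE(prefix) SKETCH_PASTE(prefix, __COUNTER__)

/* Evaluate each argument once, into temporaries whose names are passed in. */
#define SKETCH_MIN_ONCE(x, y, tx, ty) ({        \
    __typeof__(x) tx = (x);                     \
    __typeof__(y) ty = (y);                     \
    tx < ty ? tx : ty; })

/* The temporaries get unique names, so arguments named __x or __y are safe. */
#define SKETCH_MIN(x, y) \
    SKETCH_MIN_ONCE(x, y, SKETCH_UNIQUE(sketch_x_), SKETCH_UNIQUE(sketch_y_))

/*
 * Same shape as described above: the outer macro declares __x and __y and
 * passes those very names to the inner min.  Had the inner macro also
 * declared fixed "__x" and "__y", each inner temporary would have been
 * initialised from itself rather than from the outer variable.
 */
#define SKETCH_MIN_NOT_ZERO(x, y) ({            \
    __typeof__(x) __x = (x);                    \
    __typeof__(y) __y = (y);                    \
    __x == 0 ? __y : ((__y == 0) ? __x : SKETCH_MIN(__x, __y)); })

static unsigned int sketch_min_not_zero(unsigned int a, unsigned int b)
{
    unsigned int r = SKETCH_MIN_NOT_ZERO(a, b);

    /* The result is zero only when both inputs are zero. */
    assert(r != 0 || (a == 0 && b == 0));
    return r;
}
```

Because the preprocessor expands each macro argument once before
substitution, every use of a temporary inside SKETCH_MIN_ONCE() sees the
same generated name.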
Fixes: 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()")
Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
memory-barriers.txt has been updated with the following requirement.
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Current writeX() and iowriteX() implementations on alpha are not
satisfying this requirement as the barrier is after the register write.
Move mb() in writeX() and iowriteX() functions to guarantee that HW
observes memory changes before performing register operations.
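The ordering described above can be sketched as follows. This is only an
illustration under stated assumptions: sketch_mb() and sketch_writel() are
hypothetical names, the compiler builtin stands in for the kernel's mb(),
and an ordinary variable stands in for an MMIO register.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's mb(): a full barrier via a compiler builtin. */
#define sketch_mb() __sync_synchronize()

/*
 * Barrier first, then the register store, so that earlier writes to
 * coherent memory are ordered before the store the device will observe.
 */
static inline void sketch_writel(uint32_t value, volatile uint32_t *reg)
{
    assert(reg != 0);
    sketch_mb();
    *reg = value;
}
```

With the barrier placed after the store instead, a device could act on the
register write before it can see the data that was meant to precede it.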
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matt Turner <mattst88@gmail.com>
|
|
Implement the CPU vulnerability show functions for meltdown, spectre_v1
and spectre_v2 on Alpha.
Tests on XP1000 (EV67/667MHz) and ES45 (EV68CB/1.25GHz) show them
to be vulnerable to Meltdown and Spectre V1. In the case of
Meltdown I saw a 1 to 2% success rate in reading bytes on the
XP1000 and 50 to 60% success rate on the ES45. (This compares to
99.97% success reported for Intel CPUs.) Report EV6 and later
CPUs as vulnerable.
Tests on PWS600au (EV56/600MHz) for Spectre V1 attack were
unsuccessful (though I did not try particularly hard) so mark EV4
through to EV56 as not vulnerable.
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Matt Turner <mattst88@gmail.com>
|
|
The RTC core is always calling rtc_valid_tm after the read_time callback.
It is not necessary to call it just before returning from the callback.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Matt Turner <mattst88@gmail.com>
|
|
The .set_mmss and .set_mmss64 ops are only called when the RTC is not
providing an implementation for the .set_time callback.
On alpha, .set_time is provided so .set_mmss64 is never called. Remove the
unused code.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Matt Turner <mattst88@gmail.com>
|
|
Joe Perches noted that we have a few source files that for some
inexplicable reason (read: I'm too lazy to even go look at the history)
are marked executable:
drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
drivers/net/ethernet/cadence/macb_ptp.c
A simple git command line to show executable C/asm/header files is this:
git ls-files -s '*.[chsS]' | grep '^100755'
and then you can fix them up with scripting by just feeding that output
into:
| cut -f2 | xargs chmod -x
and commit it.
Which is exactly what this commit does.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
MAINTAINERS is out of date for leaking_addresses.pl. There is now a tree on
kernel.org for development of this script. We have a second maintainer now,
thanks Tycho. Development of this script was started on the kernel-hardening
mailing list so let's keep it there.
Update maintainer details; Add mailing list, kernel.org hosted tree, and second
maintainer.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Sometimes files may be created by using output from printk. As the scan
traverses the directory tree we should parse each path name and check if
it is leaking an address.
Add check for leaking address on each path name.
Suggested-by: Tycho Andersen <tycho@tycho.ws>
Acked-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the subroutine may_leak_address() is checking its regex against
the Perl special variable $_, which is _fortunately_ being set correctly
in a loop before this subroutine is called. We have already declared a
variable, '$line', to hold this value, so we should use it.
Use $line in the regex match instead of the implicit $_.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
We have git now, we don't need a version number. This was originally
added because leaking_addresses.pl shamelessly (and mindlessly) copied
checkpatch.pl.
Remove version number from script.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
The pointers listed in /proc/1/syscall are user pointers, and negative
syscall args will show up like kernel addresses.
For example
/proc/31808/syscall: 0 0x3 0x55b107a38180 0x2000 0xffffffffffffffb0 \
0x55b107a302d0 0x55b107a38180 0x7fffa313b8e8 0x7ff098560d11
Skip parsing /proc/1/syscall
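A small worked example of why the fourth argument above looks like a kernel
address. The helper names are hypothetical, and the "top 16 bits set" test
is an assumed stand-in for a naive kernel-pointer check, not the script's
actual regular expression.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a raw 64-bit value as a signed syscall argument. */
static int64_t as_signed_arg(uint64_t raw)
{
    int64_t value = (int64_t)raw;

    /* The conversion only reinterprets the bits. */
    assert((uint64_t)value == raw);
    return value;
}

/* Naive test: true when the top 16 bits are all ones. */
static int looks_like_kernel_address(uint64_t raw)
{
    return (raw >> 48) == 0xffffu;
}
```

Here 0xffffffffffffffb0 is simply -80 when read as a signed value, yet it
passes the naive test, while the user pointer 0x55b107a38180 does not.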
Suggested-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
When the system is idle it is likely that most files under /proc/PID
will be identical for various processes. Scanning _all_ the PIDs under
/proc is unnecessary and implies that we are thoroughly scanning /proc.
This is _not_ the case because there may be ways userspace can trigger
creation of /proc files that leak addresses but were not present during
a scan. For these two reasons we should exclude all PID directories
under /proc except '1/'
Exclude all /proc/PID except /proc/1.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently we are repeatedly calling `uname -m`. This is causing the
script to take a long time to run (more than 10 seconds to parse
/proc/kallsyms). We can use Perl state variables to cache the result of
the first call to `uname -m`. With this change in place the script
scans the whole kernel in under a minute.
Cache machine architecture in state variable.
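The caching idea can be sketched in C as an analogue of a Perl state
variable. The assumptions here: a function-local static replaces the state
variable, uname(2) replaces the `uname -m` command, and cached_machine()
and uname_calls are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

static int uname_calls;   /* counts real lookups, for illustration only */

/* Return the machine string, querying the system only on the first call. */
static const char *cached_machine(void)
{
    static struct utsname info;
    static int cached;

    if (!cached) {
        int rc = uname(&info);

        assert(rc == 0);
        (void)rc;
        uname_calls++;
        cached = 1;
    }
    return info.machine;
}
```

Every call after the first returns the stored string without asking the
system again, which is the saving the commit is after.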
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script has multiple configuration arrays. This is confusing,
as evidenced by the fact that a bunch of the entries are in the wrong place.
We can simplify the code by just having a single array for absolute
paths to skip and a single array for file names to skip wherever they
appear in the scanned directory tree. There are also currently multiple
subroutines to handle the different arrays; we can reduce these to a
single subroutine as well.
Simplify the path skipping code.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script parses binary files. Since we are scanning for
readable kernel addresses there is no need to parse binary files. We
can use Perl to check if a file is binary and skip parsing it if so.
Do not parse binary files.
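A rough sketch of a binary-file check, written in C for illustration. It is
deliberately simpler than Perl's own file test: it assumes that any NUL byte
in the sampled buffer marks the file as binary. looks_binary() is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified heuristic: a buffer containing a NUL byte is treated as binary. */
static int looks_binary(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, '\0', len) != NULL;
}
```

A caller would read the first block of a file into a buffer and skip the
file when this returns non-zero.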
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script only supports x86_64 and ppc64. It would be nice to be
able to scan 32-bit machines also. We can add support for 32-bit
architectures by modifying how we check for false positives, taking
advantage of the page offset used by the kernel, and using the correct
regular expression.
Support for 32-bit machines is enabled by the observation that the kernel
addresses on 32-bit machines are larger [in value] than the page offset.
We can use this to filter false positives when scanning the kernel for
leaking addresses.
Programmatic determination of the running architecture is not
immediately obvious (current 32-bit machines return various strings from
`uname -m`). We therefore provide a flag to enable scanning of 32-bit
kernels. Also we can check the kernel config file for the offset and if
not found default to 0xc0000000. A command line option to pass in the
page offset is also provided. We do automatically detect the architecture
if running on ix86.
Add support for 32-bit kernels. Add a command line option for page
offset.
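The false-positive filter described above amounts to a single comparison.
A minimal sketch in C, assuming that the page offset itself counts as a
kernel address; may_leak_address_32() and DEFAULT_PAGE_OFFSET_32 are
hypothetical names, and the default value is the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_PAGE_OFFSET_32 0xc0000000u   /* default quoted above */

/*
 * On a 32-bit kernel, values below the page offset cannot be kernel
 * addresses, so they are treated as false positives.
 */
static int may_leak_address_32(uint32_t value, uint32_t page_offset)
{
    assert(page_offset != 0);
    return value >= page_offset;
}
```

A kernel built with a different split would pass its own page offset,
either from the config file or from the command line option.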
Suggested-by: Kaiwan N Billimoria <kaiwan.billimoria@gmail.com>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently there is duplicate code when checking the architecture type.
We can remove the duplication by implementing a wrapper function
is_arch().
Implement and use wrapper function is_arch().
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script uses Perl to get the machine architecture. This can
be erroneous since Perl uses the architecture of the machine that Perl was
compiled on, not the architecture of the running machine. We should use
the system's `uname` command instead.
Use `uname -m` instead of Perl to get the machine architecture.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script only supports 4 page table levels because of the way
the kernel address regular expression is crafted. We can do better than
this. Using previously added support for kernel configuration options we
can get the number of page table levels defined by
CONFIG_PGTABLE_LEVELS. Using this value a correct regular expression can
be crafted. This only supports 5 page table levels on x86_64.
Add support for 5 page table levels on x86_64.
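How the number of page table levels changes what a kernel address looks
like can be sketched as a prefix test. This is an assumption-laden
illustration rather than the script's regular expression: it assumes that
with 4 levels the top 16 bits of a kernel address are all ones ("ffff...")
and that with 5 levels only the top 8 bits are ("ff..."). The function name
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Prefix test keyed on the value of CONFIG_PGTABLE_LEVELS (4 or 5). */
static int looks_like_kernel_address_x86_64(uint64_t value, int pgtable_levels)
{
    assert(pgtable_levels == 4 || pgtable_levels == 5);
    if (pgtable_levels == 5)
        return (value >> 56) == 0xffu;
    return (value >> 48) == 0xffffu;
}
```

A pattern built for 4 levels would miss addresses such as
0xff11000000000000, which is why the level count has to feed into the
match.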
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Features that rely on the ability to get kernel configuration options
are ready to be implemented in the script. In preparation for this we can
add support for kernel config options as a separate patch to ease
review.
Add support for locating and parsing kernel configuration file.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the script checks only the first and last addresses in the
vsyscall memory range. We can do better than this. When checking for false
positives against $match, we can convert $match to a hexadecimal value
then check if it lies within the range of vsyscall addresses.
Check whole range of vsyscall addresses when checking for false
positive.
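The range check can be sketched in C as follows. The bounds are an
assumption (the commonly cited x86_64 vsyscall page and the address one
page above it) and are not stated in the text above; is_vsyscall_address()
is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Assumed bounds of the x86_64 vsyscall region. */
#define VSYSCALL_START 0xffffffffff600000ULL
#define VSYSCALL_END   0xffffffffff601000ULL

/* Parse a hexadecimal match and test it against the whole range. */
static int is_vsyscall_address(const char *match)
{
    char *end;
    unsigned long long value;

    assert(match != NULL);
    errno = 0;
    value = strtoull(match, &end, 16);
    if (errno != 0 || end == match || *end != '\0')
        return 0;
    return value >= VSYSCALL_START && value <= VSYSCALL_END;
}
```

Comparing numerically catches every address inside the region, not just
the two end points that a pair of string matches would find.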
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
A number of the command line options to the script are dependent on the
option --input-raw being set. If we indent these options it makes
this dependency explicit.
Indent options dependent on --input-raw.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently help output includes command examples. These were cute when we
first started development of this script but are unnecessary.
Remove command examples.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
leaking_addresses.pl can be run with kptr_restrict==0 now, we don't need
the comment about setting kptr_restrict any more.
Remove comment suggesting setting kptr_restrict.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Currently the code uses a check against an undefined variable because the
variable is a subroutine name and is not evaluated.
Evaluate the subroutine; add parentheses to the subroutine name.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
|
|
Make the function static to avoid a
warning: no previous prototype for ‘vmx_enable_tdp’
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Switch simpad's CF implementation to use the gpiod APIs. The inverted
detection is handled using gpiolib's native inversion abilities.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
|
|
Convert shannon to use the generic CF socket support.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
|
|
Convert nanoengine to use the generic CF socket support.
Makefile fix from Arnd Bergmann <arnd@arndb.de>.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
|
|
Maintain a catalogue of allocated cookies so that cookie collisions can be
handled properly. For the moment, this just involves printing a warning
and returning a NULL cookie to the caller of fscache_acquire_cookie(), but
in future it might make sense to wait for the old cookie to finish being
cleaned up.
This requires the cookie key to be stored attached to the cookie so that we
still have the key available if the netfs relinquishes the cookie. This is
done by an earlier patch.
The catalogue also renders redundant fscache_netfs_list (used for checking
for duplicates), so that can be removed.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
|
|
Pass the object size in to fscache_acquire_cookie() and
fscache_write_page() rather than the netfs providing a callback by which it
can be received. This makes it easier to update the size of the object
when a new page is written that extends the object.
The current object size is also passed by fscache to the check_aux
function, obviating the need to store it in the aux data.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
|
|
I got "oom_reaper: unable to reap pid:" messages when the victim thread
was blocked inside free_pgtables() (which occurred after returning from
unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
exit_mmap() already set MMF_OOM_SKIP.
Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: unable to reap pid:7558 (a.out)
a.out D13272 7558 6931 0x00100084
Call Trace:
schedule+0x2d/0x80
rwsem_down_write_failed+0x2bb/0x440
call_rwsem_down_write_failed+0x13/0x20
down_write+0x49/0x60
unlink_file_vma+0x28/0x50
free_pgtables+0x36/0x100
exit_mmap+0xbb/0x180
mmput+0x50/0x110
copy_process.part.41+0xb61/0x1fe0
_do_fork+0xe6/0x560
do_syscall_64+0x74/0x230
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch fixes a corner case for KSM. When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (disabling ksmtuned if necessary)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans
The expected outcome is that all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile).
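The allocation steps of the reproducer can be put together as a short
Linux-only sketch. It covers only the three calls quoted in the list;
enabling KSM and transparent hugepages and waiting for the scans are left
to the environment. setup_ksm_candidate() and THP_SIZE are hypothetical
names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE ((size_t)1 << 21)   /* 2 MiB, as in the amd64 example */

/*
 * Allocate one THP-aligned, THP-sized buffer, fill it with identical bytes
 * and mark it mergeable.  Returns NULL if the allocation fails; *madvise_rc
 * receives madvise()'s return value, which is -1 if KSM is unavailable.
 */
static unsigned char *setup_ksm_candidate(int *madvise_rc)
{
    void *p = NULL;

    assert(madvise_rc != NULL);
    if (posix_memalign(&p, THP_SIZE, THP_SIZE) != 0)
        return NULL;
    memset(p, 42, THP_SIZE);
    *madvise_rc = madvise(p, THP_SIZE, MADV_MERGEABLE);
    return p;
}
```

After KSM has had time to scan, the counters under /sys/kernel/mm/ksm/
(for example pages_shared and pages_sharing) show whether the pages were
merged.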
The reason for this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page. This means that
split_huge_page will find a value of the reference counter too high and
will fail.
This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them. If so, the
hugepage is split safely. This means that the hugepage is not split if
not necessary.
Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This fixes a warning shown with a 32-bit phys_addr_t when compiling
with clang:
mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
to 'phys_addr_t' (aka 'unsigned int') changes value from
18446744073709551615 to 4294967295 [-Wconstant-conversion]
r->base : ULLONG_MAX;
^~~~~~~~~~
./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
#define ULLONG_MAX (~0ULL)
^~~~~
Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason. It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it. Also remove <linux/kmemleak.h> from source
files that do not use it.
This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
be good to run it through the 0day bot for other $ARCHes. I have
neither the horsepower nor the storage space for the other $ARCHes.
Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
for which patches are included here (in v2).
[ slab.h is the second most used header file after module.h; kernel.h is
right there with slab.h. There could be some minor error in the
counting due to some #includes having comments after them and I didn't
combine all of those. ]
[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au> [2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
At present the construct
if (VM_WARN(...))
will compile OK with CONFIG_DEBUG_VM=y and will fail with
CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been
special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is
disabled. So we cannot really use it in conditionals.
We considered changing things so that this construct works in both cases
but that might cause unwanted code generation with CONFIG_DEBUG_VM=n.
It is safer and simpler to make the build fail in both cases.
[akpm@linux-foundation.org: changelog]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
start_isolate_page_range() is used to set the migrate type of a set of
pageblocks to MIGRATE_ISOLATE while attempting to start a migration
operation. It assumes that only one thread is calling it for the
specified range. This routine is used by CMA, memory hotplug and
gigantic huge pages. Each of these users synchronizes access to the
range within their subsystem. However, two subsystems (CMA and gigantic
huge pages for example) could attempt operations on the same range. If
this happens, one thread may 'undo' the work another thread is doing.
This can result in pageblocks being incorrectly left marked as
MIGRATE_ISOLATE and therefore not available for page allocation.
What is ideally needed is a way to synchronize access to a set of
pageblocks that are undergoing isolation and migration. The only thing
we know about these pageblocks is that they are all in the same zone. A
per-node mutex is too coarse as we want to allow multiple operations on
different ranges within the same zone concurrently. Instead, we will
use the migration type of the pageblocks themselves as a form of
synchronization.
start_isolate_page_range sets the migration type on a set of page-
blocks going in order from the one associated with the smallest pfn to
the largest pfn. The zone lock is acquired to check and set the
migration type. When going through the list of pageblocks check if
MIGRATE_ISOLATE is already set. If so, this indicates another thread is
working on this pageblock. We know exactly which pageblocks we set, so
clean up by undoing those and return -EBUSY.
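The check-and-undo scheme can be sketched in user-space C. This is a
simplified model, not the kernel code: pageblocks are entries in an array,
the zone lock is omitted, and undoing restores a single "movable" type
rather than each block's original migration type. All sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_migratetype { SKETCH_MOVABLE, SKETCH_ISOLATE };

/*
 * Mark blocks[start..end) isolated, in ascending order.  If a block is
 * already isolated, another caller owns it: restore the blocks this call
 * changed and return -EBUSY.
 */
static int sketch_start_isolate(enum sketch_migratetype *blocks,
                                size_t start, size_t end)
{
    size_t i;

    assert(blocks != NULL && start <= end);
    for (i = start; i < end; i++) {
        if (blocks[i] == SKETCH_ISOLATE) {
            /* Undo only what this call set: blocks[start..i). */
            while (i > start)
                blocks[--i] = SKETCH_MOVABLE;
            return -EBUSY;
        }
        blocks[i] = SKETCH_ISOLATE;
    }
    return 0;
}
```

Because the caller knows exactly which blocks it marked, the undo step
never touches the block that the other thread isolated.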
This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of callers making use of
these interfaces. Update comments in alloc_contig_range to reflect this
new functionality.
Each CPU holds the associated zone lock to modify or examine the
migration type of a pageblock. And, it will only examine/update a
single pageblock per lock acquire/release cycle.
Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The plan for these patches is to introduce the typedef, initially just
as documentation ("These functions should return a VM_FAULT_ status").
We'll trickle the patches to individual drivers/filesystems in through
the maintainers, as far as possible. Then we'll change the typedef to
an unsigned int and break the compilation of any unconverted
drivers/filesystems.
vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three
newly added functions. Drivers/filesystems whose fault(), huge_fault(),
page_mkwrite() and pfn_mkwrite() return values get converted will need
them. These functions return the correct VM_FAULT_ code based on the
err value.
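The errno-to-fault-code translation such a wrapper performs can be sketched
as below. The mapping is an assumption made for illustration (-ENOMEM maps
to an out-of-memory code, -EBUSY is treated as success, any other error
maps to a bus error), and the SKETCH_* constants are stand-ins, not the
kernel's VM_FAULT_ definitions.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-ins; the kernel defines its own VM_FAULT_ values. */
typedef unsigned int sketch_vm_fault_t;
#define SKETCH_VM_FAULT_OOM    0x1u
#define SKETCH_VM_FAULT_SIGBUS 0x2u
#define SKETCH_VM_FAULT_NOPAGE 0x100u

/* Translate the errno-style result of an insert helper into a fault code. */
static sketch_vm_fault_t sketch_vmf_from_err(int err)
{
    assert(err <= 0);
    if (err == -ENOMEM)
        return SKETCH_VM_FAULT_OOM;
    if (err < 0 && err != -EBUSY)
        return SKETCH_VM_FAULT_SIGBUS;
    return SKETCH_VM_FAULT_NOPAGE;
}
```

Doing the translation in one place means a fault handler can return the
wrapper's result directly instead of open-coding the conversion.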
We've had bugs before where drivers returned -EFOO. And we have this
silly inefficiency where vm_insert_xxx() return an errno which (afaict)
every driver then converts into a VM_FAULT code. In many cases drivers
failed to return the correct VM_FAULT code even though vm_insert_xxx()
had failed. We have identified and cleaned up all those existing bugs and
silly inefficiencies in drivers/filesystems by adding these three new
inline wrappers. As mentioned above, we will trickle those patches to
individual drivers/filesystems in through maintainers after these three
wrapper functions are merged.
Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and
remove these inline wrappers, but these are a good intermediate step.
Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since the 2.6 kernel, the oom killer has been slightly biased away from
CAP_SYS_ADMIN processes by discounting some of their memory usage in
comparison to other processes.
This has always been implicit and nothing exactly relies on the
behavior.
Gaurav notices that __task_cred() can dereference a potentially freed
pointer if the task under consideration is exiting because a reference
to the task_struct is not held.
Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.
If any CAP_SYS_ADMIN process would like to be biased against, it is
always allowed to adjust /proc/pid/oom_score_adj.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Kswapd will not wake up if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.
This can be true if there is a lot of free memory available. For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction. If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.
When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim. This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
KASAN splats indicate that in some cases we free a live mm, then
continue to access it, with potentially disastrous results. This is
likely due to a mismatched mmdrop() somewhere in the kernel, but so far
the culprit remains elusive.
Let's have __mmdrop() verify that the mm isn't live for the current
task, similar to the existing check for init_mm. This way, we can catch
this class of issue earlier, and without requiring KASAN.
Currently, idle_task_exit() leaves active_mm stale after it switches to
init_mm. This isn't harmful, but will trigger the new assertions, so we
must adjust idle_task_exit() to update active_mm.
Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During slab reclaim for a memcg, shrink_slab iterates over all
registered shrinkers in the system, and tries to count and consume
objects related to the cgroup. Under memory pressure, this behaves
badly: I observe high system time and time spent in list_lru_count_one()
for many processes on a RHEL7 kernel.
This patch makes list_lru_node::memcg_lrus RCU protected, which allows
list_lru_count_one() to skip taking the spinlock.
With the patch, Shakeel Butt observes a significant change in the perf
profile. He says:
========================================================================
Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
VM and recording the trace with 'perf record -g -a'.
The trace without the patch:
+ 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
+ 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
+ 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
+ 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
+ 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
+ 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
+ 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
+ 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
+ 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes
With the patch:
+ 0.16% swapper [kernel.kallsyms] [k] default_idle
+ 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
+ 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
+ 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
+ 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
+ 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
+ 0.03% binary [kernel.kallsyms] [k] copy_page
========================================================================
Thanks Shakeel for the testing.
[ktkhai@virtuozzo.com: v2]
Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The bool enable_vma_readahead and swap_vma_readahead() are local to the
source and do not need to be in global scope, so make them static.
Cleans up sparse warnings:
mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Prior to commit d47992f86b30 ("mm: change invalidatepage prototype to
accept length"), an offset of 0 meant that the full page was being
invalidated. After that commit, we need to instead check the length.
Jan said:
:
: The only possible issue is that try_to_release_page() was called more
: often than necessary. Otherwise the issue is harmless but still it's good
: to have this fixed.
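As a minimal sketch of the difference (not the actual patch; the names
and the 4 KiB page size below are assumptions made for illustration),
compare the offset-based test with a length-based one:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for this sketch: 4 KiB pages. */
#define INVAL_SKETCH_PAGE_SIZE 4096u

/* Old-style test: offset 0 was taken to mean "whole page". */
static bool sketch_full_page_by_offset(unsigned int offset)
{
	return offset == 0;
}

/*
 * Length-based test: the whole page is invalidated only when the
 * invalidated range covers PAGE_SIZE bytes.
 */
static bool sketch_full_page_by_length(unsigned int length)
{
	return length == INVAL_SKETCH_PAGE_SIZE;
}
```

With offset 0 and length 512, the offset-based test reports a full-page
invalidation while the length-based test does not, which is the "called
more often than necessary" case described above.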
Link: http://lkml.kernel.org/r/x49fu5rtnzs.fsf@segfault.boston.devel.redhat.com
Fixes: d47992f86b307 ("mm: change invalidatepage prototype to accept length")
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The 'cold' parameter was removed from release_pages function by commit
c6f92f9fbe7d ("mm: remove cold parameter for release_pages").
Update the description to match the code.
Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The alloc_vm_area in nommu is a stub, but its description states it
allocates kernel address space. Remove the description to make the code
and the documentation agree.
Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
watermark instead, which makes more sense.
TEST
- I used a 1G zram device with the LZO compression back-end; the original
data set size was 444MB. Judging by the zsmalloc class stats, the
test ended up being pretty fair.
BASE ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 191482495 199831552 0 199831552 15634 0
zsmalloc classes
class  size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151    2448           0            0          1240     1240        744                3        0
168    2720           0            0          4200     4200       2800                2        0
190    3072           0            0         10100    10100       7575                3        0
202    3264           0            0           380      380        304                4        0
254    4096           0            0         10620    10620      10620                1        0
Total                 7           46        106982   106187      48787                         0
PATCHED ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 182579184 194248704 0 194248704 15628 0
zsmalloc classes
class  size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151    2448           0            0          1240     1240        744                3        0
168    2720           0            0          4200     4200       2800                2        0
190    3072           0            0         10100    10100       7575                3        0
202    3264           0            0          7180     7180       5744                4        0
254    4096           0            0          3820     3820       3820                1        0
Total                 8           45        106959   106193      47424                         0
As we can see, we reduced the number of objects stored in class-4096,
because a huge number of objects which we previously forcibly stored in
class-4096 are now stored in non-huge class-3264. This results in lower
memory consumption:
- zsmalloc now uses 47424 physical pages, fewer than the 48787 pages
zsmalloc used before.
- objects that we store in class-3264 share zspages. That's why overall
the number of pages that both class-4096 and class-3264 consumed went
down from 10924 to 9564.
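The 10924 and 9564 figures are the sums of the pages_used column for
class-3264 and class-4096 in the two tables. A small sketch that redoes
the arithmetic (the struct and function names are made up for this
example; only the numbers come from the stats above):

```c
#include <assert.h>
#include <stddef.h>

/* One row of interest from the zsmalloc class stats. */
struct sketch_class_pages {
	unsigned int size;		/* class size in bytes */
	unsigned long pages_used;	/* pages_used column */
};

/* pages_used for class-3264 and class-4096, copied from the tables. */
static const struct sketch_class_pages sketch_base[] = {
	{ 3264, 304 }, { 4096, 10620 },
};
static const struct sketch_class_pages sketch_patched[] = {
	{ 3264, 5744 }, { 4096, 3820 },
};

/* Sum pages_used over n rows. */
static unsigned long sketch_total_pages(const struct sketch_class_pages *rows,
					size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += rows[i].pages_used;
	return total;
}
```

Base: 304 + 10620 = 10924 pages. Patched: 5744 + 3820 = 9564 pages, so
these two classes together use 1360 fewer pages.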
[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.
ZRAM's max_zpage_size is a bad thing. It forces zsmalloc to store
normal objects as huge ones, which results in bigger zsmalloc memory
usage. Drop it and use the actual zsmalloc huge-class value when deciding
whether the object is huge or not.
This patch (of 2):
Not every object can share its zspage with other objects, e.g. when
the object is as big as a zspage or nearly as big as a zspage. For such
objects zsmalloc has a so-called huge class - every object which belongs
to the huge class consumes the entire zspage (which consists of a physical
page). On an x86_64 box with PAGE_SHIFT 12, the first non-huge class size
is 3264, so starting down from size 3264, objects can share page(-s) and
thus minimize memory wastage.
ZRAM, however, has its own statically defined watermark for huge
objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
object larger than this watermark (3072) as a PAGE_SIZE object, in other
words, to a huge class, while zsmalloc can keep some of those objects in
non-huge classes. This results in increased memory consumption.
zsmalloc knows better whether the object is huge or not. Introduce the
zs_huge_class_size() function, which tells if the given object can be
stored in one of the non-huge classes or not. This will let us drop
ZRAM's huge object watermark and fully rely on zsmalloc when we decide
if the object is huge.
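A minimal sketch of the two policies, assuming the x86_64 numbers
quoted above. It does not call zsmalloc; every name and constant here
is a hypothetical stand-in used only to show which sizes are affected:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumptions taken from the text: PAGE_SHIFT 12, x86_64. */
#define ZRAM_SKETCH_PAGE_SIZE		4096u
#define ZRAM_SKETCH_OLD_WATERMARK	(3 * ZRAM_SKETCH_PAGE_SIZE / 4)	/* 3072 */
#define ZRAM_SKETCH_LARGEST_NON_HUGE	3264u

/* ZRAM's static policy: anything above 3072 is stored as a huge object. */
static bool sketch_zram_watermark_is_huge(size_t comp_len)
{
	return comp_len > ZRAM_SKETCH_OLD_WATERMARK;
}

/* zsmalloc's view: only objects that fit no non-huge class are huge. */
static bool sketch_zsmalloc_is_huge(size_t comp_len)
{
	return comp_len > ZRAM_SKETCH_LARGEST_NON_HUGE;
}

/* Size the object is stored as once the huge decision is made. */
static size_t sketch_stored_size(size_t comp_len, bool huge)
{
	return huge ? ZRAM_SKETCH_PAGE_SIZE : comp_len;
}
```

For a 3200-byte compressed object the static watermark stores it as a
full 4096-byte page, while the zsmalloc-based decision keeps it at 3200
bytes in a non-huge class. Objects in the 3073..3264 range are the ones
whose treatment changes.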
[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|