Age | Commit message (Collapse) | Author | Files | Lines |
|
No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
as we can do that right after retrieving it from userspace. Also move the
KIOCB_KEY definition to aio.c as it is an internal value not used by any
other place in the kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on. Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events. But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reason left to keep the
multiple waitqueues, and if there are any ->poll is still around, the
driver just won't support aio poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use straightline code with failure handling gotos instead of a lot
of nested conditionals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
No users outside of select.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().
Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).
Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx(). As the result, we could get ctx freed right under
lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.
Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs(). That does
INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.
In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().
Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get(). Sure, we'd increment the counter before
ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.
The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.
Cc: stable@kernel.org
Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
open file, unlink it, then use ioctl(2) to make it immutable or
append only. Now close it and watch the blocks *not* freed...
Immutable/append-only checks belong in ->setattr().
Note: the bug is old and backport to anything prior to 737f2e93b972
("ext2: convert to use the new truncate convention") will need
these checks lifted into ext2_setattr().
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.
Some vfs_mkdir() callers forget about that possibility...
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.
Some vfs_mkdir() callers forget about that possibility...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
new_sb is left uninitialized in case of early failures in kernfs_mount_ns(),
and while IS_ERR(root) is true in all such cases, using IS_ERR(root) || !new_sb
is not a solution - IS_ERR(root) is true in some cases when new_sb is true.
Make sure new_sb is initialized (and matches the reality) in all cases and
fix the condition for dropping kobj reference - we want it done precisely
in those situations where the reference has not been transferred into a new
super_block instance.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
make sure that info->node is initialized early, so that kernfs_kill_sb()
can list_del() it safely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
There's an extra C here...
Fixes: 99c18ce580c6 ("cramfs: direct memory access support")
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
RTFS(Documentation/filesystems/nfs/Exporting) if you try to make
something exportable.
Fixes: ac632f5b6301 "befs: add NFS export support"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Making something exportable takes more than providing ->s_export_ops.
In particular, ->lookup() *MUST* use d_splice_alias() instead of
d_add().
Reading Documentation/filesystems/nfs/Exporting would've been a good idea;
as it is, exporting AFFS is badly (and exploitably) broken.
Partially-Fixes: ed4433d72394 "fs/affs: make affs exportable"
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
we unlock the directory hash too early - if we are looking at secondary
link and primary (in another directory) gets removed just as we unlock,
we could have the old primary moved in place of the secondary, leaving
us to look into freed entry (and leaving our dentry with ->d_fsdata
pointing to a freed entry).
Cc: stable@vger.kernel.org # 2.4.4+
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
"VFS: don't keep disconnected dentries on d_anon" had a non-trivial
side-effect - d_unhashed() now returns true for those dentries,
making d_find_alias() skip them altogether. For most of its callers
that's fine - we really want a connected alias there. However,
there is a codepath where we relied upon picking such aliases
if nothing else could be found - selinux delayed initialization
of contexts for inodes on already mounted filesystems used to
rely upon that.
Cc: stable@kernel.org # f1ee616214cb "VFS: don't keep disconnected dentries on d_anon"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land. It produces
an oops down this path during the failed mount:
radix_tree_gang_lookup_tag+0xc4/0x130
xfs_perag_get_tag+0x37/0xf0
xfs_reclaim_inodes_count+0x32/0x40
xfs_fs_nr_cached_objects+0x11/0x20
super_cache_count+0x35/0xc0
shrink_slab.part.66+0xb1/0x370
shrink_node+0x7e/0x1a0
try_to_free_pages+0x199/0x470
__alloc_pages_slowpath+0x3a1/0xd20
__alloc_pages_nodemask+0x1c3/0x200
cache_grow_begin+0x20b/0x2e0
fallback_alloc+0x160/0x200
kmem_cache_alloc+0x111/0x4e0
The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.
To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.
Cc: stable@kernel.org
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.
Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.
Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Make n signed to avoid leaking the pages array if __pipe_get_pages()
fails to allocate any pages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It returns -EFAULT and happens to be a helper for pipe_get_pages()
whose return type is ssize_t.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This is the io_getevents equivalent of ppoll/pselect and allows to
properly mix signals and aio completions (especially with IOCB_CMD_POLL)
and atomically executes the following sequence:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure. It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Simple workqueue offload for now, but prepared for adding a real aio_fsync
method if the need arises. Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Don't reference the kiocb structure from the common aio code, and move
any use of it into helper specific to the read/write path. This is in
preparation for aio_poll support that wants to use the space for different
fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If we release the lockdep write protection token before calling into
->write_iter and thus never access the file pointer after an -EIOCBQUEUED
return from ->write_iter or ->read_iter we don't need this extra
reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Instead of handcoded non-null checks always initialize ki_list to an
empty list and use list_empty / list_empty_careful on it. While we're
at it also error out on a double call to kiocb_set_cancel_fn instead
of ignoring it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
These days we don't treat sync iocbs special in the aio completion code as
they never use it. Remove the old comment and BUG_ON given that the
current definition of is_sync_kiocb makes it impossible to hit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The page size is in no way related to the aio code, and printing it in
the (debug) dmesg at every boot serves no purpose.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
|
|
The comment claims that this helper will try not to loose bits, but for
64bit long it looses the high bits before hashing 64bit long into 32bit
int. Use the helper hash_long() to do the right thing for 64bit long.
For 32bit long, there is no change.
All the callers of end_name_hash() either assign the result to
qstr->hash, which is u32 or return the result as an int value (e.g.
full_name_hash()). Change the helper return type to int to conform to
its users.
[ It took me a while to apply this, because my initial reaction to it
was - incorrectly - that it could make for slower code.
After having looked more at it, I take back all my complaints about
the patch, Amir was right and I was mis-reading things or just being
stupid.
I also don't worry too much about the possible performance impact of
this on 64-bit, since most architectures that actually care about
performance end up not using this very much (the dcache code is the
most performance-critical, but the word-at-a-time case uses its own
hashing anyway).
So this ends up being mostly used for filesystems that do their own
degraded hashing (usually because they want a case-insensitive
comparison function).
A _tiny_ worry remains, in that not everybody uses DCACHE_WORD_ACCESS,
and then this potentially makes things more expensive on 64-bit
architectures with slow or lacking multipliers even for the normal
case.
That said, realistically the only such architecture I can think of is
PA-RISC. Nobody really cares about performance on that, it's more of a
"look ma, I've got warts^W an odd machine" platform.
So the patch is fine, and all my initial worries were just misplaced
from not looking at this properly. - Linus ]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The AFFS filesystem is still in use by m68k community (Link #2), but as
there was no code activity and no maintainer, the filesystem appeared on
the list of candidates for staging/removal (Link #1).
I volunteer to act as a maintainer of AFFS to collect any fixes that
might show up and to guard fs/affs/ against another spring cleaning.
Link: https://lkml.kernel.org/r/20180425154602.GA8546@bombadil.infradead.org
Link: https://lkml.kernel.org/r/1613268.lKBQxPXt8J@merkaba
CC: Martin Steigerwald <martin@lichtvoll.de>
CC: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move DISABLE_EXITS KVM capability bits to the UAPI just like the rest of
capabilities.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Currently, KVM flushes the TLB after a change to the APIC access page
address or the APIC mode when EPT mode is enabled. However, even in
shadow paging mode, a TLB flush is needed if VPIDs are being used, as
specified in the Intel SDM Section 29.4.5.
So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will
flush if either EPT or VPIDs are in use.
Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
32-bit user code that uses int $80 doesn't care about r8-r11. There is,
however, some 64-bit user code that intentionally uses int $0x80 to invoke
32-bit system calls. From what I've seen, basically all such code assumes
that r8-r15 are all preserved, but the kernel clobbers r8-r11. Since I
doubt that there's any code that depends on int $0x80 zeroing r8-r11,
change the kernel to preserve them.
I suspect that very little user code is broken by the old clobber, since
r8-r11 are only rarely allocated by gcc, and they're clobbered by function
calls, so they only way we'd see a problem is if the same function that
invokes int $0x80 also spills something important to one of these
registers.
The current behavior seems to date back to the historical commit
"[PATCH] x86-64 merge for 2.6.4". Before that, all regs were
preserved. I can't find any explanation of why this change was made.
Update the test_syscall_vdso_32 testcase as well to verify the new
behavior, and it strengthens the test to make sure that the kernel doesn't
accidentally permute r8..r15.
Suggested-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/d4c4d9985fbe64f8c9e19291886453914b48caee.1523975710.git.luto@kernel.org
|
|
A bugfix broke the x32 shmid64_ds and msqid64_ds data structure layout
(as seen from user space) a few years ago: Originally, __BITS_PER_LONG
was defined as 64 on x32, so we did not have padding after the 64-bit
__kernel_time_t fields, After __BITS_PER_LONG got changed to 32,
applications would observe extra padding.
In other parts of the uapi headers we seem to have a mix of those
expecting either 32 or 64 on x32 applications, so we can't easily revert
the path that broke these two structures.
Instead, this patch decouples x32 from the other architectures and moves
it back into arch specific headers, partially reverting the even older
commit 73a2d096fdf2 ("x86: remove all now-duplicate header files").
It's not clear whether this ever made any difference, since at least
glibc carries its own (correct) copy of both of these header files,
so possibly no application has ever observed the definitions here.
Based on a suggestion from H.J. Lu, I tried out the tool from
https://github.com/hjl-tools/linux-header to find other such
bugs, which pointed out the same bug in statfs(), which also has
a separate (correct) copy in glibc.
Fixes: f4b4aae18288 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . J . Lu" <hjl.tools@gmail.com>
Cc: Jeffrey Walton <noloader@gmail.com>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.de
|
|
Xen PV domains cannot shut down and start a crash kernel. Instead,
the crashing kernel makes a SCHEDOP_shutdown hypercall with the
reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
arch/x86/xen/enlighten_pv.c.
A crash kernel reservation is merely a waste of RAM in this case. It
may also confuse users of kexec_load(2) and/or kexec_file_load(2).
When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
respectively, these syscalls return success, which is technically
correct, but the crash kexec image will never be actually used.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jean Delvare <jdelvare@suse.de>
Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
|
|
We found the I2C controller count register is unreliable sometimes,
that will cause I2C to lose data. Thus we can read the data count
from 'i2c_dev->count' instead of the I2C controller count register.
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
|
|
Add one flag to indicate if the i2c controller has been in suspend state,
which can prevent i2c accesses after i2c controller is suspended following
system suspend.
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
|
|
i2cdev_ioctl_rdwr() allocates i2c_msg.buf using memdup_user(), which
returns ZERO_SIZE_PTR if i2c_msg.len is zero.
Currently i2cdev_ioctl_rdwr() always dereferences the buf pointer in case
of I2C_M_RD | I2C_M_RECV_LEN transfer. That causes a kernel oops in
case of zero len.
Let's check the len against zero before dereferencing buf pointer.
This issue was triggered by syzkaller.
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
[wsa: use '< 1' instead of '!' for easier readability]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
|
|
Our out-of-line atomics are built with a special calling convention,
preventing pointless stack spilling, and allowing us to patch call sites
with ARMv8.1 atomic instructions.
Instrumentation inserted by the compiler may result in calls to
functions not following this special calling convention, resulting in
registers being unexpectedly clobbered, and various problems resulting
from this.
For example, if a kernel is built with KCOV and ARM64_LSE_ATOMICS, the
compiler inserts calls to __sanitizer_cov_trace_pc in the prologues of
the atomic functions. This has been observed to result in spurious
cmpxchg failures, leading to a hang early on in the boot process.
This patch avoids such issues by preventing instrumentation of our
out-of-line atomics.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
Add missing "altivec unavailable" interrupt injection helper
thus fixing the linker error below:
arch/powerpc/kvm/emulate_loadstore.o: In function `kvmppc_check_altivec_disabled':
arch/powerpc/kvm/emulate_loadstore.c: undefined reference to `.kvmppc_core_queue_vec_unavail'
Fixes: 09f984961c137c4b ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
smp_send_stop can lock up the IPI path for any subsequent calls,
because the receiving CPUs spin in their handler function. This
started becoming a problem with the addition of an smp_send_stop
call in the reboot path, because panics can reboot after doing
their own smp_send_stop.
The NMI IPI variant was fixed with ac61c11566 ("powerpc: Fix
smp_send_stop NMI IPI handling"), which leaves the smp_call_function
variant.
This is fixed by having smp_send_stop only ever do the
smp_call_function once. This is a bit less robust than the NMI IPI
fix, because any other call to smp_call_function after smp_send_stop
could deadlock, but that has always been the case, and it was not
been a problem before.
Fixes: f2748bdfe1573 ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
gpstate_timer_handler() uses synchronous smp_call to set the pstate
on the requested core. This causes the below hard lockup:
smp_call_function_single+0x110/0x180 (unreliable)
smp_call_function_any+0x180/0x250
gpstate_timer_handler+0x1e8/0x580
call_timer_fn+0x50/0x1c0
expire_timers+0x138/0x1f0
run_timer_softirq+0x1e8/0x270
__do_softirq+0x158/0x3e4
irq_exit+0xe8/0x120
timer_interrupt+0x9c/0xe0
decrementer_common+0x114/0x120
-- interrupt: 901 at doorbell_global_ipi+0x34/0x50
LR = arch_send_call_function_ipi_mask+0x120/0x130
arch_send_call_function_ipi_mask+0x4c/0x130
smp_call_function_many+0x340/0x450
pmdp_invalidate+0x98/0xe0
change_huge_pmd+0xe0/0x270
change_protection_range+0xb88/0xe40
mprotect_fixup+0x140/0x340
SyS_mprotect+0x1b4/0x350
system_call+0x58/0x6c
One way to avoid this is removing the smp-call. We can ensure that the
timer always runs on one of the policy-cpus. If the timer gets
migrated to a cpu outside the policy then re-queue it back on the
policy->cpus. This way we can get rid of the smp-call which was being
used to set the pstate on the policy->cpus.
Fixes: 7bc54b652f13 ("timers, cpufreq/powernv: Initialize the gpstate timer as pinned")
Cc: stable@vger.kernel.org # v4.8+
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Make kernel print the correct number of TLB entries on Intel Xeon Phi 7210
(and others)
Before:
[ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
After:
[ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16
The entries do exist in the official Intel SMD but the type column there is
incorrect (states "Cache" where it should read "TLB"), but the entries for
the values 0x6B, 0x6C and 0x6D are correctly described as 'Data TLB'.
Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
|
|
The block responsible of parsing the DT for the number of chip-select
lines uses an 'if/else if/else if' block. The content of the second and
third 'else if' conditions are:
1/ the actual condition to enter the sub-block and
2/ the operation to do in this sub-block.
[...]
else if (condition1_to_enter && action1() == failed)
raise_error();
else if (condition2_to_enter && action2() == failed)
raise_error();
[...]
In case of failure, the sub-block is entered and an error raised.
Otherwise, in case of success, the code would continue erroneously in
the next 'else if' statement because it did not failed (and did not
enter the first 'else if' sub-block).
The first 'else if' refers to legacy bindings while the second 'else if'
refers to new bindings. The second 'else if', which is entered
erroneously, checks for the 'reg' property, which, for old bindings,
does not mean anything because it would not be the number of CS
available, but the regular register map of almost any DT node. This
being said, the content of the 'reg' property being the register map
offset and length, it has '2' values, so the number of CS in this
situation is assumed to be '2'.
When running nand_scan_ident() with 2 CS, the core will check for an
array of chips. It will first issue a RESET and then a READ_ID. Of
course this will trigger two timeouts because there is no chip in front
of the second CS:
[ 1.367460] marvell-nfc f2720000.nand: Timeout on CMDD (NDSR: 0x00000080)
[ 1.474292] marvell-nfc f2720000.nand: Timeout on CMDD (NDSR: 0x00000280)
Indeed, this is harmless and the core will then assume there is only one
valid CS.
Fix the logic in the whole block by entering each sub-block just on the
'is legacy' condition, doing the action inside the sub-block. This way,
when the action succeeds, the whole block is left.
Furthermore, for both the old bindings and the new bindings the same
logic was applied to retrieve the number of CS lines:
using of_get_property() to get a size in bytes, converted in the actual
number of lines by dividing it per sizeof(u32) (4 bytes).
This is fine for the 'reg' property which is a list of the CS IDs but
not for the 'num-cs' property which is directly the value of the number
of CS.
Anyway, no existing DT uses another value than 'num-cs = <1>' and no
other value has ever been supported by the old driver (pxa3xx_nand.c).
Remove this condition and apply a number of 1 CS anyway, as already
described in the bindings.
Finally, the 'reg' property of a 'nand' node (with the new bindings)
gives the IDs of each CS line in use. marvell_nand.c driver first look
at the number of CS lines that are present in this property.
Better use of_property_count_elems_of_size() than dividing by 4 the size
of the number of bytes returned by of_get_property().
Fixes: 02f26ecf8c772 ("mtd: nand: add reworked Marvell NAND controller driver")
Cc: stable@vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Tested-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
|
|
It's possible for userspace to control n. Sanitize n when using it as an
array index.
Note that while it appears that n must be bound to the interval [0,3]
due to the way it is extracted from addr, we cannot guarantee that
compiler transformations (and/or future refactoring) will ensure this is
the case, and given this is a slow path it's better to always perform
the masking.
Found by smatch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
It's possible for userspace to control intid. Sanitize intid when using
it as an array index.
At the same time, sort the includes when adding <linux/nospec.h>.
Found by smatch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
It's possible for userspace to control idx. Sanitize idx when using it
as an array index.
Found by smatch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|