|
Let's clean up empty page tables. Consider only page tables that fully
fall into the identity mapping and the vmemmap range.
As there are no valid accesses to vmem/vmemmap within non-populated ranges,
the single tlb flush at the end should be sufficient.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-7-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Let's synchronize all accesses to the 1:1 and vmemmap mappings. This will
be especially relevant when we want to clean up empty page tables that could
be shared by both. Avoid races when removing tables that might be just
about to get reused.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-6-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Clean up what we partially added in case vmemmap_populate() fails. For
vmem, this is already handled by vmem_add_mapping().
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-5-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Extend our shiny new modify_pagetable() to handle !direct (vmemmap)
mappings. Convert vmemmap_populate() and implement vmemmap_free().
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-4-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
We want to have only a single pagetable walker and reuse the same
functionality for vmemmap handling. Let's start by consolidating
vmem_add_range() and vmem_remove_range(), converting it into a
recursive implementation.
A recursive implementation makes it easier to expand individual cases
without harming readability. In addition, we minimize traversing the
whole hierarchy over and over again.
One change is that we don't unmap large PMDs/PUDs when they are not
completely covered by the request, something that should never happen
with direct mappings, unless one would be removing at a different
granularity than was added, which would already be broken.
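The shape of such a walker, as a toy user-space sketch (assumed two levels
and made-up table geometry, not the actual s390 code): one recursive routine
serves both add and remove, and each level only descends for the part of the
range its entry covers.

  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SHIFT     12
  #define PTRS_PER_TABLE 4
  #define LEVELS         2

  /* size of the range covered by one entry at a given table level */
  static unsigned long entry_size(int level)
  {
          unsigned long size = 1UL << PAGE_SHIFT;

          while (level--)
                  size *= PTRS_PER_TABLE;
          return size;
  }

  static void modify_range(unsigned long start, unsigned long end,
                           int level, bool add)
  {
          unsigned long step = entry_size(level);
          unsigned long addr, next;

          for (addr = start & ~(step - 1); addr < end; addr = next) {
                  next = (addr + step < end) ? addr + step : end;
                  if (level == 0) {
                          printf("%s page at 0x%lx\n", add ? "map" : "unmap", addr);
                          continue;
                  }
                  /* descend only into the overlapping part of this entry */
                  modify_range(addr > start ? addr : start, next, level - 1, add);
          }
  }

  int main(void)
  {
          modify_range(0x1000, 0x9000, LEVELS - 1, true);  /* "vmem_add_range"    */
          modify_range(0x1000, 0x9000, LEVELS - 1, false); /* "vmem_remove_range" */
          return 0;
  }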
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-3-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Let's match the name to vmem_remove_range().
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-2-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
This kernel feature is required for enabling BPF_KPROBE_OVERRIDE.
Define override_function_with_return() and regs_set_return_value()
functions, and fix compile errors in syscall_wrapper.h.
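The two helpers most likely boil down to something like the following sketch,
based on the s390 calling convention (%r2 carries the function return value,
%r14 the return address); the actual implementation may differ in detail:

  static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
  {
          regs->gprs[2] = rc;                     /* %r2 holds the return value */
  }

  void override_function_with_return(struct pt_regs *regs)
  {
          regs->psw.addr = regs->gprs[14];        /* emulate 'br %r14' */
  }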
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The existing comment talked about reading in the write part
and vice versa. While we are at it, make it clearer why restricting
the syscalls to MIO-capable devices is okay.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
It doesn't make sense to add zero shifted by 15. It's still zero.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The value returned by read_tod_clock() will overflow on September 17th 2042.
To avoid that system time jumps back, select CLOCKSOURCE_VALIDATE_LAST_CYCLE,
which enables a sanity check that prevents negative "delta" values.
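For context, the arithmetic behind the date: bit 51 of the 64-bit TOD clock
corresponds to 1 microsecond, so the register wraps after 2^52 microseconds,
i.e. about 4.5 * 10^9 seconds or roughly 142.7 years; counted from the TOD
epoch of 1900-01-01 that lands in September 2042.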
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Make use of CLOCKSOURCE_MASK instead of open-coding it.
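Typical usage in a clocksource definition (illustrative sketch, not
necessarily the exact hunk touched here):

  static struct clocksource clocksource_tod = {
          .name = "tod",
          /* ... */
          .mask = CLOCKSOURCE_MASK(64),   /* all 64 TOD bits are valid, instead of
                                             an open-coded 0xffffffffffffffffULL */
  };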
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
This is an s390 port of x86 commit 3dec541b2e63 ("bpf: Add support for BTF
pointers to x86 JIT").
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
This is an s390 port of commit 548acf19234d ("x86/mm: Expand the
exception table logic to allow new handling options"), which is needed
for implementing BPF_PROBE_MEM on s390.
The new handler field is made 64-bit in order to allow pointing from
dynamically allocated entries to handlers in kernel text. Unlike on x86,
NULL is used instead of ex_handler_default. This is because exception
tables are used by boot/text_dma.S, and it would be a pain to preserve
ex_handler_default.
The new infrastructure is ignored in early_pgm_check_handler, since
there is no pt_regs.
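Rough shape of the extended entry as described above (illustrative only; the
real s390 field names and layout may differ):

  struct exception_table_entry {
          int insn;       /* relative offset of the faulting instruction */
          int fixup;      /* relative offset of the fixup code */
          long handler;   /* 64-bit: can point from dynamically allocated (BPF)
                             entries to handlers in kernel text; NULL means
                             "apply the default fixup" */
  };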
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Replace three implementations with one, using the __stringify_in_c
macro conveniently "borrowed" from powerpc and microblaze.
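The borrowed macros essentially turn their arguments into a string usable
inside inline assembly (as on powerpc's asm-const.h; the constant below is
hypothetical and shown only for illustration):

  #ifdef __ASSEMBLY__
  #  define stringify_in_c(...)   __VA_ARGS__
  #else
  #  define __stringify_in_c(...) #__VA_ARGS__
  #  define stringify_in_c(...)   __stringify_in_c(__VA_ARGS__) " "
  #endif

  /* paste a C-visible constant into an inline-assembly string: */
  #define MY_MAGIC 0x42
  asm volatile(".long " stringify_in_c(MY_MAGIC));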
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Get rid of FORCE_MAX_ZONEORDER which limited allocations to order 8 (= 1MB)
and use the default, which allows for order 10 (= 4MB) allocations.
Given that s390 allowed less than the default, this caused memory
allocation problems that were more or less unique to s390 from time to time.
Note: this limit was originally introduced with commit 684de39bd795 ("[S390]
Fix IPL from NSS.") in order to support Named Saved Segments, which
could start/end at an arbitrary 1 megabyte boundary, and also before
support for sparsemem vmemmap was enabled.
Since NSS support is gone and sparsemem vmemmap support is available,
this limitation can go away.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Trimming to MAX_ORDER was originally done in order to avoid setting
HOLES_IN_ZONE, which in turn would enable a quite expensive
pfn_valid() check. pfn_valid() however only checks if a struct page
exists for a given pfn.
With sparsemem vmemmap there are always struct pages, since memmaps
are allocated for whole sections. Therefore remove the HOLES_IN_ZONE
comment and the trimming.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
For non-thinint devices in LPAR, qdio polls an idle Input Queue for a
little while to catch more work. But platform support for thinints has
been around practically _forever_ by now, so this micro-optimization is
seeing 0 actual use. Remove it to reduce the overall complexity of the
hot path.
In the meantime we also grew support for driver-level polling
(e.g. NAPI in qeth), so it's quite questionable how useful this would
actually be on current kernels.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The comment is inaccurate: qdio_inbound_q_moved() and/or its callers no
longer get confused by a count of 128 completed SBALs.
Scanning all 128 SBALs at once can improve IRQ reduction (as we now
place the ACK at the right spot), and reduce the amount of processing
needed to handle all completed SBALs.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Old code would only scan up to 127 SBALs at once. So the last statistics
bucket was set aside to count "discovered 127 SBALs with new work"
events.
But nowadays we allow scanning all 128 SBALs for Output Queues, and a
subsequent patch will introduce the same for Input Queues.
So fix up the accounting to use the last bucket only when all 128 SBALs
have been discovered with new work.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Helpful for debugging.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
With the removal of the critical section cleanup, we now enter the svc
interrupt handler with interrupts disabled.
Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Rework of the QCI crypto info and how it is used.
This is only an internal rework and does not affect the
way the ap bus acts with ap card and queue devices and
handles domains.
Tested on z15, z14, z12 (QCI support) and z196 (no QCI support).
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Commit 50be63450728 ("s390/mm: Convert bootmem to memblock") mentions
"The original bootmem allocator is getting replaced by memblock. To
cover the needs of the s390 kdump implementation the physical
memory list is used."
As we can now reference "physmem" managed in the memblock allocator after
init even without ARCH_KEEP_MEMBLOCK, and s390x no longer needs other
memblock metadata after boot (especially since the zcore memmap device that
used it got removed), we can stop setting ARCH_KEEP_MEMBLOCK.
With this change, we no longer create memblocks for standby/hotplugged
memory (added via add_memory()) and free up memblock metadata (except
physmem) after boot.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Philipp Rudo <prudo@linux.ibm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200701141830.18749-3-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
"physmem" in the memblock allocator is somewhat weird: it's not actually
used for allocation, it's simply information collected during boot, which
describes the unmodified physical memory map at boot time, without any
standby/hotplugged memory. It's only used on s390 and is currently the
only reason s390 keeps using CONFIG_ARCH_KEEP_MEMBLOCK.
Physmem isn't numa aware and current users don't specify any flags. Let's
hide it from the user, exposing only for_each_physmem(), and simplify. The
interface for physmem is now really minimalistic:
- memblock_physmem_add() to add ranges
- for_each_physmem() / __next_physmem_range() to walk physmem ranges
Don't place it into an __init section and don't discard it without
CONFIG_ARCH_KEEP_MEMBLOCK. As we're reusing __next_mem_range(), remove
the __meminit notifier to avoid section mismatch warnings once
CONFIG_ARCH_KEEP_MEMBLOCK is no longer used with
CONFIG_HAVE_MEMBLOCK_PHYS_MAP.
While fixing up the documentation, sneak in some related cleanups. We can
stop setting CONFIG_ARCH_KEEP_MEMBLOCK for s390 next.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Message-Id: <20200701141830.18749-2-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
This patch introduces the sysfs attributes serialnr and
mkvps for cex2c and cex3c cards. These sysfs attributes
are available for cex4c and higher since
commit 7c4e91c0959b ("s390/zcrypt: new sysfs attributes serialnr and mkvps"),
and this patch now provides the same for the older cex2
and cex3 cards.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
There is a state machine held for each ap queue device.
The states and functions related to it were sometimes
marked with _sm_, sometimes not. This patch clarifies
and renames all the ap queue state machine related functions,
enums and defines to have _sm_ in the name.
There is no functional change coming with this patch - it's
only beautifying code.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
The zcrypt_unlocked_ioctl() function has become large. So split the
4 ioctls ICARSAMODEXPO, ICARSACRT, ZSECSENDCPRB and ZSENDEP11CPRB out
into new static functions. This makes the code more readable and
is a preparation step for further improvements needed on these ioctls.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Some beautifications related to the internally used struct ap_message
and related code. Instead of one int carrying only the special flag,
a u32 flags field is now used.
In struct CPRBX the pointers to additional data are now marked
with __user. This required some changes to code where
these structs are also used within the zcrypt misc functions.
The ica_rsa_* structs now use the generic types __u8, __u32, ...
instead of char and unsigned int.
zcrypt_msg6 and zcrypt_msg50 use min_t() instead of min().
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Fix these smatch warnings:
zcrypt_api.c:986 _zcrypt_send_ep11_cprb() error: uninitialized symbol 'pref_weight'.
zcrypt_api.c:1008 _zcrypt_send_ep11_cprb() error: uninitialized symbol 'weight'.
zcrypt_api.c:676 zcrypt_rsa_modexpo() error: uninitialized symbol 'pref_weight'.
zcrypt_api.c:694 zcrypt_rsa_modexpo() error: uninitialized symbol 'weight'.
zcrypt_api.c:760 zcrypt_rsa_crt() error: uninitialized symbol 'pref_weight'.
zcrypt_api.c:778 zcrypt_rsa_crt() error: uninitialized symbol 'weight'.
zcrypt_api.c:824 _zcrypt_send_cprb() warn: always true condition '(tdom >= 0) => (0-u16max >= 0)'
zcrypt_api.c:846 _zcrypt_send_cprb() error: uninitialized symbol 'pref_weight'.
zcrypt_api.c:867 _zcrypt_send_cprb() error: uninitialized symbol 'weight'.
zcrypt_api.c:1065 zcrypt_rng() error: uninitialized symbol 'pref_weight'.
zcrypt_api.c:1079 zcrypt_rng() error: uninitialized symbol 'weight'.
zcrypt_cex4.c:251 ep11_card_op_modes_show() warn: should '(1 << ep11_op_modes[i]->mode_bit)' be a 64 bit type?
zcrypt_cex4.c:346 ep11_queue_op_modes_show() warn: should '(1 << ep11_op_modes[i]->mode_bit)' be a 64 bit type?
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Fix smatch warnings:
pkey_api.c:1606 pkey_ccacipher_aes_attr_read() warn: inconsistent indenting
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
segment_load() will no longer return -ENOSPC. If a segment overlaps with
storage, we now also return -EBUSY. Remove the stale comment from
__segment_load() and the stale handling from segment_warning().
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200630084240.8283-1-david@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Saves us a couple of bytes.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
In an effort to enable -Wcast-function-type in the top-level Makefile to
support Control Flow Integrity builds, remove all the function callback
casts.
To do this, modify the function prototypes accordingly.
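A generic before/after illustration of the pattern (hypothetical names, not
the s390 code touched here):

  #include <stdio.h>

  typedef void (*cb_t)(void *data);

  static void run_callback(cb_t cb, void *data) { cb(data); }

  /* Before: print_int() took an int *, so registering it needed a cast:
   *   run_callback((cb_t)print_int, &x);   // -Wcast-function-type, breaks CFI
   * After: the prototype itself matches cb_t, so no cast is required.
   */
  static void print_int(void *data)
  {
          printf("%d\n", *(int *)data);
  }

  int main(void)
  {
          int x = 42;

          run_callback(print_int, &x);    /* types line up, CFI-friendly */
          return 0;
  }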
Signed-off-by: Oscar Carter <oscar.carter@gmx.com>
Message-Id: <20200627125417.18887-1-oscar.carter@gmx.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
[heiko.carstens@de.ibm.com: coding style changes]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
I can't come up with a satisfying reason why we still need the memory
segment list. We used to represent in the list:
- boot memory
- standby memory added via add_memory()
- loaded dcss segments
When loading/unloading dcss segments, we already track them in a
separate list and check for overlaps
(arch/s390/mm/extmem.c:segment_overlaps_others()) when loading segments.
The overlap check was introduced for some segments in
commit b2300b9efe1b ("[S390] dcssblk: add >2G DCSSs support and stacked
contiguous DCSSs support.")
and was extended to cover all dcss segments in
commit ca57114609d1 ("s390/extmem: remove code for 31 bit addressing
mode").
Although I doubt that overlaps with boot memory and standby memory
are relevant, let's reshuffle the checks in load_segment() to request
the resource first. This will bail out in case we have overlaps with
other resources (esp. boot memory and standby memory). The order
is now different compared to segment_unload(), but
that should not matter.
This smells like a leftover from ancient times; let's get rid of it. We
can now convert vmem_remove_mapping() into a void function - everybody
ignored the return value already.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20200625150029.45019-1-david@redhat.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [DCSS]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
There are no secrets in these files, so allow all users
to read them.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
This code was detected with the help of Coccinelle, and audited and
fixed manually.
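A hypothetical before/after of such a conversion (not the code touched by
this patch):

  struct report {
          u16 count;
          struct item entries[];  /* flexible array member */
  };

  /* open-coded, easy to get the element type wrong: */
  r = kzalloc(sizeof(*r) + count * sizeof(struct item), GFP_KERNEL);

  /* with the helper from <linux/overflow.h>, which also guards against overflow: */
  r = kzalloc(struct_size(r, entries, count), GFP_KERNEL);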
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20200617212930.GA11728@embeddedor>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
There is no interface to userspace which exposes anything that would
require the struct __debug_entry definition. Therefore remove it from
uapi. This allows changing the definition, since it is only used
kernel internally.
The only exception is the crash utility, however that tool must handle
changes all the time anyway.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
There is not a single user of the debug raw view. Therefore remove it
before anybody uses it. If anybody were to make use of the view, it would
expose the struct __debug_entry definition to userspace and really
would make it uapi. This wouldn't be good, since the definition is
suboptimal and needs to be changed.
Right now the structure definition is only defined to be uapi, however
there is no user.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Remove unused /sys/kernel/debug/zcore/memmap device.
Since at least version 1.24.0 of s390-tools zfcpdump no longer
needs it and reads /proc/vmcore instead.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Instead of using the old 'jiffies + HZ {/,*} something' calculation,
use msecs_to_jiffies(), as that makes the code more readable.
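For example (hypothetical timer, not the specific call sites changed here):

  mod_timer(&my_timer, jiffies + HZ / 10);                /* old: 100ms, only obvious if you know HZ */
  mod_timer(&my_timer, jiffies + msecs_to_jiffies(100));  /* new: the interval is explicit */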
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
|
|
Some performance regressions on the reaim benchmark have been reported with
commit 070f5e860ee2 ("sched/fair: Take into account runnable_avg to classify group")
The problem comes from the init value of runnable_avg, which is initialized
with the max value. This can be a problem if the newly forked task turns out
to be a short task, because the group of CPUs is wrongly classified as
overloaded and tasks are pulled less aggressively.
Set the initial value of runnable_avg equal to util_avg to reflect that there
is no waiting time so far.
Fixes: 070f5e860ee2 ("sched/fair: Take into account runnable_avg to classify group")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200624154422.29166-1-vincent.guittot@linaro.org
|
|
Instead of relying on BUG_ON() to ensure the various data structures
line up, use a bunch of horrible unions to make it all automatic.
Much of the union magic is to ensure irq_work and smp_call_function do
not (yet) see the members of their respective data structures change
name.
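A generic illustration of the trick (hypothetical structs, not the actual
irq_work/call_single_data definitions):

  #include <stddef.h>

  /* Two legacy structs keep their member names... */
  struct legacy_irq_work { void (*func)(void *); void *arg;  };
  struct legacy_csd      { void (*fn)(void *);   void *info; };

  /* ...while a union guarantees they overlay the same storage: */
  union shared_entry_demo {
          struct legacy_irq_work irq;     /* irq_work users still say ->func / ->arg */
          struct legacy_csd      csd;     /* smp_call users still say ->fn / ->info  */
  };

  /* The "lines up" requirement becomes a compile-time property instead of a BUG_ON(): */
  _Static_assert(offsetof(union shared_entry_demo, irq.func) ==
                 offsetof(union shared_entry_demo, csd.fn),
                 "function pointer members must overlay");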
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20200622100825.844455025@infradead.org
|
|
Use a better name for this poorly named flag, to avoid confusion...
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200622100825.785115830@infradead.org
|
|
Paul reported rcutorture occasionally hitting a NULL deref:
sched_ttwu_pending()
  ttwu_do_wakeup()
    check_preempt_curr() := check_preempt_wakeup()
      find_matching_se()
        is_same_group()
          if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*
Debugging showed that this only appears to happen when we take the new
code-path from commit:
2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:
c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
this would've unconditionally hit:
smp_cond_load_acquire(&p->on_cpu, !VAL);
and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.
The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1, it really is still running, just not here.
Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existent
parents and NULL deref.
The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):
X->cpu = 1
rq(1)->curr = X

CPU1:                   // switch away from X
      LOCK rq(1)->lock
      smp_mb__after_spinlock
      dequeue_task(X)
      X->on_rq = 9
      switch_to(Z)
      X->on_cpu = 0
      UNLOCK rq(1)->lock

CPU2:                   // migrate X to cpu 0
      LOCK rq(1)->lock
      dequeue_task(X)
      set_task_cpu(X, 0)
      X->cpu = 0
      UNLOCK rq(1)->lock

      LOCK rq(0)->lock
      enqueue_task(X)
      X->on_rq = 1
      UNLOCK rq(0)->lock

CPU0:                   // switch to X
      LOCK rq(0)->lock
      smp_mb__after_spinlock
      switch_to(X)
      X->on_cpu = 1
      UNLOCK rq(0)->lock

CPU0:                   // X goes sleep
      X->state = TASK_UNINTERRUPTIBLE
      smp_mb();

CPU1:                   // wake X
      ttwu()
      LOCK X->pi_lock
      smp_mb__after_spinlock
      if (p->state)
        cpu = X->cpu; // =? 1
      smp_rmb()

CPU0:                   // X calls schedule()
      LOCK rq(0)->lock
      smp_mb__after_spinlock
      dequeue_task(X)
      X->on_rq = 0

CPU1:
      if (p->on_rq)
        smp_rmb();
      if (p->on_cpu && ttwu_queue_wakelist(..)) [*]
        smp_cond_load_acquire(&p->on_cpu, !VAL)
      cpu = select_task_rq(X, X->wake_cpu, ...)
      if (X->cpu != cpu)

CPU0:
      switch_to(Y)
      X->on_cpu = 0
      UNLOCK rq(0)->lock
However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.
(Most of the previous ttwu() races were found on very large PowerPC)
Nevertheless, this fully explains the observed failure case.
Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.
Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net
|
|
syzbot reported the following warning:
WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504
At deadline.c:628 we have:
623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
624 {
625         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
626         struct rq *rq = rq_of_dl_rq(dl_rq);
627
628         WARN_ON(dl_se->dl_boosted);
629         WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
    [...]
    }
Which means that setup_new_dl_entity() has been called on a task
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock, and boosted tasks shouldn't satisfy this
condition.
Digging through the PI code I noticed that the above might in fact happen
if an RT task blocks on an rt_mutex held by a DEADLINE task. In the
first branch of the boosting conditions we only check whether the pi_task's
'dynamic' deadline is earlier than the mutex holder's, and in this case we
set the mutex holder to be dl_boosted. However, since RT 'dynamic' deadlines
are only initialized if such tasks get boosted at some point (or if they
become DEADLINE of course), in general RT 'dynamic' deadlines are usually
equal to 0 and this satisfies the aforementioned condition.
Fix it by checking that the potential donor task is actually (even if
temporarily, because boosted in turn) running at DEADLINE priority before
using its 'dynamic' deadline value.
Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
|