Age | Commit message | Author | Files | Lines |
|
This renames cpu_read_ops() to init_cpu_ops() as the function is only
called in the initialization phase. Also, we will introduce get_cpu_ops() in
the subsequent patches, to retrieve the CPU operations for a given CPU
index. Otherwise, the usage of cpu_read_ops() and get_cpu_ops() would be
difficult to distinguish from their names.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
It's obvious we needn't declare the corresponding CPU operation when
CONFIG_ARM64_ACPI_PARKING_PROTOCOL is disabled, even though it doesn't
cause any compile warnings.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
At present ARMv8 event counters are limited to 32-bits, though by
using the CHAIN event it's possible to combine adjacent counters to
achieve 64-bits. The perf config1:0 bit can be set to use such a
configuration.
With the introduction of ARMv8.5-PMU support, all event counters can
now be used as 64-bit counters.
Let's enable 64-bit event counters where support exists. Unless the
user sets config1:0 we will adjust the counter value such that it
overflows upon 32-bit overflow. This follows the same behaviour as
the cycle counter which has always been (and remains) 64-bits.
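As an illustration of that behaviour, a minimal sketch (the helper and
predicate names below are hypothetical, not the kernel's exact code): when a
64-bit-capable counter backs an event the user did not mark via config1:0,
the upper 32 bits are pre-loaded so the counter still overflows at the
32-bit boundary:

  static u64 bias_long_counter(struct perf_event *event, u64 value)
  {
          /* 64-bit hardware counter, but legacy 32-bit semantics were
           * requested: pre-set the top half so a 32-bit wrap overflows. */
          if (counter_is_64bit(event) && !event_wants_64bit(event))
                  value |= GENMASK_ULL(63, 32);
          return value;
  }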
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: fix ID field names, compare with 8.5 value]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We currently expose the PMU version of the host to the guest via
emulation of the DFR0_EL1 and AA64DFR0_EL1 debug feature registers.
However many of the features offered beyond PMUv3 for 8.1 are not
supported in KVM. Examples of this include support for the PMMIR
registers (added in PMUv3 for ARMv8.4) and 64-bit event counters
(added in PMUv3 for ARMv8.5).
Let's trap the Debug Feature Registers in order to limit
PMUVer/PerfMon in the Debug Feature Registers to PMUv3 for ARMv8.1
to avoid unexpected behaviour.
Both ID_AA64DFR0.PMUVer and ID_DFR0.PerfMon follow the "Alternative ID
scheme used for the Performance Monitors Extension version" where 0xF
means an IMPLEMENTATION DEFINED PMU is implemented, and values 0x0-0xE
are treated the same as an unsigned field (with 0x0 meaning no PMU is
present). As we don't expect to expose an IMPLEMENTATION DEFINED PMU,
and our cap is below 0xF, we can treat these fields as unsigned when
applying the cap.
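A rough sketch of how the cap might be applied when emulating the register
read (using the helper introduced in the following log entry; the exact
macro names are assumptions):

  /* Sketch: limit the advertised PMU version to PMUv3 for ARMv8.1. */
  val = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1);
  val = cpuid_feature_cap_perfmon_field(val, ID_AA64DFR0_PMUVER_SHIFT,
                                        ID_AA64DFR0_PMUVER_8_1);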
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: make field names consistent, use perfmon cap]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When emulating ID registers there is often a need to cap the version
bits of a feature such that the guest will not use features that the
host is not aware of. For example, when KVM mediates access to the PMU
by emulating register accesses.
Let's add a helper that extracts a performance monitors ID field and
caps the version to a given value.
Fields that identify the version of the Performance Monitors Extension
do not follow the standard ID scheme, and instead follow the scheme
described in ARM DDI 0487E.a page D13-2825 "Alternative ID scheme used
for the Performance Monitors Extension version". The value 0xF means an
IMPLEMENTATION DEFINED PMU is present, and values 0x0-0xE can be treated
the same as an unsigned field with 0x0 meaning no PMU is present.
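A minimal sketch of such a helper, assuming 4-bit ID register fields and
the existing cpuid_feature_extract_unsigned_field(); details may differ
from the actual implementation:

  static inline u64 cpuid_feature_cap_perfmon_field(u64 features, int field,
                                                    u64 cap)
  {
          u64 val = cpuid_feature_extract_unsigned_field(features, field);
          u64 mask = GENMASK_ULL(field + 3, field);

          /* Treat an IMPLEMENTATION DEFINED PMU (0xF) as not present. */
          if (val == 0xf)
                  val = 0;

          if (val > cap) {
                  features &= ~mask;
                  features |= (cap << field) & mask;
          }
          return features;
  }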
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: rework to handle perfmon fields]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Reading this code bordered on painful, what with all the repetition and
pointless return values. More fundamentally, dribbling the hardware
enables and disables in one bit at a time incurs needless system
register overhead for chained events and on reset. We already use
bitmask values for the KVM hooks, so consolidate all the register
accesses to match, and make a reasonable saving in both source and
object code.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
snprintf() is a hard-to-use function, it's especially difficult to use
it for concatenating substrings in a buffer with a limited size.
Since snprintf() returns the would-be-output size, not the actual
size, the subsequent use of snprintf() may point to the incorrect
position easily. Although the current code doesn't actually overflow
the buffer, it's an incorrect usage.
This patch replaces such snprintf() calls with a safer version,
scnprintf().
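For illustration, a small concatenation sketch (the buffer and fields are
made up): scnprintf() returns the number of characters actually written, so
the running offset can never move past the end of the buffer, unlike
snprintf()'s would-be-output length:

  char buf[64];
  int len = 0;

  len += scnprintf(buf + len, sizeof(buf) - len, "event=0x%03llx", config);
  len += scnprintf(buf + len, sizeof(buf) - len, ",long=%d", long_mode);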
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When running the kernel with init_on_alloc=1, calling the default
implementation of __alloc_zeroed_user_highpage() from include/linux/highmem.h
leads to double-initialization of the allocated page (first by the page
allocator, then by clear_user_page()).
Calling alloc_page_vma() with __GFP_ZERO, similarly to e.g. x86, seems
to be enough to ensure the user page is zeroed only once.
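The arm64 override could then look roughly like the sketch below (the
exact GFP flags are an assumption), letting the page allocator do the
single zeroing pass:

  /* Sketch: zero via the page allocator instead of clear_user_page(). */
  #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
          alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)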
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The function __cpu_up() is invoked to bring up the target CPU through
the backend, PSCI for example. The nested if statements aren't needed
if we bail out early in the following two cases, where the status
won't be checked; this simplifies the code.
* Error returned from the backend (e.g. PSCI)
* The target CPU has been marked as onlined
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
|
|
Remove a redundant blank around the '=' operator; it may be more elegant.
Signed-off-by: hankecai <hankecai@vivo.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Remove unnecessary blank.
Signed-off-by: Li Tao <tao.li@vivo.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Add a blank after 'if' in armv8_deprecated_init()
to make it comply with the kernel coding style.
Signed-off-by: Zheng Wei <wei.zheng@vivo.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There is a spelling mistake in the comment; fix it.
Signed-off-by: hankecai <hankecai@bbktel.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The KVM hyp code is only run after system capabilities have been
finalized, and thus all const cap checks have been patched. This is
noted in __cpu_init_hyp_mode(), where we BUG() if called too early:
| /*
| * Call initialization code, and switch to the full blown HYP code.
| * If the cpucaps haven't been finalized yet, something has gone very
| * wrong, and hyp will crash and burn when it uses any
| * cpus_have_const_cap() wrapper.
| */
Given this, the hyp code can use cpus_have_final_cap() and avoid
generating code to check the cpu_hwcaps array, which would be unsafe to
run in hyp context.
This patch migrates the KVM hyp code to cpus_have_final_cap(), avoiding
this redundant code generation, and making it possible to detect if we
accidentally invoke this code too early. In the latter case, the BUG()
in cpus_have_final_cap() will cause a hyp panic.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When cpus_have_const_cap() was originally introduced it was intended to
be safe in hyp context, where it is not safe to access the cpu_hwcaps
array as cpus_have_cap() did. For more details see commit:
a4023f682739439b ("arm64: Add hypervisor safe helper for checking constant capabilities")
We then made use of cpus_have_const_cap() throughout the kernel.
Subsequently, we had to defer updating the static_key associated with
each capability in order to avoid lockdep complaints. To avoid breaking
kernel-wide usage of cpus_have_const_cap(), this was updated to fall
back to the cpu_hwcaps array if called before the static_keys were
updated. As the kvm hyp code was only called later than this, the
fallback is redundant but not functionally harmful. For more details,
see commit:
63a1e1c95e60e798 ("arm64/cpufeature: don't use mutex in bringup path")
Today we have more users of cpus_have_const_cap() which are only called
once the relevant static keys are initialized, and it would be
beneficial to avoid the redundant code.
To that end, this patch adds a new helper, cpus_have_final_cap(), which
is intended to be used in code which is only run once capabilities have
been finalized, and will never check the cpus_hwcap array. This helps
the compiler to generate better code as it no longer needs to generate
code to address and test the cpus_hwcap array. To help catch misuse,
cpus_have_final_cap() will BUG() if called before capabilities are
finalized.
In hyp context, BUG() will result in a hyp panic, but the specific BUG()
instance will not be identified in the usual way.
Comments are added to the various cpus_have_*_cap() helpers to describe
the constraints on when they can be used. For clarity cpus_have_cap() is
moved above the other helpers. Similarly the helpers are updated to use
system_capabilities_finalized() consistently, and this is made
__always_inline as required by its new callers.
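In rough terms (a sketch, assuming the existing __cpus_have_const_cap()
and system_capabilities_finalized() helpers), the new helper behaves like:

  static __always_inline bool cpus_have_final_cap(int num)
  {
          if (system_capabilities_finalized())
                  return __cpus_have_const_cap(num); /* static key only */
          else
                  BUG();                             /* called too early */
  }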
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Currently, the EL0 SP alignment handler masks IRQs unnecessarily. It
does so due to historic code sharing of the EL0 SP and PC alignment
handlers, and branch predictor hardening applicable to the EL0 SP
handler.
We began masking IRQs in the EL0 SP alignment handler in commit:
5dfc6ed27710c42c ("arm64: entry: Apply BP hardening for high-priority synchronous exception")
... as this shared code with the EL0 PC alignment handler, and branch
predictor hardening made it necessary to disable IRQs for early parts of
the EL0 PC alignment handler. It was not necessary to mask IRQs during
EL0 SP alignment exceptions, but it was not considered harmful to do so.
This masking was carried forward into C code in commit:
582f95835a8fc812 ("arm64: entry: convert el0_sync to C")
... where the SP/PC cases were split into separate handlers, and the
masking duplicated.
Subsequently the EL0 PC alignment handler was refactored to perform
branch predictor hardening before unmasking IRQs, in commit:
bfe298745afc9548 ("arm64: entry-common: don't touch daif before bp-hardening")
... but the redundant masking of IRQs was not removed from the EL0 SP
alignment handler.
Let's do so now, and make it interruptible as with most other
synchronous exception handlers.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
|
|
efi-entry.o is built on demand for efi-entry.stub.o, so you do not have
to repeat $(CONFIG_EFI) here. Adding it to 'targets' is enough.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
|
|
Throwing our __uint128_t idioms at csum_ipv6_magic() makes it
about 1.3x-2x faster across a range of microarchitecture/compiler
combinations. Not much in absolute terms, but every little helps.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Fix typo in a comment in arch/arm64/include/asm/esr.h
"Unallocted" -> "Unallocated"
Signed-off-by: Chenggang Wang <wangchenggang@vivo.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Using an arch timer with a frequency of less than 1MHz can potentially
result in incorrect functionality in systems that assume a reasonable
rate of the arch timer of 1 to 50MHz, described as typical in the
architecture specification.
Therefore, warn if the arch timer rate is below 1MHz, which is
considered atypical and worth emphasizing.
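A minimal sketch of such a check (the threshold and message wording are
illustrative):

  /* Sketch: flag atypically slow architected timers. */
  if (rate < 1000000)
          pr_warn("frequency %lu Hz is below the typical 1-50MHz range\n",
                  rate);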
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The Frequency Invariance Engine (FIE) is providing a frequency
scaling correction factor that helps achieve more accurate
load-tracking.
So far, for arm and arm64 platforms, this scale factor has been
obtained based on the ratio between the current frequency and the
maximum supported frequency recorded by the cpufreq policy. The
setting of this scale factor is triggered from cpufreq drivers by
calling arch_set_freq_scale. The current frequency used in computation
is the frequency requested by a governor, but it may not be the
frequency that was implemented by the platform.
This correction factor can also be obtained using a core counter and a
constant counter to get information on the performance (frequency based
only) obtained in a period of time. This will more accurately reflect
the actual current frequency of the CPU, compared with the alternative
implementation that reflects the request of a performance level from
the OS.
Therefore, implement arch_scale_freq_tick to use activity monitors, if
present, for the computation of the frequency scale factor.
The use of AMU counters depends on:
- CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
- CONFIG_CPU_FREQ - the current frequency obtained using counter
information is divided by the maximum frequency obtained from the
cpufreq policy.
While it is possible to have a combination of CPUs in the system with
and without support for activity monitors, the use of counters for
frequency invariance is only enabled for a CPU if all related CPUs
(CPUs in the same frequency domain) support and have enabled the core
and constant activity monitor counters. In this way, there is a clear
separation between the policies for which arch_set_freq_scale (cpufreq
based FIE) is used, and the policies for which arch_scale_freq_tick
(counter based FIE) is used to set the frequency scale factor. For
this purpose, a late_initcall_sync is registered to trigger validation
work for policies that will enable or disable the use of AMU counters
for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use
of counters is enabled on all CPUs only if all possible CPUs correctly
support the necessary counters.
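For illustration, a simplified sketch of the per-tick computation under
the assumptions above (variable names are illustrative and overflow
handling is omitted): the core/constant counter deltas give the average
frequency over the last tick, which is then normalised against the maximum
frequency:

  u64 core_delta  = core_cnt  - prev_core_cnt;  /* cycles at current freq */
  u64 const_delta = const_cnt - prev_const_cnt; /* ticks at a fixed rate  */

  /* f_cur = core_delta * const_freq / const_delta; scale = f_cur / max_freq */
  u64 scale = div64_u64(core_delta * SCHED_CAPACITY_SCALE * const_freq,
                        const_delta * max_freq);

  this_cpu_write(freq_scale,
                 (unsigned long)min(scale, (u64)SCHED_CAPACITY_SCALE));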
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Add weak function to return the hardware maximum frequency of a CPU,
with the default implementation returning cpuinfo.max_freq, which is
the best information we can generically get from the cpufreq framework.
The default can be overwritten by a strong function in platforms
that want to provide an alternative implementation, with more accurate
information, obtained either from hardware or firmware.
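A minimal sketch of what the weak default might look like (the function
name and cpufreq accessors here are assumptions):

  __weak unsigned int cpufreq_get_hw_max_freq(unsigned int cpu)
  {
          struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
          unsigned int max_freq = 0;

          if (policy) {
                  max_freq = policy->cpuinfo.max_freq;
                  cpufreq_cpu_put(policy);
          }
          return max_freq;
  }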
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The activity monitors extension is an optional extension introduced
by the ARMv8.4 CPU architecture.
Add initial documentation for the AMUv1 extension:
- arm64/amu.txt: AMUv1 documentation
- arm64/booting.txt: system registers initialisation
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Access to the AMU counters should be disabled by default in kvm guests,
as information from the counters might reveal activity in other guests
or activity on the host.
Therefore, disable access to AMU registers from EL0 and EL1 in kvm
guests by:
- Hiding the presence of the extension in the feature register
(SYS_ID_AA64PFR0_EL1) on the VCPU.
- Disabling access to the AMU registers before switching to the guest.
- Trapping accesses and injecting an undefined instruction into the
guest.
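The first of these steps might look roughly like the sketch below when
emulating the ID register read for a vcpu (the field macro name is an
assumption):

  /* Sketch: hide the AMU field from the guest's view of ID_AA64PFR0_EL1. */
  if (id == SYS_ID_AA64PFR0_EL1)
          val &= ~(0xfUL << ID_AA64PFR0_AMU_SHIFT);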
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The activity monitors extension is an optional extension introduced
by the ARMv8.4 CPU architecture. In order to access the activity
monitors counters safely, if desired, the kernel should detect the
presence of the extension through the feature register, and mediate
the access.
Therefore, disable direct accesses to activity monitors counters
from EL0 (userspace) and trap them to EL1 (kernel).
Note that the ARM64_AMU_EXTN kernel config does not affect this code.
Given that amuserenr_el0 resets to an UNKNOWN value, trapping EL0
accesses to EL1 is always attempted for safety and security
considerations. Therefore firmware
should still ensure accesses to AMU registers are not trapped in
EL2/EL3 as this code cannot be bypassed if the CPU implements the
Activity Monitors Unit.
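In C-like terms the trap amounts to clearing AMUSERENR_EL0 on every CPU
(a sketch; the actual change may live in early assembly CPU setup, and the
sysreg accessor shown is an assumption):

  /* Sketch: deny EL0 access to AMU counters; such accesses trap to EL1. */
  write_sysreg_s(0, SYS_AMUSERENR_EL0);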
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The activity monitors extension is an optional extension introduced
by the ARMv8.4 CPU architecture. This implements basic support for
version 1 of the activity monitors architecture, AMUv1.
This support includes:
- Extension detection on each CPU (boot, secondary, hotplugged)
- Register interface for AMU aarch64 registers
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There are no applicable literals above them.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Remi Denis-Courmont <remi.denis.courmont@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.
This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free() and
___remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.
It makes two distinct passes over the kernel page table. In the first pass
with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and
frees backing memory if required (vmemmap) for each mapped leaf entry. In
the second pass with free_empty_tables() it looks for empty page table
sections whose page table page can be unmapped, TLB invalidated and freed.
While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.
The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages with
a concurrent vmalloc() walking the kernel page table. This conflict cannot
simply be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear-down in free_pgd_range(),
which skips freeing page table pages if the intermediate address range is not
aligned or the maximum floor-ceiling might not own the entire page table page.
Boot memory on arm64 cannot be removed. Hence this registers a new memory
hotplug notifier which prevents boot memory offlining and its removal.
While here, update arch_add_memory() to handle __add_pages() failures by
just unmapping the recently added kernel linear mapping. Now enable memory
hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
This implementation is broadly inspired by the kernel page table tear-down
procedure on x86 and the user page table tear-down method.
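Structurally, the two passes described above could be sketched as below
(the wrapper and argument details are illustrative; only the helper names
come from this change):

  static void tear_down_range(unsigned long start, unsigned long end,
                              bool free_mapped, unsigned long floor,
                              unsigned long ceiling)
  {
          /* Pass 1: unmap leaf entries, invalidate the TLB, and free the
           * backing memory when tearing down vmemmap (free_mapped). */
          unmap_hotplug_range(start, end, free_mapped);

          /* Pass 2: free page table pages that became empty, respecting
           * floor/ceiling so tables shared with e.g. vmalloc are kept. */
          free_empty_tables(start, end, floor, ceiling);
  }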
[Mike and Catalin added P4D page table level support]
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.
When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.
Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.
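A sketch of the locking pattern (the walk function name is illustrative):
hold the memory hotplug read lock for the duration of the walk so
intermediate tables cannot be freed underneath it:

  get_online_mems();
  walk_kernel_page_tables(m);   /* dump the kernel page tables */
  put_online_mems();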
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Add support for matching the new PMUs. For now, this just wires them up
as generic PMUv3 such that people writing DTs for new SoCs can do the
right thing, and at least have architectural and raw events be usable.
We can come back and fill in event maps for sysfs and/or perf tools at
a later date.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The PMU init callbacks are already drowning in boilerplate, so before
doubling the number of supported PMU models, give it a sensible refactor
to significantly reduce the bloat, both in source and object code.
Although nobody uses non-default sysfs attributes today, there's minimal
impact to preserving the notion that maybe, some day, somebody might, so
we may as well keep up appearances.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We already check that the 'nr_pages' is > 2, so there's no need to check
that it's != 0 later on.
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 25b92693a1b6 ("arm64: mm: convert cpu_do_switch_mm() to C") added
a new use of the ASID() macro, so update the comment in asm/mmu.h which
reasons about why an atomic reload of 'mm->context.id.counter' is not
required.
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Function sdei_event_find() is always called in sdei_event_create(), but
it is already called in sdei_event_register(). This code is trying to
avoid a double-create of the same event, which can't happen as we still
hold the sdei_events_lock. We can remove this needless sdei_event_find()
call.
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
[expanded commit message]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
SDEI has private events that need registering and enabling on each CPU.
CPUs can come and go while we are trying to do this. SDEI tries to avoid
these problems by setting the reregister flag before the register call,
so any CPUs that come online register the event too. Sticking plaster
like this doesn't work, as if the register call fails, a CPU that
subsequently comes online will register the event before reregister
is cleared.
Take cpus_read_lock() around the register and enable calls. We don't
want surprise CPUs to do the wrong thing if they race with these calls
failing.
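A sketch of the intended pattern (the per-CPU register/enable helpers are
hypothetical names): quiesce CPU hotplug so no CPU can come online halfway
through a registration that may still fail:

  cpus_read_lock();
  err = sdei_private_event_register(event);
  if (!err)
          err = sdei_private_event_enable(event);
  cpus_read_unlock();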
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
We call sdei_reregister_event() with sdei_list_lock held; if the register
fails we call sdei_event_destroy(), which also acquires sdei_list_lock,
thus creating an A-A deadlock.
Add '_llocked' to sdei_reregister_event(), to indicate the list lock
is held, and add a _llocked variant of sdei_event_destroy().
Fixes: da351827240e ("firmware: arm_sdei: Add support for CPU and system power states")
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
[expanded subject, added wrappers instead of duplicating contents]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
SDEI has private events that must be registered on each CPU. When
CPUs come and go they must re-register and re-enable their private
events. Each event has flags to indicate whether this should happen
to protect against an event being registered on a CPU coming online,
while all the others are unregistering the event.
These flags are protected by the sdei_list_lock spinlock, because
the cpuhp callbacks can't take the mutex.
Hibernate needs to unregister all events, but keep the in-memory
re-register and re-enable as they are. sdei_unregister_shared()
takes the spinlock to walk the list, then calls _sdei_event_unregister()
on each shared event. _sdei_event_unregister() tries to take the
same spinlock to update re-register and re-enable. This doesn't go
so well.
Push the re-register and re-enable updates out to their callers.
sdei_unregister_shared() doesn't want these values updated, so
doesn't need to do anything.
This also fixes shared events getting lost over hibernate as this
path made them look unregistered.
Fixes: da351827240e ("firmware: arm_sdei: Add support for CPU and system power states")
Reported-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There's no reason that cpu_do_switch_mm() needs to be written as an
assembly function, and having it as a C function would make it easier to
maintain.
This patch converts cpu_do_switch_mm() to C, removing code that this
change makes redundant (e.g. the mmid macro). Since the header comment
was stale and the prototype now implies all the necessary information,
this comment is removed. The 'pgd_phys' argument is made a phys_addr_t
to match the return type of virt_to_phys().
At the same time, post_ttbr_update_workaround() is updated to use
IS_ENABLED(), which allows the compiler to figure out it can elide calls
for !CONFIG_CAVIUM_ERRATUM_27456 builds.
There should be no functional change as a result of this patch.
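For illustration, a much-simplified sketch of the C version (CnP, SW-PAN
and erratum handling omitted; not the exact code):

  void cpu_do_switch_mm(phys_addr_t pgd_phys, struct mm_struct *mm)
  {
          unsigned long ttbr0 = phys_to_ttbr(pgd_phys);

          /* Install the new ASID together with the new table base. */
          ttbr0 |= FIELD_PREP(TTBR_ASID_MASK, ASID(mm));
          write_sysreg(ttbr0, ttbr0_el1);
          isb();

          post_ttbr_update_workaround();
  }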
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
[catalin.marinas@arm.com: change comments from asm-style to C-style]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Fix typos in arch/arm64/Kconfig:
- spell Numa as NUMA
- add hyphenation to Non-Uniform
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The C-Sky platform code is not a clock provider, and just needs to call
of_clk_init().
Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
|
|
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already have events pending, we won't enter the poll
loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the app
has been terminated and doesn't reap the pending events which are already
in the cq ring, and there are some reqs in poll_list, io_sq_thread will
enter __io_iopoll_check(), find pending events and then return; this loop
will never have a chance to exit.
I have seen this issue in fio stress tests. To fix this issue, let
io_sq_thread call io_iopoll_getevents() with the argument 'min' being zero,
and remove __io_iopoll_check().
Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When CONFIG_QFMT_V2 is configured as a module, the test in
ext4_feature_set_ok() fails and so mount of filesystems with quota or
project features fails. Fix the test to use IS_ENABLED macro which
works properly even for modules.
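As a small illustration (the feature-check helpers are assumed to match
ext4's existing ones), IS_ENABLED() evaluates to true for both built-in
(=y) and modular (=m) configurations, whereas an #ifdef only catches the
built-in case:

  /* Sketch: refuse the mount only if the v2 quota format is unavailable. */
  if (!IS_ENABLED(CONFIG_QFMT_V2) &&
      (ext4_has_feature_quota(sb) || ext4_has_feature_project(sb))) {
          ext4_msg(sb, KERN_ERR,
                   "Filesystem with quota feature cannot be mounted RDWR without CONFIG_QFMT_V2");
          return 0;
  }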
Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
The running environment:
kernel version: 4.19
A cluster with two nodes, 5 LUNs mounted on both nodes, running file
operations like dd/fallocate/truncate/rm on every LUN while the storage
network is disconnected.
The fallocate operation on dm-23-45 caused a null pointer dereference.
The NULL pointer dereference information is as follows:
[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
[577992.878290] Aborting journal on device dm-23-45.
...
[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
[577992.890908] __journal_remove_journal_head: freeing b_committed_data
[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
[577992.890918] __journal_remove_journal_head: freeing b_committed_data
[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
[577992.890922] __journal_remove_journal_head: freeing b_committed_data
[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
[577992.890928] __journal_remove_journal_head: freeing b_committed_data
[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
[577992.890933] __journal_remove_journal_head: freeing b_committed_data
[577992.890939] __journal_remove_journal_head: freeing b_committed_data
[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[577992.890950] Mem abort info:
[577992.890951] ESR = 0x96000004
[577992.890952] Exception class = DABT (current EL), IL = 32 bits
[577992.890952] SET = 0, FnV = 0
[577992.890953] EA = 0, S1PTW = 0
[577992.890954] Data abort info:
[577992.890955] ISV = 0, ISS = 0x00000004
[577992.890956] CM = 0, WnR = 0
[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
[577992.890960] [0000000000000020] pgd=0000000000000000
[577992.890964] Internal error: Oops: 96000004 [#1] SMP
[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1
[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
[577992.891084] sp : ffff0000c8e2b810
[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
[577992.891103] x11: 0000000000000038 x10: 0101010101010101
[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
[577992.891116] Call trace:
[577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
[577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2]
[577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2]
[577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
[577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
[577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
[577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
[577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
[577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2]
[577992.891316] vfs_fallocate+0x11c/0x250
[577992.891317] ksys_fallocate+0x54/0x88
[577992.891319] __arm64_sys_fallocate+0x28/0x38
[577992.891323] el0_svc_common+0x78/0x130
[577992.891325] el0_svc_handler+0x38/0x78
[577992.891327] el0_svc+0x8/0xc
My analysis is as follows:
ocfs2_fallocate
  __ocfs2_change_file_space
    ocfs2_allocate_extents
      ocfs2_extend_allocation
        ocfs2_add_inode_data
          ocfs2_add_clusters_in_btree
            ocfs2_insert_extent
              ocfs2_do_insert_extent
                ocfs2_rotate_tree_right
                  ocfs2_extend_rotate_transaction
                    ocfs2_extend_trans
                      jbd2_journal_restart
                        jbd2__journal_restart
                          /* handle->h_transaction is NULL,
                           * is_handle_aborted(handle) is true
                           */
                          handle->h_transaction = NULL;
                          start_this_handle
                            return -EROFS;
            ocfs2_free_clusters
              _ocfs2_free_clusters
                _ocfs2_free_suballoc_bits
                  ocfs2_block_group_clear_bits
                    ocfs2_journal_access_gd
                      __ocfs2_journal_access
                        jbd2_journal_get_undo_access
                          /* I think jbd2_write_access_granted() will
                           * return true, because do_get_write_access()
                           * will return -EROFS.
                           */
                          if (jbd2_write_access_granted(...)) return 0;
                          do_get_write_access
                            /* handle->h_transaction is NULL, it will
                             * return -EROFS here, so do_get_write_access()
                             * was not called.
                             */
                            if (is_handle_aborted(handle)) return -EROFS;
                    /* bh2jh(group_bh) is NULL, caused NULL
                     * pointer dereference */
                    undo_bg = (struct ocfs2_group_desc *)
                                    bh2jh(group_bh)->b_committed_data;
If handle->h_transaction == NULL, then jbd2_write_access_granted()
does not really guarantee that journal_head will stay around,
not even speaking of its b_committed_data. The bh2jh(group_bh)
can be removed after ocfs2_journal_access_gd() and before the call to
"bh2jh(group_bh)->b_committed_data". So, we should move the
is_handle_aborted() check from do_get_write_access() into
jbd2_journal_get_undo_access() and jbd2_journal_get_write_access()
before the call to jbd2_write_access_granted().
Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
|
|
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:
WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
Here's a minimal reproducer (not 100% reliable) (root isn't required):
while true; do
sync
done &
while true; do
rm -f file
touch file
chattr -e file
echo X >> file
chattr +e file
done
The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.
Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).
This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.
Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
|
|
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.
Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
|
|
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Use tabs for indentation instead of spaces for SAFESETID. All (!) other
entries in MAINTAINERS use tabs (according to my simple grepping).
Link: http://lkml.kernel.org/r/2bb2e52a-2694-816d-57b4-6cabfadd6c1a@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Micah Morton <mortonm@chromium.org>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Walter Wu has reported a potential case in which init_stack_slab() is
called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
initialized. In that case init_stack_slab() will overwrite
stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in a memory
corruption.
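In outline, the fix is simply to avoid touching a slot past the end of the
array (a sketch; the surrounding logic of init_stack_slab() is omitted):

  /* Sketch: only pre-initialise the next slab if a next slot exists. */
  if (depot_index + 1 < STACK_ALLOC_MAX_SLABS)
          stack_slabs[depot_index + 1] = *prealloc;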
Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
Fixes: cd11016e5f521 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
doesn't work before sparse_init_one_section() is called.
This leads to a crash when hotplugging memory:
BUG: unable to handle page fault for address: 0000000006400000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:__memset+0x24/0x30
Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
sparse_add_section+0x1c9/0x26a
__add_pages+0xbf/0x150
add_pages+0x12/0x60
add_memory_resource+0xc8/0x210
__add_memory+0x62/0xb0
acpi_memory_device_add+0x13f/0x300
acpi_bus_attach+0xf6/0x200
acpi_bus_scan+0x43/0x90
acpi_device_hotplug+0x275/0x3d0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x1a7/0x370
worker_thread+0x30/0x380
kthread+0x112/0x130
ret_from_fork+0x35/0x40
We should use the memmap directly, as the code did before.
On x86 the impact is limited to x86_32 builds, or x86_64 configurations
that override the default setting for SPARSEMEM_VMEMMAP.
Other memory hotplug archs (arm64, ia64, and ppc) also default to
SPARSEMEM_VMEMMAP=y.
[dan.j.williams@intel.com: changelog update]
[rppt@linux.ibm.com: changelog update]
Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|