Add the MIDR encoding for HiSilicon HIP12, which is used on HiSilicon
HIP12 SoCs.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20250425033845.57671-2-yangyicong@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 5b39db6037e7 ("arm64: el2_setup.h: Rename some labels to be more
diff-friendly") reworked the labels in __init_el2_fgt to say what's
skipped rather than what the target location is. The exception was
"set_fgt_" which is where registers are written. In reviewing the BRBE
additions, Will suggested "set_debug_fgt_" where HDFGxTR_EL2 are
written. Doing that would partially revert commit 5b39db6037e7 undoing
the goal of minimizing additions here, but it would follow the
convention for labels where registers are written.
So let's do both. Branches that skip something go to a "skip" label and
places that set registers have a "set" label. This results in some
double labels, but it makes things entirely consistent.
While we're here, the SME skip label was incorrectly named, so fix it.
Reported-by: Will Deacon <will@kernel.org>
Cc: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20250520-arm-brbe-v19-v22-2-c1ddde38e7f8@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
An ACPI binding for CMN S3 was not yet finalised when the driver support
was originally written, but v1.2 of DEN0093 "ACPI for Arm Components"
has at last been published; support ACPI systems using the proper HID.
Cc: stable@vger.kernel.org
Fixes: 0dc2f4963f7e ("perf/arm-cmn: Support CMN S3")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/7dafe147f186423020af49d7037552ee59c60e97.1747652164.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
BSS might be uninitialized when entering the startup code, so forbid the
use by the startup code of any variables that live after __bss_start in
the linker map.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250508114328.2460610-8-ardb+git@google.com
[will: Drop export of 'memstart_offset_seed', as this has been removed]
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Accessing BSS will no longer be permitted from the startup code in
arch/arm64/kernel/pi, as some of it executes before BSS is cleared.
Clearing BSS earlier would involve managing cache coherency explicitly
in software, which is a hassle we prefer to avoid.
So move some variables that are assigned by the startup code out of BSS
and into .data.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250508114328.2460610-7-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
init_pgdir[] is only referenced from the startup code, but lives after
BSS in the linker map. Before tightening the rules about accessing BSS
from startup code, move init_pgdir[] into the __pi_ namespace, so it
does not need to be exported explicitly.
For symmetry, do the same with init_idmap_pgdir[], although it lives
before BSS.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250508114328.2460610-6-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For all the complexity of handling affinity for CPU hotplug, what we've
apparently managed to overlook is that arm_cmn_init_irqs() has in fact
always been setting the *initial* affinity of all IRQs to CPU 0, not the
CPU we subsequently choose for event scheduling. Oh dear.
Cc: stable@vger.kernel.org
Fixes: 0ba64770a2f2 ("perf: Add Arm CMN-600 PMU driver")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/b12fccba6b5b4d2674944f59e4daad91cd63420b.1747069914.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When running 'make' in tools/testing/selftests/arm64/ without explicitly
setting the OUTPUT variable, the build system creates test directories
(e.g., /bti) in the root filesystem because OUTPUT defaults to an empty
string. This causes unintended pollution of the root directory.
Add proper handling for the OUTPUT variable: set it to the current
directory (.) if it is not specified.
Signed-off-by: tanze <tanze@kylinos.cn>
Link: https://lore.kernel.org/r/20250515051839.3409658-1-tanze@kylinos.cn
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The values stored in __boot_cpu_mode were changed without updating the
comment. Rectify that.
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20250513124525.677736-1-ben.horgan@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The pmd_val(pmd) check is redundant because pmd_present(pmd) being true
already ensures that pmd_val(pmd) is nonzero, given the definitions
below.
#define pmd_val(x) ((x).pmd)
#define pmd_present(pmd) pte_present(pmd_pte(pmd))
#define pte_present(pte) (pte_valid(pte) || pte_present_invalid(pte))
#define pte_valid(pte) (!!(pte_val(pte) & PTE_VALID))
#define pte_present_invalid(pte) \
((pte_val(pte) & (PTE_VALID | PTE_PRESENT_INVALID)) == PTE_PRESENT_INVALID)
pte_present() can't be true unless at least one of the flags PTE_VALID
or PTE_PRESENT_INVALID is set, in which case pmd_val(pmd) is necessarily
nonzero as well.
So let's drop the redundant pmd_val(pmd) check; no functional change
intended.
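As a purely illustrative sketch (the helper name here is hypothetical,
not the exact macro being changed), the simplification has this shape:

  /* Before: pmd_val() checked redundantly alongside pmd_present(). */
  #define pmd_leaf_old(pmd)  (pmd_val(pmd) && pmd_present(pmd) && !pmd_table(pmd))

  /* After: pmd_present() already implies pmd_val(pmd) != 0. */
  #define pmd_leaf_new(pmd)  (pmd_present(pmd) && !pmd_table(pmd))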
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250508085251.204282-1-gshan@redhat.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
mov_q cannot move the PIE_E[0|1] macros into a general purpose register
as expected when those macro constants contain 128-bit layout elements,
which are required for D128 page tables. The primary issue is that for
D128, PIE_E[0|1] are defined in terms of 128-bit types with shifting and
masking, which the assembler cannot accommodate.
Instead, pre-calculate these PIRE0_EL1/PIR_EL1 constants as
PIE_E0_ASM/PIE_E1_ASM in asm-offsets.h, which can then be used in
arch/arm64/mm/proc.S.
While here, also drop the PTE_MAYBE_NG/PTE_MAYBE_SHARED assembly
overrides, which are no longer required, as the compiler toolchains are
smart enough to compute both the PIE_[E0|E1]_ASM constants in all
scenarios.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250429050511.1663235-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
lazy_mmu_mode is not supposed to permit nesting. But in practice this
does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation inside
a lazy_mmu_mode section (such as zap_pte_range()) will change
permissions on the linear map with apply_to_page_range(), which
re-enters lazy_mmu_mode (see stack trace below).
This was previously triggering the warning that checks for nesting. So
let's relax the check: remove the warning and tolerate nesting in the
arm64 implementation. The first (inner) call to
arch_leave_lazy_mmu_mode() will flush and clear the flag, such that the
remainder of the work in the outer nest behaves as if outside of lazy
mmu mode. This is safe and keeps the tracking simple.
Code review suggests powerpc deals with this issue in the same way.
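A minimal sketch of the relaxed exit path, assuming the TIF-flag
tracking used by arm64 (details illustrative):

  static inline void arch_leave_lazy_mmu_mode(void)
  {
          /*
           * The first (inner) leave flushes and clears the flag, so any
           * remaining work in the outer nest behaves as if outside of
           * lazy mmu mode; the outer leave then returns early.
           */
          if (!test_thread_flag(TIF_LAZY_MMU))
                  return;

          if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
                  emit_pte_barriers();    /* dsb(ishst); isb() */

          clear_thread_flag(TIF_LAZY_MMU);
  }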
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1 at arch/arm64/include/asm/pgtable.h:89 __apply_to_page_range+0x85c/0x9f8
Modules linked in: ip_tables x_tables ipv6
CPU: 6 UID: 0 PID: 1 Comm: systemd Not tainted 6.15.0-rc5-00075-g676795fe9cf6 #1 PREEMPT
Hardware name: QEMU KVM Virtual Machine, BIOS 2024.08-4 10/25/2024
pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __apply_to_page_range+0x85c/0x9f8
lr : __apply_to_page_range+0x2b4/0x9f8
sp : ffff80008009b3c0
x29: ffff80008009b460 x28: ffff0000c43a3000 x27: ffff0001ff62b108
x26: ffff0000c43a4000 x25: 0000000000000001 x24: 0010000000000001
x23: ffffbf24c9c209c0 x22: ffff80008009b4d0 x21: ffffbf24c74a3b20
x20: ffff0000c43a3000 x19: ffff0001ff609d18 x18: 0000000000000001
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000003
x14: 0000000000000028 x13: ffffbf24c97c1000 x12: ffff0000c43a3fff
x11: ffffbf24cacc9a70 x10: ffff0000c43a3fff x9 : ffff0001fffff018
x8 : 0000000000000012 x7 : ffff0000c43a4000 x6 : ffff0000c43a4000
x5 : ffffbf24c9c209c0 x4 : ffff0000c43a3fff x3 : ffff0001ff609000
x2 : 0000000000000d18 x1 : ffff0000c03e8000 x0 : 0000000080000000
Call trace:
__apply_to_page_range+0x85c/0x9f8 (P)
apply_to_page_range+0x14/0x20
set_memory_valid+0x5c/0xd8
__kernel_map_pages+0x84/0xc0
get_page_from_freelist+0x1110/0x1340
__alloc_frozen_pages_noprof+0x114/0x1178
alloc_pages_mpol+0xb8/0x1d0
alloc_frozen_pages_noprof+0x48/0xc0
alloc_pages_noprof+0x10/0x60
get_free_pages_noprof+0x14/0x90
__tlb_remove_folio_pages_size.isra.0+0xe4/0x140
__tlb_remove_folio_pages+0x10/0x20
unmap_page_range+0xa1c/0x14c0
unmap_single_vma.isra.0+0x48/0x90
unmap_vmas+0xe0/0x200
vms_clear_ptes+0xf4/0x140
vms_complete_munmap_vmas+0x7c/0x208
do_vmi_align_munmap+0x180/0x1a8
do_vmi_munmap+0xac/0x188
__vm_munmap+0xe0/0x1e0
__arm64_sys_munmap+0x20/0x38
invoke_syscall+0x48/0x104
el0_svc_common.constprop.0+0x40/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x4c/0x16c
el0t_64_sync_handler+0x10c/0x140
el0t_64_sync+0x198/0x19c
irq event stamp: 281312
hardirqs last enabled at (281311): [<ffffbf24c780fd04>] bad_range+0x164/0x1c0
hardirqs last disabled at (281312): [<ffffbf24c89c4550>] el1_dbg+0x24/0x98
softirqs last enabled at (281054): [<ffffbf24c752d99c>] handle_softirqs+0x4cc/0x518
softirqs last disabled at (281019): [<ffffbf24c7450694>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
Fixes: 5fdd05efa1cd ("arm64/mm: Batch barriers when updating kernel mappings")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Closes: https://lore.kernel.org/linux-arm-kernel/aCH0TLRQslXHin5Q@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250512150333.5589-1-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit 5fdd05efa1cd ("arm64/mm: Batch barriers when updating kernel
mappings") enabled arm64 kernels to track "lazy mmu mode" using TIF
flags in order to defer barriers until exiting the mode. At the same
time, it added warnings to check that pte manipulations were never
performed in interrupt context, because the tracking implementation
could not deal with nesting.
But it turns out that some debug features (e.g. KFENCE, DEBUG_PAGEALLOC)
do manipulate ptes in softirq context, which triggered the warnings.
So let's take the simplest and safest route and disable the batching
optimization in interrupt contexts. This makes these users no worse off
than prior to the optimization. Additionally the known offenders are
debug features that only manipulate a single PTE, so there is no
performance gain anyway.
There may be some obscure case of encrypted/decrypted DMA with
dma_free_coherent() called from an interrupt context, but again, this is
no worse off than prior to the commit.
Some options for supporting nesting were considered, but there is a
difficult to solve problem if any code manipulates ptes within interrupt
context but *outside of* a lazy mmu region. If this case exists, the
code would expect the updates to be immediate, but because the task
context may have already been in lazy mmu mode, the updates would be
deferred, which could cause incorrect behaviour. This problem is avoided
by always ensuring updates within interrupt context are immediate.
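A sketch of the guard, assuming the same TIF-flag scheme (illustrative):

  static inline void arch_enter_lazy_mmu_mode(void)
  {
          /*
           * Updates from interrupt context must remain immediate, so
           * never enter lazy mode there; barriers are then emitted
           * per-update, exactly as prior to the optimization.
           */
          if (in_interrupt())
                  return;

          set_thread_flag(TIF_LAZY_MMU);
  }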
Fixes: 5fdd05efa1cd ("arm64/mm: Batch barriers when updating kernel mappings")
Reported-by: syzbot+5c0d9392e042f41d45c5@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-arm-kernel/681f2a09.050a0220.f2294.0006.GAE@google.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250512102242.4156463-1-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Currently, when arm64 displays CPU information, every call to c_show()
assembles the information for all CPUs. As the number of CPUs increases,
a single call can run out of buffer space, causing the seq_file buffer
to be repeatedly expanded and c_show() to be called multiple times. To
avoid these wasted c_show() calls, assemble only one CPU's information
per call.
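The seq_file iterator interface supports this directly; a sketch of the
per-CPU shape (illustrative, not the exact arm64 code):

  static void *c_start(struct seq_file *m, loff_t *pos)
  {
          *pos = cpumask_next(*pos - 1, cpu_online_mask);
          return *pos < nr_cpu_ids ? (void *)(unsigned long)(*pos + 1) : NULL;
  }

  static int c_show(struct seq_file *m, void *v)
  {
          unsigned long cpu = (unsigned long)v - 1;

          /* Assemble fields for this single CPU only; seq_file calls
             ->next()/->show() again for the remaining CPUs. */
          seq_printf(m, "processor\t: %lu\n", cpu);
          return 0;
  }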
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20250421062947.4072855-1-yebin@huaweicloud.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Because the kernel can't tolerate page faults for kernel mappings, when
setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
dsb(ishst) to ensure that the store to the pgtable is observed by the
table walker immediately. Additionally it emits an isb() to ensure that
any already speculatively determined invalid mapping fault gets
canceled.
We can improve the performance of vmalloc operations by batching these
barriers until the end of a set of entry updates.
arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() provide the
required hooks.
vmalloc improves by up to 30% as a result.
Two new TIF_ flags are created; TIF_LAZY_MMU tells us if the task is in
the lazy mode and can therefore defer any barriers until exit from the
lazy mode. TIF_LAZY_MMU_PENDING is used to remember if any pte operation
was performed while in the lazy mode that required barriers. Then when
leaving lazy mode, if that flag is set, we emit the barriers.
Since arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() are used
for both user and kernel mappings, we need the second flag to avoid
emitting barriers unnecessarily if only user mappings were updated.
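A sketch of the deferral on the set_pte path (illustrative; assumes a
helper emit_pte_barriers() issuing dsb(ishst) plus isb()):

  static inline void __set_pte(pte_t *ptep, pte_t pte)
  {
          __set_pte_nosync(ptep, pte);

          /* Only valid kernel mappings need the barriers. */
          if (pte_valid_not_user(pte)) {
                  if (test_thread_flag(TIF_LAZY_MMU))
                          set_thread_flag(TIF_LAZY_MMU_PENDING);
                  else
                          emit_pte_barriers();
          }
  }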
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-12-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Wrap vmalloc's pte table manipulation loops with
arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). This provides
the arch code with the opportunity to optimize the pte manipulations.
Note that vmap_pfn() already uses lazy mmu mode since it delegates to
apply_to_page_range() which enters lazy mmu mode for both user and
kernel mappings.
These hooks will shortly be used by arm64 to improve vmalloc
performance.
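A sketch of the wrapping in vmap_pte_range() (simplified; the real
function also tracks pgtbl_mod_mask and the mapped page accounting):

  static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
                            unsigned long end, phys_addr_t phys_addr,
                            pgprot_t prot)
  {
          pte_t *pte = pte_offset_kernel(pmd, addr);

          arch_enter_lazy_mmu_mode();
          do {
                  set_pte_at(&init_mm, addr, pte,
                             pfn_pte(PHYS_PFN(phys_addr), prot));
                  phys_addr += PAGE_SIZE;
          } while (pte++, addr += PAGE_SIZE, addr != end);
          arch_leave_lazy_mmu_mode();

          return 0;
  }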
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-11-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Implement the required arch functions to enable use of contpte in the
vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
operations due to only having to issue a DSB and ISB per contpte block
instead of per pte. But it also means that the TLB pressure reduces due
to only needing a single TLB entry for the whole contpte block.
Since vmap uses set_huge_pte_at() to set the contpte, that API is now
used for kernel mappings for the first time. Although in the vmap case
we never expect it to be called to modify a valid mapping, and so
clear_flush() should never be invoked, it's still wise to make it robust
for the kernel case. So amend the tlb flush function to handle the case
where the mm is for kernel space.
Tested with vmalloc performance selftests:
# kself/mm/test_vmalloc.sh \
        run_test_mask=1 \
        test_repeat_count=5 \
        nr_pages=256 \
        test_loop_count=100000 \
        use_huge=1
Duration reduced from 1274243 usec to 1083553 usec on Apple M2, for a
15% reduction in time taken.
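A sketch of the arch opt-in that lets vmap pick a contpte-sized mapping
at pte level (illustrative; arm64's real hook lives in
arch/arm64/include/asm/vmalloc.h):

  static inline unsigned long arch_vmap_pte_supported_shift(unsigned long size)
  {
          /* A CONT_PTE_SIZE-or-larger allocation can use a contpte
             block: one TLB entry and one DSB/ISB per block. */
          if (size >= CONT_PTE_SIZE)
                  return CONT_PTE_SHIFT;

          return PAGE_SHIFT;
  }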
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-10-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Commit f7ee1f13d606 ("mm/vmalloc: enable mapping of huge pages at pte
level in vmap") added its support by reusing the set_huge_pte_at() API,
which is otherwise only used for user mappings. But when unmapping those
huge ptes, it continued to call ptep_get_and_clear(), which is a
layering violation. To date, the only arch to implement this support is
powerpc and it all happens to work ok for it.
But arm64's implementation of ptep_get_and_clear() can not be safely
used to clear a previous set_huge_pte_at(). So let's introduce a new
arch opt-in function, arch_vmap_pte_range_unmap_size(), which can
provide the size of a (present) pte. Then we can call
huge_ptep_get_and_clear() to tear it down properly.
Note that if vunmap_range() is called with a range that starts in the
middle of a huge pte-mapped page, we must unmap the entire huge page so
the behaviour is consistent with pmd and pud block mappings. In this
case emit a warning just like we do for pmd/pud mappings.
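A sketch of how the unmap path can consume the new hook (simplified;
the real code also emits the warning mentioned above):

  static void vunmap_pte_range(pmd_t *pmd, unsigned long addr,
                               unsigned long end)
  {
          pte_t *pte = pte_offset_kernel(pmd, addr);

          do {
                  unsigned long size = arch_vmap_pte_range_unmap_size(addr, pte);

                  if (size != PAGE_SIZE) {
                          /* Tear the whole huge pte down properly. */
                          huge_ptep_get_and_clear(&init_mm, addr, pte, size);
                          pte += (size >> PAGE_SHIFT) - 1;
                          addr += size - PAGE_SIZE;
                  } else {
                          ptep_get_and_clear(&init_mm, addr, pte);
                  }
          } while (pte++, addr += PAGE_SIZE, addr != end);
  }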
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-9-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
pud level. But it is possible to subsequently call vunmap_range() on a
sub-range of the mapped memory, which partially overlaps a pmd or pud.
In this case, vmalloc unmaps the entire pmd or pud so that the
non-overlapping portion is also unmapped. Clearly that would have a bad
outcome, but it's not something that any callers do today as far as I
can tell. So I guess it's just expected that callers will not do this.
However, it would be useful to know if this happened in future; let's
add a warning to cover the eventuality.
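A sketch of the check inside vunmap_pmd_range()'s walk (the pud level
is analogous; simplified):

  next = pmd_addr_end(addr, end);
  if (pmd_leaf(*pmd)) {
          /* Unmapping a sub-range of a pmd block mapping clears the
             whole pmd, so warn if the range only partially covers it. */
          WARN_ON_ONCE(!IS_ALIGNED(addr, PMD_SIZE) ||
                       !IS_ALIGNED(next, PMD_SIZE));
          pmd_clear(pmd);
          continue;
  }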
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-8-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
set_ptes_anysz() previously called __set_pte() for each PTE in the
range, which would conditionally issue a DSB and ISB to make the new PTE
value immediately visible to the table walker if the new PTE was valid
and for kernel space.
We can do better than this; let's hoist those barriers out of the loop
so that they are only issued once at the end of the loop. We then reduce
the cost by the number of PTEs in the range.
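A sketch of the hoisting (illustrative; page_table_check hooks elided):

  static void __set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
                               unsigned int nr, unsigned long pgsize)
  {
          unsigned long stride = pgsize >> PAGE_SHIFT;

          for (;;) {
                  __set_pte_nosync(ptep, pte);    /* no dsb/isb per entry */
                  if (--nr == 0)
                          break;
                  ptep++;
                  pte = pte_advance_pfn(pte, stride);
          }

          /* One dsb(ishst)+isb for the whole range, only when needed. */
          if (pte_valid_not_user(pte))
                  emit_pte_barriers();
  }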
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-7-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Refactor the huge_pte helpers to use the new common __set_ptes_anysz()
and __ptep_get_and_clear_anysz() APIs.
This provides two benefits. First, when page_table_check=on, hugetlb is
now properly/fully checked; previously only the first page of a hugetlb
folio was checked. Second, instead of having to call __set_ptes(nr=1)
for each pte in a loop, the whole contiguous batch can now be set in one
go, which enables some efficiencies and cleans up the code.
One detail to note is that huge_ptep_clear_flush() was previously
calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
block mapping). This has a couple of disadvantages. First,
ptep_clear_flush() calls ptep_get_and_clear(), which transparently
handles contpte; given we only call it for non-contiguous ptes, this
would be safe but a waste of effort, and it's preferable to go straight
to the layer below. More problematic is that ptep_get_and_clear() is for
PAGE_SIZE entries, so it calls page_table_check_pte_clear() and would
not clear the whole hugetlb folio. So let's stop special-casing the
non-cont case and just rely on get_clear_contig_flush() to do the right
thing for non-cont entries.
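With the common helpers in place, setting a present contiguous huge_pte
reduces to something like this sketch:

  size_t pgsize;
  int ncontig = num_contig_ptes(sz, &pgsize);

  /* Whole contiguous batch in one call, rather than __set_ptes(nr=1)
     per pte in a loop. */
  if (pte_present(pte) && ncontig > 1)
          pte = pte_mkcont(pte);
  __set_ptes_anysz(mm, ptep, pte, ncontig, pgsize);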
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-6-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
all a thin wrapper around a new common __set_ptes_anysz(), which takes
pgsize parameter. Additionally, refactor __ptep_get_and_clear() and
pmdp_huge_get_and_clear() to use a new common
__ptep_get_and_clear_anysz() which also takes a pgsize parameter.
These changes will permit the huge_pte API to efficiently batch-set
pgtable entries and take advantage of the future barrier optimizations.
Additionally since the new *_anysz() helpers call the correct
page_table_check_*_set() API based on pgsize, this means that huge_ptes
will be able to get proper coverage. Currently the huge_pte API always
uses the pte API which assumes an entry only covers a single page.
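A sketch of the resulting thin wrappers (illustrative):

  static inline void __set_ptes(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep, pte_t pte, unsigned int nr)
  {
          __set_ptes_anysz(mm, ptep, pte, nr, PAGE_SIZE);
  }

  static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
                                pmd_t *pmdp, pmd_t pmd)
  {
          __set_ptes_anysz(mm, (pte_t *)pmdp, pmd_pte(pmd), 1, PMD_SIZE);
  }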
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-5-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Convert page_table_check_p[mu]d_set(...) to
page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
of pmds/puds in single batch. We retain page_table_check_p[mu]d_set(...)
as macros that call new batch functions with nr=1 for compatibility.
arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
and to allow the implementation for huge_pte to more efficiently set
ptes/pmds/puds in batches. We need these batch-helpers to make the
refactoring possible.
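For example, the pmd variant becomes (sketch):

  void page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp,
                                 pmd_t pmd, unsigned int nr);

  /* Compatibility: existing callers keep working with nr=1. */
  #define page_table_check_pmd_set(mm, pmdp, pmd) \
          page_table_check_pmds_set(mm, pmdp, pmd, 1)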
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-4-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When operating on contiguous blocks of ptes (or pmds) for some hugetlb
sizes, we must honour break-before-make requirements and clear down the
block to invalid state in the pgtable then invalidate the relevant tlb
entries before making the pgtable entries valid again.
However, the tlb maintenance is currently always done assuming the worst
case stride (PAGE_SIZE), last_level (false) and tlb_level
(TLBI_TTL_UNKNOWN). We can do much better with the hinting: in reality,
we know the stride from the huge_pte pgsize, we are always operating
only on the last level, and we always know the tlb_level, again based on
pgsize. So let's start providing these hints.
Additionally, avoid tlb maintenance in set_huge_pte_at().
Break-before-make is only required if we are transitioning the
contiguous pte block from valid -> valid. So let's elide the
clear-and-flush ("break") if the pte range was previously invalid.
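A sketch of the hinted flush (illustrative; tlb_level 2 corresponds to
pmd-level entries and 3 to pte-level):

  static void clear_flush(struct mm_struct *mm, unsigned long addr,
                          pte_t *ptep, unsigned long pgsize,
                          unsigned long ncontig)
  {
          struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
          unsigned long i, saddr = addr;
          int tlb_level = (pgsize == PMD_SIZE) ? 2 : 3;

          for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
                  __ptep_get_and_clear(mm, addr, ptep);

          /* Known stride and level; always a last-level flush. */
          __flush_tlb_range(&vma, saddr, addr, pgsize, true, tlb_level);
  }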
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-3-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Not all huge_pte helper APIs explicitly provide the size of the
huge_pte. So the helpers have to depend on various methods to determine
the size of the huge_pte. Some of these methods are dubious.
Let's clean up the code to use preferred methods and retire the dubious
ones. The options in order of preference:
- If size is provided as parameter, use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If vma is provided as a parameter, retrieve size via
huge_page_size(hstate_vma(vma)) and use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If the pte is present and contiguous, use find_num_contig() to walk
the pgtable to find the level and infer the number of ptes from
level. Only works for *present* ptes.
- If the pte is present and not contiguous, infer that only 1 pte needs
to be operated on. This is ok if you don't care about the absolute
size, and just want to know the number of ptes.
- NEVER rely on resolving the PFN of a present pte to a folio and
getting the folio's size. This is fragile at best, because there is
nothing to stop the core-mm from allocating a folio twice as big as
the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
partially mapping it.
Where we require that the pte is present, add warnings if it is not.
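For the preferred option, the pattern is simply (sketch):

  /* Explicit size from the caller: works for present and non-present
     ptes alike. */
  size_t pgsize;
  int ncontig = num_contig_ptes(sz, &pgsize);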
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The Amlogic DDR PMU driver's meson_ddr_pmu_create() function incorrectly
uses smp_processor_id(), which assumes preemption is disabled. This
leads to kernel warnings during module loading because
meson_ddr_pmu_create() can be called in a preemptible context.
Resulting kernel warning and stack trace:
[ 31.745138] [ T2289] BUG: using smp_processor_id() in preemptible [00000000] code: (udev-worker)/2289
[ 31.745154] [ T2289] caller is debug_smp_processor_id+0x28/0x38
[ 31.745172] [ T2289] CPU: 4 UID: 0 PID: 2289 Comm: (udev-worker) Tainted: GW 6.14.0-0-MANJARO-ARM #1 59519addcbca6ba8de735e151fd7b9e97aac7ff0
[ 31.745181] [ T2289] Tainted: [W]=WARN
[ 31.745183] [ T2289] Hardware name: Hardkernel ODROID-N2Plus (DT)
[ 31.745188] [ T2289] Call trace:
[ 31.745191] [ T2289] show_stack+0x28/0x40 (C)
[ 31.745199] [ T2289] dump_stack_lvl+0x4c/0x198
[ 31.745205] [ T2289] dump_stack+0x20/0x50
[ 31.745209] [ T2289] check_preemption_disabled+0xec/0xf0
[ 31.745213] [ T2289] debug_smp_processor_id+0x28/0x38
[ 31.745216] [ T2289] meson_ddr_pmu_create+0x200/0x560 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745237] [ T2289] g12_ddr_pmu_probe+0x20/0x38 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745246] [ T2289] platform_probe+0x98/0xe0
[ 31.745254] [ T2289] really_probe+0x144/0x3f8
[ 31.745258] [ T2289] __driver_probe_device+0xb8/0x180
[ 31.745261] [ T2289] driver_probe_device+0x54/0x268
[ 31.745264] [ T2289] __driver_attach+0x11c/0x288
[ 31.745267] [ T2289] bus_for_each_dev+0xfc/0x160
[ 31.745274] [ T2289] driver_attach+0x34/0x50
[ 31.745277] [ T2289] bus_add_driver+0x160/0x2b0
[ 31.745281] [ T2289] driver_register+0x78/0x120
[ 31.745285] [ T2289] __platform_driver_register+0x30/0x48
[ 31.745288] [ T2289] init_module+0x30/0xfe0 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745298] [ T2289] do_one_initcall+0x11c/0x438
[ 31.745303] [ T2289] do_init_module+0x68/0x228
[ 31.745311] [ T2289] load_module+0x118c/0x13a8
[ 31.745315] [ T2289] __arm64_sys_finit_module+0x274/0x390
[ 31.745320] [ T2289] invoke_syscall+0x74/0x108
[ 31.745326] [ T2289] el0_svc_common+0x90/0xf8
[ 31.745330] [ T2289] do_el0_svc+0x2c/0x48
[ 31.745333] [ T2289] el0_svc+0x60/0x150
[ 31.745337] [ T2289] el0t_64_sync_handler+0x80/0x118
[ 31.745341] [ T2289] el0t_64_sync+0x1b8/0x1c0
Replace smp_processor_id() with raw_smp_processor_id() to ensure safe
CPU ID retrieval in preemptible contexts.
Cc: Jiucheng Xu <jiucheng.xu@amlogic.com>
Fixes: 2016e2113d35 ("perf/amlogic: Add support for Amlogic meson G12 SoC DDR PMU driver")
Signed-off-by: Anand Moon <linux.amoon@gmail.com>
Link: https://lore.kernel.org/r/20250407063206.5211-1-linux.amoon@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Somehow the encodings for REQ2/SNP2 channels in XP events
got mixed up... Unmix them.
CC: stable@vger.kernel.org
Fixes: 23760a014417 ("perf/arm-cmn: Add CMN-700 support")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/087023e9737ac93d7ec7a841da904758c254cb01.1746717400.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In order to fix an ABI problem, we recently changed the way that reads
of the NT_ARM_SVE and NT_ARM_SSVE regsets behave when their
corresponding vector state is inactive.
Update the fp-ptrace test for the new behaviour.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-25-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In order to fix an ABI problem, we recently changed the way that
changing the SVE/SME vector length affects PSTATE.SM. Historically,
changing the SME vector length would clear PSTATE.SM. Now, changing the
SME vector length preserves PSTATE.SM.
Update the fp-ptrace test for the new behaviour.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-24-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In order to fix an ABI problem, we recently changed the way that a
clone() syscall manipulates TPIDR2 and PSTATE.ZA. Historically the child
would inherit the parent's TPIDR2 value unless CLONE_SETTLS was set, and
now the child will inherit the parent's TPIDR2 value unless CLONE_VM is
set.
Update the tpidr2 test for the new behaviour.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-23-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The fp-ptrace test suite expects that FPMR is set to zero when PSTATE.SM
is changed via ptrace, but ptrace has never altered FPMR in this way,
and the test logic erroneously relies upon (and has concealed) a bug
where task_fpsimd_load() would unexpectedly and non-deterministically
clobber FPMR.
Using ptrace, FPMR can only be altered by writing to the NT_ARM_FPMR
regset. The value of PSTATE.SM can be altered by writing to the
NT_ARM_SVE or NT_ARM_SSVE regsets, and/or by changing the SME vector
length (when writing to the NT_ARM_SVE, NT_ARM_SSVE, or NT_ARM_ZA
regsets), but none of these writes will change the value of FPMR.
The task_fpsimd_load() bug was introduced with the initial FPMR support
in commit:
203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR")
The incorrect FPMR test code was introduced in commit:
7dbd26d0b22d ("kselftest/arm64: Add FPMR coverage to fp-ptrace")
Subsequently, the task_fpsimd_load() bug was fixed in commit:
e5fa85fce08b ("arm64/fpsimd: Don't corrupt FPMR when streaming mode changes")
... whereupon the fp-ptrace FPMR tests started failing reliably, e.g.
| # # Mismatch in saved FPMR: 915058000 != 0
| # not ok 25 SVE write, SVE 64->64, SME 64/0->64/1
Fix this by changing the test to expect that FPMR is *NOT* changed when
PSTATE.SM is changed via ptrace, matching the extant behaviour.
I've chosen to update the test code rather than modifying ptrace to zero
FPMR when PSTATE.SM changes. Not zeroing FPMR is simpler overall, and
allows the NT_ARM_FPMR regset to be handled independently from other
regsets, leaving less scope for error.
Fixes: 7dbd26d0b22d ("kselftest/arm64: Add FPMR coverage to fp-ptrace")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-22-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Now that the known issues with SME have been addressed, allow SME to be
selected.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-By: Luis Machado <luis.machado@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-21-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Within sve_set_common() we do not handle error conditions correctly:
* When writing to NT_ARM_SSVE, if sme_alloc() fails, the task will be
left with task->thread.sme_state==NULL, but TIF_SME will be set and
task->thread.fp_type==FP_STATE_SVE. This will result in a subsequent
null pointer dereference when the task's state is loaded or otherwise
manipulated.
* When writing to NT_ARM_SSVE, if sve_alloc() fails, the task will be
left with task->thread.sve_state==NULL, but TIF_SME will be set,
PSTATE.SM will be set, and task->thread.fp_type==FP_STATE_FPSIMD.
This is not a legitimate state, and can result in various problems,
including a subsequent null pointer dereference and/or the task
inheriting stale streaming mode register state the next time its state
is loaded into hardware.
* When writing to NT_ARM_SSVE, if the VL is changed but the resulting VL
differs from that in the header, the task will be left with TIF_SME
set, PSTATE.SM set, but task->thread.fp_type==FP_STATE_FPSIMD. This is
not a legitimate state, and can result in various problems as
described above.
Avoid these problems by allocating memory earlier, and by changing the
task's saved fp_type to FP_STATE_SVE before skipping register writes due
to a change of VL.
To make early returns simpler, I've moved the call to
fpsimd_flush_task_state() earlier. As the tracee's state has already
been saved, and the tracee is known to be blocked for the duration of
sve_set_common(), it doesn't matter whether this is called at the start
or the end.
For consistency I've moved the setting of TIF_SVE earlier. This will be
cleared when loading FPSIMD-only state, and so moving this has no
resulting functional change.
Note that we only allocate the memory for SVE state when SVE register
contents are provided, avoiding unnecessary memory allocations for tasks
which only use FPSIMD.
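A sketch of the reordered flow (illustrative; conditions simplified):

  /* Allocate everything up front, so failure leaves the task
     untouched. */
  if (type == ARM64_VEC_SME) {
          sme_alloc(target, false);
          if (!target->thread.sme_state)
                  return -ENOMEM;
  }

  if ((header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_SVE) {
          sve_alloc(target, false);
          if (!target->thread.sve_state)
                  return -ENOMEM;
  }

  /* Only now modify TIF_SME, fp_type, the vector length, etc. */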
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
Fixes: 5d0a8d2fba50 ("arm64/ptrace: Ensure that SME is set up for target when writing SSVE state")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-20-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When a task has PSTATE.SM==1, reads of NT_ARM_SSVE are required to
always present a header with SVE_PT_REGS_SVE, and register data in SVE
format. Reads of NT_ARM_SSVE must never present register data in FPSIMD
format. Within the kernel, we always expect streaming SVE data to be
stored in SVE format.
Currently a user can write to NT_ARM_SSVE with a header presenting
SVE_PT_REGS_FPSIMD rather than SVE_PT_REGS_SVE, placing the task's
FPSIMD/SVE data into an invalid state.
To fix this we can either:
(a) Forbid such writes.
(b) Accept such writes, and immediately convert data into SVE format.
Take the simple option and forbid such writes.
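A sketch of the check in sve_set_common() (illustrative):

  /* Writes to NT_ARM_SSVE must provide register data in SVE format. */
  if (type == ARM64_VEC_SME &&
      (header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD) {
          ret = -EINVAL;
          goto out;
  }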
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-19-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The SME ptrace ABI is written around the incorrect assumption that
SVE_PT_REGS_FPSIMD and SVE_PT_REGS_SVE are independent bit flags, where
it is possible for both to be clear. In reality they are different
values for bit 0 of the header flags, where SVE_PT_REGS_FPSIMD is 0 and
SVE_PT_REGS_SVE is 1. In cases where code was written expecting that
neither bit flag would be set, the value is equivalent to
SVE_PT_REGS_FPSIMD.
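For reference, the uapi definitions in
arch/arm64/include/uapi/asm/ptrace.h make the single-bit encoding
concrete:

  #define SVE_PT_REGS_MASK   (1 << 0)

  #define SVE_PT_REGS_FPSIMD 0
  #define SVE_PT_REGS_SVE    SVE_PT_REGS_MASK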
One consequence of this is that reads of the NT_ARM_SVE or NT_ARM_SSVE
regsets will erroneously present data from the other mode:
* When PSTATE.SM==1, reads of NT_ARM_SVE will present a header with
SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from streaming mode.
* When PSTATE.SM==0, reads of NT_ARM_SSVE will present a header with
SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from non-streaming mode.
The original intent was that no register data would be provided in these
cases, as described in commit:
e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Luckily, debuggers do not consume the bogus register data. Both GDB and
LLDB read the NT_ARM_SSVE regset before the NT_ARM_SVE regset, and
assume that when the NT_ARM_SSVE header presents SVE_PT_REGS_FPSIMD, it
is necessary to read register contents from the NT_ARM_SVE regset,
regardless of whether the NT_ARM_SSVE regset provided bogus register
data.
Fix the code to stop presenting register data from the inactive mode.
At the same time, make the manipulation of the flag clearer, and remove
the bogus comment from sve_set_common(). I've given this a quick spin
with GDB and LLDB, and both seem happy.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-18-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
As sve_init_header_from_task() consumes the saved value of PSTATE.SM and
the saved fp_type, both must be saved before the header is generated.
When generating a coredump for the current task, sve_get_common() calls
sve_init_header_from_task() before saving the task's state. Consequently
the header may be bogus, and the contents of the regset may be
misleading.
Fix this by saving the task's state before generating the header.
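A sketch of the fixed ordering in sve_get_common() (illustrative):

  /* Save the live state first, so the header reflects the current
     PSTATE.SM and fp_type. */
  if (target == current)
          fpsimd_preserve_current_state();

  sve_init_header_from_task(&header, target, type);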
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Fixes: b017a0cea627 ("arm64/ptrace: Use saved floating point state type to determine SVE layout")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-17-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Currently, vec_set_vector_length() can manipulate a task into an invalid
state as a result of a prctl/ptrace syscall which changes the SVE/SME
vector length, resulting in several problems:
(1) When changing the SVE vector length, if the task initially has
PSTATE.ZA==1, and sve_alloc() fails to allocate memory, the task
will be left with PSTATE.ZA==1 and sve_state==NULL. This is not a
legitimate state, and could result in a subsequent null pointer
dereference.
(2) When changing the SVE vector length, if the task initially has
PSTATE.SM==1, the task will be left with PSTATE.SM==1 and
fp_type==FP_STATE_FPSIMD. Streaming mode state always needs to be
saved in SVE format, so this is not a legitimate state.
Attempting to restore this state may cause a task to erroneously
inherit stale streaming mode predicate registers and FFR contents,
behaving non-deterministically and potentially leaking information
from another task.
While in this state, reads of the NT_ARM_SSVE regset will indicate
that the registers are not stored in SVE format. For the NT_ARM_SSVE
regset specifically, debuggers interpret this as meaning that
PSTATE.SM==0.
(3) When changing the SME vector length, if the task initially has
PSTATE.SM==1, the lower 128 bits of task's streaming mode vector
state will be migrated to non-streaming mode, rather than these bits
being zeroed as is usually the case for changes to PSTATE.SM.
To fix the first issue, we can eagerly allocate the new sve_state and
sme_state before modifying the task. This makes it possible to handle
memory allocation failure without modifying the task state at all, and
removes the need to clear TIF_SVE and TIF_SME.
To fix the second issue, we either need to clear PSTATE.SM or not change
the saved fp_type. Given we're going to eagerly allocate sve_state and
sme_state, the simplest option is to preserve PSTATE.SM and the saved
fp_type, and consistently truncate the SVE state. This ensures that the
task always stays in a valid state, and by virtue of not exiting
streaming mode, this also sidesteps the third issue.
I believe these changes should not be problematic for realistic usage:
* When the SVE/SME vector length is changed via prctl(), syscall entry
will have cleared PSTATE.SM. Unless the task's state has been
manipulated via ptrace after entry, the task will have PSTATE.SM==0.
* When the SVE/SME vector length is changed via a write to the
NT_ARM_SVE or NT_ARM_SSVE regsets, PSTATE.SM will be forced
immediately after the length change, and new vector state will be
copied from userspace.
* When the SME vector length is changed via a write to the NT_ARM_ZA
regset, the (S)SVE state is clobbered today, so anyone who cares about
the specific state would need to install this after writing to the
NT_ARM_ZA regset.
As we need to free the old SVE state while TIF_SVE may still be set, we
cannot use sve_free(), and using kfree() directly makes it clear that
the free pairs with the subsequent assignment. As this leaves sve_free()
unused, I've removed the existing sve_free() and renamed __sve_free() to
mirror sme_free().
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-16-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The SVE/SME vector lengths can be changed via prctl/ptrace syscalls.
Changes to the SVE/SME vector lengths are documented as preserving the
lower 128 bits of the Z registers (i.e. the bits shared with the FPSIMD
V registers). To ensure this, vec_set_vector_length() explicitly copies
register values from a task's saved SVE state to its saved FPSIMD state
when dropping the task to FPSIMD-only.
The logic for this was not updated when FPSIMD/SVE state tracking
was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type")
bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state")
8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Since the last commit above, a task's FPSIMD/SVE state may be stored in
FPSIMD format while TIF_SVE is set, and the stored SVE state is stale.
When vec_set_vector_length() encounters this case, it will erroneously
clobber the live FPSIMD state with stale SVE state by using
sve_to_fpsimd().
Fix this by using fpsimd_sync_from_effective_state() instead.
Related issues with streaming mode state will be addressed in subsequent
patches.
Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-15-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
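A sketch of the copy_thread() logic (illustrative; the helper used to
discard the child's ZA state is hypothetical):

  if (clone_flags & CLONE_VM) {
          /* Same address space: a stale pointer into the parent's
             stack must not be inherited, so force the "ZA off" state. */
          p->thread.tpidr2_el0 = 0;
          task_za_disable(p);     /* hypothetical: PSTATE.ZA := 0 */
  } else {
          /* fork(): the child gets its own copy of the address space,
             so inheriting PSTATE.ZA and TPIDR2_EL0 is safe. */
          p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
  }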
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Currently arch_dup_task_struct() doesn't handle cases where the parent
task has PSTATE.SM==1. Since syscall entry exits streaming mode, the
parent will usually have PSTATE.SM==0, but this can be changed by ptrace
after syscall entry. When this happens, arch_dup_task_struct() will
initialise the new task into an invalid state. The new task inherits the
parent's configuration of PSTATE.SM, but fp_type is set to
FP_STATE_FPSIMD, TIF_SVE and TIF_SME may be cleared, and both sve_state
and
sme_state may be set to NULL.
This can result in a variety of problems whenever the new task's state
is manipulated, including kernel NULL pointer dereferences and leaking
of streaming mode state between tasks.
When ptrace is not involved, the parent will have PSTATE.SM==0 as a
result of syscall entry, and the documentation in
Documentation/arch/arm64/sme.rst says:
| On process creation (eg, clone()) the newly created process will have
| PSTATE.SM cleared.
... so make this true by using task_smstop_sm() to exit streaming mode
in the child task, avoiding the problems above.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In arch_dup_task_struct() we try to ensure that the child task inherits
the FPSIMD state of its parent, but this depends on the parent task's
saved state being in FPSIMD format, which is not always the case.
Consequently the child task may inherit stale FPSIMD state in some
cases.
This can happen when the parent's state has been modified by ptrace
since syscall entry, as writes to the NT_ARM_SVE regset may save state
in SVE format. This has been possible since commit:
bc0ee4760364 ("arm64/sve: Core task context handling")
More recently it has been possible for a task's FPSIMD/SVE state to be
saved before lazy discarding was guaranteed to occur, in which case
preemption could cause the effective FPSIMD state to be saved in SVE
format non-deterministically. This has been possible since commit:
f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Fix this by saving the parent task's effective FPSIMD state into FPSIMD
format before copying the task_struct. As this requires modifying the
parent's fpsimd_state, we must save+flush the state to avoid racing with
concurrent manipulation.
Similar issues exist when the parent has streaming mode state, and will
be addressed by subsequent patches.
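A sketch of the fix (illustrative; the save+flush helper name is
assumed, not guaranteed to match the final code):

  int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  {
          /* Save the effective state in FPSIMD format and flush it, so
             the child cannot inherit stale SVE-format state and we
             cannot race with concurrent manipulation. */
          fpsimd_save_and_flush_current_state();

          *dst = *src;
          return 0;
  }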
Fixes: bc0ee4760364 ("arm64/sve: Core task context handling")
Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-12-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For historical reasons, arch_dup_task_struct() only calls
fpsimd_preserve_current_state() when current->mm is non-NULL, but this
is no longer necessary.
Historically TIF_FOREIGN_FPSTATE was only managed for user threads, and
was never set for kernel threads. At the time, various functions
attempted to avoid saving/restoring state for kernel threads by checking
task_struct::mm to try to distinguish user threads from kernel threads.
We added the current->mm check to arch_dup_task_struct() in commit:
6eb6c80187c5 ("arm64: kernel thread don't need to save fpsimd context.")
... where the intent was to avoid pointlessly saving state for kernel
threads, which never had live state (and the saved state would never be
consumed).
Subsequently we began setting TIF_FOREIGN_FPSTATE for kernel threads,
and removed most of the task_struct::mm checks in commit:
df3fb9682045 ("arm64: fpsimd: Eliminate task->mm checks")
... but we missed the check in arch_dup_task_struct(), which is similarly
redundant.
Remove the redundant check from arch_dup_task_struct().
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-11-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Historically the behaviour of setup_return() was nondeterministic,
depending on whether the task's FPSIMD/SVE/SME state happened to be live.
We fixed most of that in commit:
929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
... but we didn't decide on how clearing PSTATE.SM should behave, and left a
TODO comment to that effect.
Use the new task_smstop_sm() helper to make this behave as if an SMSTOP
instruction was used to exit streaming mode. This would have been the
most common behaviour prior to the commit above.
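A sketch of the resulting logic (simplified; the surrounding SME handling
in setup_return() is elided):
  /* Leave streaming mode as if userspace had executed SMSTOP SM */
  if (system_supports_sme() && thread_sm_enabled(&current->thread))
          task_smstop_sm(current);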
Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-10-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In a few places we want to transition a task from streaming mode to
non-streaming mode, e.g. signal delivery where we historically tried to
use an SMSTOP SM instruction.
Add a new helper to manipulate a task's state in the same way as an
SMSTOP SM instruction. I have not added a corresponding helper to
simulate the effects of SMSTART SM. Only ptrace transitions a task into
streaming mode, and ptrace has distinct semantics for such transitions.
Per ARM DDI 0487 L.a, section B1.4.6:
| RRSWFQ
| When the Effective value of PSTATE.SM is changed by any method from 0
| to 1, an entry to Streaming SVE mode is performed, and all implemented
| bits of Streaming SVE register state are set to zero.
| RKFRQZ
| When the Effective value of PSTATE.SM is changed by any method from 1
| to 0, an exit from Streaming SVE mode is performed, and in the
| newly-entered mode, all implemented bits of the SVE scalable vector
| registers, SVE predicate registers, and FFR, are set to zero.
Per ARM DDI 0487 L.a, section C5.2.9:
| On entry to or exit from Streaming SVE mode, FPMR is set to 0
Per ARM DDI 0487 L.a, section C5.2.10:
| On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC,
| IXC, IDC, QC} are set to 1 and the remaining bits are set to 0.
This means bits 0, 1, 2, 3, 4, 7, and 27 respectively, i.e. 0x0800009f
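Taken together, a sketch of the helper (simplified; it assumes the task's
state has already been saved and flushed):
  void task_smstop_sm(struct task_struct *task)
  {
          if (!thread_sm_enabled(&task->thread))
                  return;

          /* Zero the FPSIMD-format state, per the exit-from-SM rules */
          memset(&task->thread.uw.fpsimd_state, 0,
                 sizeof(task->thread.uw.fpsimd_state));
          task->thread.uw.fpsimd_state.fpsr = 0x0800009f;
          task->thread.fpmr = 0;

          task->thread.svcr &= ~SVCR_SM_MASK;     /* PSTATE.SM := 0 */
          task->thread.fp_type = FP_STATE_FPSIMD;
  }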
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-9-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In subsequent patches we'll need to determine the SVE/SME state size for
a given SVE VL and SME VL regardless of whether a task is currently
configured with those VLs. Split the sizing logic out of
sve_state_size() and sme_state_size() so that we don't need to open-code
this logic elsewhere.
At the same time, apply minor cleanups:
* Move sve_state_size() into fpsimd.h, matching the placement of
sme_state_size().
* Remove the feature checks from sve_state_size(). We only call
sve_state_size() when at least one of SVE and SME are supported, and
when either of the two is not supported, the task's corresponding
SVE/SME vector length will be zero.
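A sketch of the split (the double-underscore names are illustrative):
  /* Size of SVE-format state for the given VLs, independent of any task */
  static inline size_t __sve_state_size(unsigned int sve_vl, unsigned int sme_vl)
  {
          unsigned int vl = max(sve_vl, sme_vl);
          return SVE_SIG_REGS_SIZE(sve_vq_from_vl(vl));
  }

  static inline size_t sve_state_size(struct task_struct const *task)
  {
          /* An unsupported extension leaves the corresponding VL at zero */
          return __sve_state_size(task_get_sve_vl(task), task_get_sme_vl(task));
  }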
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-8-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The sve_sync_{to,from}_fpsimd*() functions are intended to
extract/insert the currently effective FPSIMD state of a task regardless
of whether the task's state is saved in FPSIMD format or SVE format.
Historically they were only used by ptrace, but sve_sync_to_fpsimd() is
now used more widely, and sve_sync_from_fpsimd_zeropad() may be used
more widely in future.
When FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type")
bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state")
8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... sve_sync_to_fpsimd() was updated to consider task->thread.fp_type
rather than the task's TIF_SVE and PSTATE.SM, but (apparently due to an
oversight) sve_sync_from_fpsimd_zeropad() was left as-is, leaving the
two inconsistent.
Due to this, sve_sync_from_fpsimd_zeropad() may copy state from
task->thread.uw.fpsimd_state into task->thread.sve_state when
task->thread.fp_type == FP_STATE_FPSIMD. This is redundant (but benign)
as task->thread.uw.fpsimd_state is the effective state that will be
restored, and task->thread.sve_state will not be consumed. For
consistency, and to avoid the redundant work, it is better for
sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type alone,
matching sve_sync_to_fpsimd().
The naming of both functions is somewhat unfortunate, as it is unclear
when and why they copy state. It would be better to describe them in
terms of the effective state.
Considering all of the above, clean this up:
* Adjust sve_sync_from_fpsimd_zeropad() to consider
task->thread.fp_type.
* Update comments to clarify the intended semantics/usage. I've removed
the description that task->thread.sve_state must have been allocated,
as this is only necessary when task->thread.fp_type == FP_STATE_SVE,
which itself implies that task->thread.sve_state must have been
allocated.
* Rename the functions to more clearly indicate when/why they copy
state:
- sve_sync_to_fpsimd() => fpsimd_sync_from_effective_state()
- sve_sync_from_fpsimd_zeropad() => fpsimd_sync_to_effective_state_zeropad()
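A sketch of the fixed zeropad path, keyed off fp_type (simplified;
__fpsimd_to_sve() as in fpsimd.c):
  void fpsimd_sync_to_effective_state_zeropad(struct task_struct *task)
  {
          unsigned int vq;
          void *sst = task->thread.sve_state;
          struct user_fpsimd_state const *fst = &task->thread.uw.fpsimd_state;

          /* FPSIMD-format saved state is already the effective state */
          if (task->thread.fp_type != FP_STATE_SVE)
                  return;

          vq = sve_vq_from_vl(thread_get_cur_vl(&task->thread));
          memset(sst, 0, SVE_SIG_REGS_SIZE(vq));
          __fpsimd_to_sve(sst, fst, vq); /* V-regs into the low 128 bits */
  }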
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-7-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Partial writes to the NT_ARM_SVE and NT_ARM_SSVE regsets using an SVE
payload are handled inconsistently and non-deterministically. A comment
within sve_set_common() indicates that we intended that a partial write
would preserve any effective FPSIMD/SVE state which was not overwritten,
but this has never worked consistently, and during syscalls the FPSIMD
vector state may be non-deterministically preserved and may be
erroneously migrated between streaming and non-streaming SVE modes.
The simplest fix is to handle a partial write by consistently zeroing
the remaining state. As detailed below I do not believe this will
adversely affect any real usage.
Neither GDB nor LLDB attempt partial writes to these regsets, and the
documentation (in Documentation/arch/arm64/sve.rst) has always indicated
that state preservation was not guaranteed, as it says:
| The effect of writing a partial, incomplete payload is unspecified.
When the logic was originally introduced in commit:
43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
... there were two potential behaviours, depending on TIF_SVE:
* When TIF_SVE was clear, all SVE state would be zeroed, excluding the
low 128 bits of vectors shared with FPSIMD, FPSR, and FPCR.
* When TIF_SVE was set, all SVE state would be zeroed, including the
low 128 bits of vectors shared with FPSIMD, but excluding FPSR and
FPCR.
Note that as writing to NT_ARM_SVE would set TIF_SVE, partial writes to
NT_ARM_SVE would not be idempotent, and if a first write preserved the
low 128 bits, a subsequent (potentially identical) partial write would
discard the low 128 bits.
When support for the NT_ARM_SSVE regset was added in commit:
e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
... the above behaviour was retained for writes to the NT_ARM_SVE
regset, though writes to the NT_ARM_SSVE would always zero the SVE
registers and would not inherit FPSIMD register state. This happened as
fpsimd_sync_to_sve() only copied the FPSIMD regs when TIF_SVE was clear
and PSTATE.SM==0.
Subsequently, when FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type")
bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state")
8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... there was no corresponding update to the ptrace code, nor to
fpsimd_sync_to_sve(), which still considers TIF_SVE and PSTATE.SM rather
than the saved fp_type. The saved state can be in the FPSIMD format
regardless of whether TIF_SVE is set or clear, and the saved type can
change non-deterministically during syscalls. Consequently a subsequent
partial write to the NT_ARM_SVE or NT_ARM_SSVE regsets may
non-deterministically preserve the FPSIMD state, and may migrate this
state between streaming and non-streaming modes.
Clean this up by never attempting to preserve ANY state when writing an
SVE payload to the NT_ARM_SVE/NT_ARM_SSVE regsets, zeroing all relevant
state including FPSR and FPCR. This simplifies the code, makes the
behaviour deterministic, and avoids migrating state between streaming
and non-streaming modes. As above, I do not believe this should
adversely affect existing userspace applications.
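A sketch of the resulting write path in sve_set_common() (simplified;
surrounding validation and SVCR handling elided):
  /*
   * Start from a clean slate before copying in the user's payload;
   * anything the write does not cover remains zero, including FPSR
   * and FPCR.
   */
  fpsimd_flush_task_state(target);
  memset(&target->thread.uw.fpsimd_state, 0,
         sizeof(target->thread.uw.fpsimd_state));
  memset(target->thread.sve_state, 0, sve_state_size(target));
  /* ... then copy the user payload in with user_regset_copyin() ... */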
At the same time, remove fpsimd_sync_to_sve(). It is no longer used,
doesn't do what its documentation implies, and gets in the way of other
cleanups and fixes.
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-6-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For historical reasons, restore_sve_fpsimd_context() has an open-coded
copy of the logic from read_fpsimd_context(), which is used to either
restore an FPSIMD-only context, or to merge FPSIMD state into an
SVE state when restoring an SVE+FPSIMD context. The logic is *almost*
identical.
Refactor the logic to avoid duplication and make this clearer.
This comes with two functional changes that I do not believe will be
problematic in practice:
* The user_fpsimd_state::size field will be checked in all restore paths
that consume user_fpsimd_state. The kernel always populates this
field when delivering a signal, and so this should contain the
expected value unless it has been corrupted.
* If a read of user_fpsimd_state fails, we will return early without
modifying TIF_SVE, the saved SVCR, or the saved fp_type. This will
leave the task in a consistent state, without potentially resurrecting
stale FPSIMD state. A read of user_fpsimd_state should never fail
unless the structure has been corrupted or the stack has been
unmapped.
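A sketch of the shared helper (simplified; per the note below it returns a
negative error code or zero):
  static int read_fpsimd_context(struct user_fpsimd_state *fpsimd,
                                 struct fpsimd_context __user *ctx)
  {
          u32 size;
          int err = 0;

          __get_user_error(size, &ctx->head.size, err);
          if (err)
                  return -EFAULT;
          if (size != sizeof(*ctx))
                  return -EINVAL;

          err = __copy_from_user(fpsimd->vregs, ctx->vregs,
                                 sizeof(fpsimd->vregs));
          __get_user_error(fpsimd->fpsr, &ctx->fpsr, err);
          __get_user_error(fpsimd->fpcr, &ctx->fpcr, err);
          return err ? -EFAULT : 0;
  }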
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-5-mark.rutland@arm.com
[will: Ensure read_fpsimd_context() returns negative error code or zero]
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Non-streaming SVE state may be preserved without an SVE payload, in
which case the SVE context only has a header with VL==0, and all state
can be restored from the FPSIMD context. Streaming SVE state is always
preserved with an SVE payload, where the SVE context header has VL!=0,
and the SVE_SIG_FLAG_SM flag is set.
The kernel never preserves an SVE context where SVE_SIG_FLAG_SM is set
without an SVE payload. However, restore_sve_fpsimd_context() doesn't
forbid restoring such a context, and will handle this case by clearing
PSTATE.SM and restoring the FPSIMD context into non-streaming mode,
which isn't consistent with the SVE_SIG_FLAG_SM flag.
Forbid this case, and mandate an SVE payload when the SVE_SIG_FLAG_SM
flag is set. This avoids an awkward ABI quirk and reduces the risk that
later rework to this code permits configuring a task with PSTATE.SM==1
and fp_type==FP_STATE_FPSIMD.
I've marked this as a fix given that we never intended to support this
case, and we don't want anyone to start relying upon the old behaviour
once we re-enable SME.
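A sketch of the check in restore_sve_fpsimd_context() (simplified; it
assumes flags has been read from the sve_context header):
  /*
   * A header-only record (no register payload) is only acceptable
   * for non-streaming state; SVE_SIG_FLAG_SM mandates a payload.
   */
  if (user->sve_size == sizeof(struct sve_context) &&
      (flags & SVE_SIG_FLAG_SM))
          return -EINVAL;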
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
On systems with SVE and/or SME, the kernel will always create SVE and
FPSIMD signal frames when delivering a signal, but a user can manipulate
signal frames such that a signal return only observes an FPSIMD signal
frame. When this happens, restore_fpsimd_context() will restore state
such that fp_type==FP_STATE_FPSIMD, but will leave PSTATE.SM as-is.
It is possible for a user to set PSTATE.SM between syscall entry and
execution of the sigreturn logic (e.g. via ptrace), and consequently the
sigreturn may result in the task having PSTATE.SM==1 and
fp_type==FP_STATE_FPSIMD.
For various reasons it is not legitimate for a task to be in a state
where PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Portions of the user
ABI are written with the requirement that streaming SVE state is always
presented in SVE format rather than FPSIMD format, and as there is no
mechanism to permit access to only the FPSIMD subset of streaming SVE
state, streaming SVE state must always be saved and restored in SVE
format.
Fix restore_fpsimd_context() to clear PSTATE.SM when restoring an FPSIMD
signal frame without an SVE signal frame. This matches the current
behaviour when an SVE signal frame is present, but the SVE signal frame
has no register payload (e.g. as is the case on SME-only systems which
lack SVE).
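A sketch of the fix at the end of restore_fpsimd_context() (simplified):
  /*
   * A bare FPSIMD frame implies non-streaming state: exit streaming
   * mode rather than leaving PSTATE.SM set alongside FPSIMD-format
   * state.
   */
  current->thread.svcr &= ~SVCR_SM_MASK;
  current->thread.fp_type = FP_STATE_FPSIMD;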
This change should have no effect for applications which do not alter
signal frames (i.e. almost all applications). I do not expect
non-{malicious,buggy} applications to hide the SVE signal frame, but
I've chosen to clear PSTATE.SM rather than mandating the presence of an
SVE signal frame in case there is some legacy (non-SME) usage that I am
not currently aware of.
For context, the SME handling was originally introduced in commit:
85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
... and subsequently updated/fixed to handle SME-only systems in commits:
7dde62f0687c ("arm64/signal: Always accept SVE signal frames on SME only systems")
f26cd7372160 ("arm64/signal: Always allocate SVE signal frames on SME only systems")
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|