| Age | Commit message | Author | Files | Lines |
|
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Fix ITS EventID sanitisation when restoring an interrupt
translation table.
- Fix PPI memory leak when failing to initialise a vcpu.
- Correctly return an error when the validation of a hypervisor trace
descriptor fails, and limit this validation to protected mode only.
RISC-V:
- Fix invalid HVA warning in steal-time recording
- Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and
pmu_snapshot_set_shmem()
- Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
- Fix sign extension of value for MMIO loads
s390:
- Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by
the page table rewrite.
x86:
- Apply erratum #1235 workaround (disable AVIC IPI virtualization) on
Hygon Family 18h, just like on AMD Family 17h.
- When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM,
return the VM's configured APIC bus frequency instead of the
default. This is less confusing (read: not wrong) and makes it
easier to fill in CPUID information that communicates the APIC bus
frequency to the guest.
Selftests:
- Do not include glibc-internal <bits/endian.h>; it worked by chance
and broke building KVM selftests with musl"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
KVM: selftests: Verify that KVM returns the configured APIC cycle length
KVM: x86: Return the VM's configured APIC bus frequency when queried
KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h>
KVM: s390: Properly reset zero bit in PGSTE
KVM: s390: vsie: Fix redundant rmap entries
KVM: s390: vsie: Fix unshadowing logic
KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors
KVM: s390: vsie: Fix memory leak when unshadowing
KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc
KVM: arm64: vgic: Free private_irqs when init fails after allocation
KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits
RISC-V: KVM: Fix sign extension for MMIO loads
RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM
riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM
RISC-V: KVM: Fix invalid HVA warning in steal-time recording
|
|
Pull x86 fixes from Ingo Molnar:
- On SEV guests, handle set_memory_{encrypted,decrypted}() failures
more conservatively by assuming that all affected pages are
unencrypted (Carlos López)
- Disable broadcast TLB flush when PCID is disabled (Tom Lendacky)
- Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra)
- Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a
KVM x2apic fix (Peter Zijlstra)
- Fix incorrect munmap() size on map_vdso() failure (Guilherme Giacomo
Simoes)
* tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sev-guest: Explicitly leak pages in unknown state
x86/mm: Disable broadcast TLB flush when PCID is disabled
x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()
x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core
x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
|
|
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and
share the same erratum #1235: hardware may read a stale IsRunning=1 bit
during ICR write emulation and silently fail to generate an
AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU.
The absence of the VM-Exit causes KVM to miss the required wakeup of
blocking target vCPUs, leading to hung vCPUs and unbounded delays in
guest execution.
Extend the existing AMD Family 17h erratum #1235 workaround to also cover
Hygon Family 18h. With IPI virtualization disabled, KVM never sets
IsRunning=1 in the Physical ID table, so every non-self IPI generates a
VM-Exit and is correctly emulated.
Fixes: 8de4a1c8164e ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235")
Cc: <stable@vger.kernel.org>
Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net>
Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
|
|
When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the
VM's configured APIC bus frequency, not KVM's default. Aside from the fact
that returning the default frequency is blatantly wrong if userspace has
changed the frequency, returning the configured frequency means userspace
can blindly trust the result, e.g. when filling PV CPUID information that
communicates the APIC bus frequency to the guest.
Fixes: 6fef518594bc ("KVM: x86: Add a capability to configure bus frequency for APIC timer")
Reported-by: David Woodhouse <dwmw2@infradead.org>
Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260522173526.3539407-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Booting with "nopcid" clears X86_FEATURE_PCID and keeps CR4.PCIDE from being
set to one. On AMD CPUs that support INVLPGB, broadcast TLB flushing remains
enabled.
There are two checks that decide whether the global ASID code runs,
mm_global_asid() and consider_global_asid(), that key off of the
X86_FEATURE_INVLPGB feature. Once an mm becomes active on more than three
CPUs, consider_global_asid() assigns it a global ASID, after which
flush_tlb_mm_range() takes the broadcast_tlb_flush() path using a non-zero
PCID. Issuing an INVLPGB with a non-zero PCID while CR4.PCIDE is not set
results in a #GP:
Oops: general protection fault, kernel NULL pointer dereference 0x1: 0000 [#1] SMP NOPTI
CPU: 158 UID: 0 PID: 3119 Comm: snap Not tainted 7.1.0-rc3 #1 PREEMPT(full)
Hardware name: ...
RIP: 0010:broadcast_tlb_flush
Code: ... 89 da 48 83 c8 07 <0f> 01 fe eb 08 cc cc cc ...
Call Trace:
<TASK>
flush_tlb_mm_range
ptep_clear_flush
wp_page_copy
? _raw_spin_unlock
__handle_mm_fault
handle_mm_fault
do_user_addr_fault
exc_page_fault
asm_exc_page_fault
All processors that support broadcast TLB invalidation also have PCID support,
so it is only the "nopcid" scenario that is of concern. In this situation just
disable the broadcast TLB support using the CPUID dependency support by making
X86_FEATURE_INVLPGB dependent on X86_FEATURE_PCID.
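The effect of such a dependency can be illustrated with a small standalone
model. This is only a sketch, not the kernel's CPUID dependency code: the
feature enum, the deps[] table and clear_feature() are hypothetical names
chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature identifiers, used only by this sketch. */
enum feature { FEAT_PCID, FEAT_INVLPGB, FEAT_COUNT };

/* Each entry says: 'feature' is only usable if 'depends_on' is present. */
struct dep {
	enum feature feature;
	enum feature depends_on;
};

static const struct dep deps[] = {
	{ FEAT_INVLPGB, FEAT_PCID },	/* models the dependency described above */
};

/* Clear a feature and, transitively, every feature that depends on it. */
static void clear_feature(bool present[FEAT_COUNT], enum feature f)
{
	present[f] = false;
	for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
		if (deps[i].depends_on == f && present[deps[i].feature])
			clear_feature(present, deps[i].feature);
	}
}
```

In this model, clearing FEAT_PCID (as "nopcid" does for the real feature bit)
also clears FEAT_INVLPGB, so code keyed off the latter never takes the
broadcast path.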
[ bp: Massage commit message. ]
Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes")
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/b915acfd63e8b2a094fdeb8dc608738072518764.1779296450.git.thomas.lendacky@amd.com
|
|
Vishal reported that KVM unit test 'x2apic' started failing after commit
0e98eb14814e ("entry: Prepare for deferred hrtimer rearming").
The reason is that KVM/VMX injects interrupts while it has interrupts
disabled, for a context that will enable interrupts. This means that
regs->flags.X86_EFLAGS_IF == 0 and irqentry_exit() will not do the right
thing.
Notably, irqentry_exit() must not call hrtimer_rearm_deferred() when the return
context does not have IF set, because this will cause problems vs NMIs.
Therefore, fix up the state after the injection.
Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Suggested-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260423155936.957351833@infradead.org
Closes: https://lore.kernel.org/r/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel%40intel.com
|
|
Move the VMX interrupt dispatch magic into the x86 core code. This
isolates KVM from the FRED/IDT decisions and reduces the amount of
EXPORT_SYMBOL_FOR_KVM().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
|
|
In map_vdso(), if a failure occurs during the installation of the VVAR
mappings, the error path attempts to clean up previously allocated mappings
using do_munmap(). However, the cleanup for the VVAR mapping is incorrectly
using image->size (the size of the vDSO text) instead of the actual size
allocated for the VVAR area.
Replace the incorrect do_munmap() image->size parameter with the constant
VDSO_NR_PAGES * PAGE_SIZE. Ensure the unmap size exactly matches the size
used during the vdso_install_vvar_mapping() phase to provide a symmetrical
and complete teardown of the memory region.
Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20260503191609.551817-1-trintaeoitogc@gmail.com
|
|
Pull x86 fix from Ingo Molnar:
- Fix x86 boot crash for non-kjump kexecs (David Woodhouse)
* tag 'x86-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Push kjump return address even for non-kjump kexec
|
|
Pull MCE fix from Ingo Molnar:
- Fix an MCE polling interval adjustment regression (Borislav Petkov)
* tag 'ras-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Restore MCA polling interval halving
|
|
Pull xen fixes from Juergen Gross:
- one simple cleanup
- a fix for a corner case when running as Xen PV dom0
- a fix of a regression for Xen PV guests, introduced in 7.0
* tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving
x86/xen: Fix xen_e820_swap_entry_with_ram()
xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h
|
|
Pull ACPI support fixes from Rafael Wysocki:
"These fix several platform drivers that use the ACPI companion of the
given platform device without checking its presence, which may lead to
a NULL pointer dereference or other kind of malfunction if the driver
is forced to match a device without an ACPI companion via driver
override, and restore debug log level for some messages in the ACPI
CPPC library:
- Check ACPI_COMPANION() against NULL during probe in several core
ACPI device drivers (Rafael Wysocki)
- Restore log level of messages in amd_set_max_freq_ratio() (Mario
Limonciello)"
* tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PAD: xen: Check ACPI_COMPANION() against NULL
ACPI: driver: Check ACPI_COMPANION() against NULL during probe
Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
|
|
Merge a revert of an ACPI CPPC commit that increased the log level of
some debug messages which turned out to be a bad idea:
- Restore log level of messages in amd_set_max_freq_ratio() (Mario
Limonciello)
* acpi-cppc:
Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
|
|
With the support of nested lazy mmu sections it can happen that
arch_enter_lazy_mmu_mode() is being called twice without a call of
arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers
are not disabling preemption when checking for nested lazy mmu
sections.
This is a problem when running as a Xen PV guest, as
xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this
case.
Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order
not to hurt all other lazy mmu mode users.
Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching")
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260508143933.493013-1-jgross@suse.com>
|
|
When swapping a non-page-aligned E820 map entry with RAM, the start
address of the modified entry is calculated incorrectly (the offset into
the page is subtracted instead of being added to the page address).
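The intended arithmetic can be sketched as follows. This is a standalone
illustration, not the code of xen_e820_swap_entry_with_ram(): the helper
name, its parameters and the 4 KiB page size are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size for this sketch */

/*
 * Hypothetical helper: compute the start address of an entry that is moved
 * to page frame 'new_pfn' while keeping its offset into the page.
 */
static uint64_t swapped_entry_start(uint64_t old_addr, uint64_t new_pfn)
{
	uint64_t offset = old_addr & (SKETCH_PAGE_SIZE - 1);

	/* The offset must be added; subtracting it is the bug described above. */
	return new_pfn * SKETCH_PAGE_SIZE + offset;
}
```

For a page-aligned entry the offset is zero, so adding and subtracting give
the same result, which is why only unaligned entries were affected.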
Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory")
Reported-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260505102417.208138-1-jgross@suse.com>
|
|
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a nonsensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solve the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
Documentation:
- Update stale links
Selftests:
- Fix guest_memfd_test with host page size > guest page size"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
|
|
RongQing reported that the MCA polling interval doesn't halve when an
error gets logged. It was traced down to the commit in Fixes:, because:
mce_timer_fn()
|-> mce_poll_banks()
|-> machine_check_poll()
|-> mce_log()
which will queue the work and return.
Now, back in mce_timer_fn():
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
*/
if (mce_notify_irq())
<--- here we haven't run the notifier chain yet, so mce_need_notify is
not set yet, so this won't hit and we won't halve the interval iv.
Now the notifier chain runs. mce_early_notifier() sets the bit, does
mce_notify_irq(), that clears the bit and then the notifier chain
a little later logs the error.
So this is a silly timing issue.
But, that's all unnecessary.
All that needs to happen here is that the "should we notify of a logged
MCE" question that mce_notify_irq() asks becomes a simple question to the
MCE gen pool:
"Are you empty?"
And that then turns into a simple yes or no answer and it all
JustWorks(tm).
So do that and also distribute the functionality where it belongs:
- Print that MCE events have been logged in mce_log()
- Trigger the mcelog tool specific work in the first notifier
As a result, mce_notify_irq() can go now.
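The resulting decision can be sketched as a standalone function. This is not
the code of mce_timer_fn(): the function name, the doubling on the quiet
path and the bounds are assumptions made for illustration; only the "halve
when something was logged" behaviour comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bounds for this sketch; the kernel derives its own limits. */
#define SKETCH_IV_MIN 1UL
#define SKETCH_IV_MAX 300UL

/*
 * Adjust the polling interval based on whether the pool of logged events is
 * empty: halve it if something was logged, otherwise double it (assumed).
 */
static unsigned long adjust_poll_interval(unsigned long iv, bool pool_empty)
{
	if (!pool_empty)
		iv = (iv / 2 > SKETCH_IV_MIN) ? iv / 2 : SKETCH_IV_MIN;
	else
		iv = (iv * 2 < SKETCH_IV_MAX) ? iv * 2 : SKETCH_IV_MAX;

	return iv;
}
```

Because the answer depends only on the pool state at the time of the call,
it no longer matters whether the notifier chain has already run.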
Fixes: 011d82611172 ("RAS: Add a Corrected Errors Collector")
Reported-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20260112082747.2842-1-lirongqing@baidu.com
|
|
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.
Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley@proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.
Per the SDM:
MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.
MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.
Reported-by: Josh Eads <josheads@google.com>
Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty. In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.
Based-on: <20260503201029.106481-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.
The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock. Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.
While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain. And even if L2 vCPUs are in the same
VM, KVM has no visibility into L2 vCPUs that are scheduled out by the
L1 hypervisor.
Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept. As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.
Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.
And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.
The only scenarios where any of this actually works are those where at least
one of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.
Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.
Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of
__entry->a5.
That overwrites the recorded a4 argument and leaves a5 unset in the
trace entry. Fix the typo so both arguments are captured correctly.
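The shape of the bug and of the fix can be shown with a standalone sketch.
The struct and function below are hypothetical stand-ins for the tracepoint's
entry and assignment code, not the TRACE_EVENT() definition itself.

```c
#include <assert.h>

/* Hypothetical stand-in for the trace entry's argument fields. */
struct hcall_entry_sketch {
	unsigned long a0, a1, a2, a3, a4, a5;
};

/* Copy six hypercall arguments into the entry, one field per argument. */
static void record_hcall_args(struct hcall_entry_sketch *e,
			      const unsigned long args[6])
{
	e->a0 = args[0];
	e->a1 = args[1];
	e->a2 = args[2];
	e->a3 = args[3];
	e->a4 = args[4];
	e->a5 = args[5];	/* the typo stored this value in e->a4 instead */
}
```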
Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Make sure resources are not improperly shared in the op cache and
cause instruction corruption this way.
Signed-off-by: Prathyushi Nangia <prathyushi.nangia@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull x86 fixes from Ingo Molnar:
- Fix memory map enumeration bug in the Xen e820 parsing code (Juergen
Gross)
- Re-enable e820 BIOS fallback if e820 table is empty (David Gow)
* tag 'x86-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/e820: Re-enable BIOS fallback if e820 table is empty
x86/xen: Fix a potential problem in xen_e820_resolve_conflicts()
|
|
Pull perf events fixes from Ingo Molnar:
- Fix deadlock in the perf_mmap() failure path (Peter Zijlstra)
- Intel ACR (Auto Counter Reload) fixes (Dapeng Mi):
- Fix validation and configuration of ACR masks
- Fix ACR rescheduling bug causing stale masks
- Disable the PMI on ACR-enabled hardware
- Enable ACR on Panther Cover uarch too
* tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Enable auto counter reload for DMR
perf/x86/intel: Disable PMI for self-reloaded ACR events
perf/x86/intel: Always reprogram ACR events to prevent stale masks
perf/x86/intel: Improve validation and configuration of ACR masks
perf/core: Fix deadlock in perf_mmap() failure path
|
|
Some older systems don't support CPPC in the firmware and this just makes
noise for them when booting. Drop back to debug.
This reverts commit 21fb59ab4b9767085f4fe1edbdbe3177fbb9ec97.
Fixes: 21fb59ab4b976 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn")
Suggested-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Cc: All applicable <stable@vger.kernel.org>
Link: https://patch.msgid.link/20260504230141.484743-2-mario.limonciello@amd.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The version of purgatory code shipped by kexec-tools attempts to look above
the top of its stack to find a return address for a kjump, even in a non-kjump
kexec.
After the commit in Fixes: the word above the stack might not be there,
leading to a fault (which is at least now caught by my exception-handling code
in kexec).
That commit fixed things for the actual kjump path, but no longer
"gratuitously" pushes the unused return address to the stack in the non-kjump
path. Put that *back* in the non-kjump path, to prevent purgatory from
crashing when trying to access it.
Fixes: 2cacf7f23a02 ("x86/kexec: Fix stack and handling of re-entry point for ::preserve_context")
Reported-by: Rohan Kakulawaram <rohanka@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Rohan Kakulawaram <rohanka@google.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/32d627134143ffd957891cb697138e839c623211.camel@infradead.org
|
|
In commit:
157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables")
the check on the number of entries in the e820 table was removed. The intention
was to support single-entry maps, but by removing the check entirely, we also
skip the fallback (to, e.g., the BIOS 88h function).
This means that if no E820 map is passed in from the bootloader (which is the
case on some bootloaders, like linld), we end up with an empty memory map, and
the kernel fails to boot (either by deadlocking on OOM, or by failing to
allocate the real mode trampoline, or similar).
Re-instate the check in append_e820_table(), but only check that nr_entries is
non-zero. This allows e820__memory_setup_default() to fall back to other memory
size sources, and doesn't affect e820__memory_setup_extended(), as the latter
ignores the return value from append_e820_table().
In doing so, we also update the return values to be proper error codes, with
-ENOENT for this case (there are no entries), and -EINVAL for the case where an
entry appears invalid. Given none of the callers check the actual value -- just
whether it's nonzero -- this is largely aesthetic in practice.
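The return-value convention described above can be sketched as follows. This
is a simplified standalone model, not the kernel's append_e820_table(): the
struct, the function name and the rule for what counts as an invalid entry
(an address range that wraps around) are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory map entry. */
struct mem_entry_sketch {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/*
 * Return 0 on success, -ENOENT if the table has no entries (so the caller
 * can fall back to another memory size source), or -EINVAL if an entry
 * looks invalid.
 */
static int append_table_sketch(const struct mem_entry_sketch *entries,
			       uint32_t nr_entries)
{
	if (!nr_entries)
		return -ENOENT;

	for (uint32_t i = 0; i < nr_entries; i++) {
		uint64_t start = entries[i].addr;
		uint64_t size = entries[i].size;
		uint64_t end = start + size - 1;

		/* Assumed validity rule: a non-empty range must not wrap. */
		if (start > end && size)
			return -EINVAL;
		/* A real implementation would record the range here. */
	}

	return 0;
}
```

A caller that only tests for a nonzero result treats both error codes the
same way, which matches the note that the distinction is largely aesthetic.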
Tested against linld, and the kernel boots again fine.
[ mingo: Readability edits to the comment and the changelog. ]
Fixes: 157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables")
Signed-off-by: David Gow <david@davidgow.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: stable@vger.kernel.org
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20260416065746.1896647-1-david@davidgow.net
|
|
Pull EFI fixes from Ard Biesheuvel:
- Fix issues in EFI graceful recovery on x86 introduced by changes to
the kernel mode FPU APIs
- I-cache coherency fixes for the LoongArch EFI stub
- Locking fix for EFI pstore
- Code tweak for efivarfs
* tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: Restore IRQ state in EFI page fault handler
x86/efi: Fix graceful fault handling after FPU softirq changes
efi/libstub: Synchronize instruction cache after kernel relocation
efi/loongarch: Implement efi_cache_sync_image()
efi/libstub: Move efi_relocate_kernel() into its only remaining user
efi: pstore: Drop efivar lock when efi_pstore_open() returns with an error
efivarfs: use QSTR() in efivarfs_alloc_dentry
|
|
Panther Cove µarch starts to support auto counter reload (ACR), but the
static_call intel_pmu_enable_acr_event() is not updated for the Panther
Cove µarch used by DMR. As a result, auto counter reload is not actually
enabled on DMR.
Update static_call intel_pmu_enable_acr_event() in intel_pmu_init_pnc().
Fixes: d345b6bb8860 ("perf/x86/intel: Add core PMU support for DMR")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-5-dapeng1.mi@linux.intel.com
|
|
On platforms with Auto Counter Reload (ACR) support, such as NVL, a
"NMI received for unknown reason 30" warning is observed when running
multiple events in a group with ACR enabled:
$ perf record -e '{instructions/period=20000,acr_mask=0x2/u,\
cycles/period=40000,acr_mask=0x3/u}' ./test
The warning occurs because the Performance Monitoring Interrupt (PMI)
is enabled for the self-reloaded event (the cycles event in this case).
According to the Intel SDM, the overflow bit
(IA32_PERF_GLOBAL_STATUS.PMCn_OVF) is never set for self-reloaded events.
Since the bit is not set, the perf NMI handler cannot identify the source
of the interrupt, leading to the "unknown reason" message.
Furthermore, enabling PMI for self-reloaded events is unnecessary and
can lead to extraneous records that pollute the user's requested data.
Disable the interrupt bit for all events configured with ACR self-reload.
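A minimal sketch of that step is shown below. It is not the perf driver's
code: the function name and parameters are hypothetical, and it assumes that
"self-reloaded" means the event's ACR mask includes the event's own counter
index (as with the cycles event and acr_mask=0x3 in the example above). The
interrupt-enable bit is bit 20 of the event select register.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupt-enable (INT) bit of an event select register, bit 20. */
#define SKETCH_EVENTSEL_INT (1ULL << 20)

/*
 * Hypothetical helper: return the event select value to program for an
 * event at counter index 'own_idx' whose hardware ACR mask is 'acr_mask'.
 * If the event reloads itself, the PMI is disabled for it.
 */
static uint64_t acr_event_config(uint64_t config, uint64_t acr_mask,
				 unsigned int own_idx)
{
	if (acr_mask & (1ULL << own_idx))
		config &= ~SKETCH_EVENTSEL_INT;

	return config;
}
```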
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-4-dapeng1.mi@linux.intel.com
|
|
Members of an ACR group are logically linked via a bitmask of their
hardware counter indices. If some members of the group are assigned new
hardware counters during rescheduling, even events that keep their
original counter index must be updated with a new mask.
Without this, an event will continue to use a stale acr_mask that
references the old indices of its group peers. Ensure all ACR events are
reprogrammed during the scheduling path to maintain consistency across
the group.
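Why an event that keeps its own counter still needs a new mask can be seen in
a small standalone sketch. The function and its parameters are hypothetical
and do not mirror the perf driver's data structures; the sketch assumes the
user-space mask is expressed in group member indices and at most 64 members
and counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate a mask over group member indices into a
 * mask over the hardware counter indices currently assigned to the members.
 * counter_idx[i] is the hardware counter of group member i.
 */
static uint64_t hw_acr_mask(uint64_t user_mask, const unsigned int *counter_idx,
			    size_t nr_members)
{
	uint64_t hw = 0;

	for (size_t i = 0; i < nr_members; i++) {
		if (user_mask & (1ULL << i))
			hw |= 1ULL << counter_idx[i];
	}

	return hw;
}
```

With two members on counters {0, 1}, a user mask of 0x3 maps to 0x3. If the
second member is rescheduled onto counter 3, the same user mask maps to 0x9,
so the first member's hardware mask changes even though it stayed on
counter 0.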
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-3-dapeng1.mi@linux.intel.com
|
|
Currently there are several issues with the user-space ACR mask
validation and configuration.
- The validation of the user-space ACR mask (attr.config2) is incomplete,
e.g., the ACR mask can include an index that belongs to another ACR
event group, and this is not detected.
- An early return on an invalid ACR mask causes all subsequent ACR groups
to be skipped.
- The stale hardware ACR mask (hw.config1) is not cleared before setting
the new hardware ACR mask.
The following changes address all of the above issues.
- Figure out the event index group of an ACR group. Any bits in the
user-space mask not present in the index group are now dropped.
- Instead of an early return on invalid bits, drop only the invalid
portions and continue iterating through all ACR events to ensure full
configuration.
- Explicitly clear the stale hardware ACR mask for each event prior to
writing the new configuration.
In addition, a non-leader member of an ACR group could in theory be
disabled. This could cause bit-shifting errors in the acr_mask of the
remaining group members. Since ACR sampling requires all events to be
active, this should not be a major concern in real use cases. Add a
"FIXME" comment to note this risk.
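The three changes can be sketched together in Python. This is a
simplified model under stated assumptions, not the kernel code: events
are plain dicts, "index_group" is the list of event indices that belong
to one ACR group, and "counters" maps an event index to its hardware
counter. All of these names are hypothetical.

```python
def configure_acr_group(events, index_group, counters):
    """Sanitize user masks and rebuild hardware masks for one ACR group."""
    valid = 0
    for index in index_group:
        valid |= 1 << index

    for event in events:
        # Clear the stale hardware mask before writing the new one.
        event["hw_mask"] = 0
        # Drop bits that point outside this group instead of bailing out,
        # so that every event in the group still gets configured.
        sanitized = event["user_mask"] & valid
        for index in index_group:
            if sanitized & (1 << index):
                event["hw_mask"] |= 1 << counters[index]
    return events
```

An event whose user mask names an index outside the group simply loses
that bit; the remaining events are still processed.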
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-2-dapeng1.mi@linux.intel.com
|
|
When fixing a conflict in xen_e820_resolve_conflicts(), the loop over
the E820 map entries needs to be restarted, as the E820 map will have
been modified by the fix. Otherwise entries might be skipped by
accident.
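The effect of not restarting can be shown with a small Python model. It
is only a sketch: the map is a plain list, and the hypothetical
fix_conflict() simply deletes the conflicting entry, which is enough to
make later entries shift position.

```python
def fix_conflict(entries, i):
    # Hypothetical fix: drop the entry, so later entries shift down.
    del entries[i]


def resolve_without_restart(entries, is_conflict):
    i = 0
    while i < len(entries):
        if is_conflict(entries[i]):
            fix_conflict(entries, i)
        i += 1  # skips whichever entry moved into slot i
    return entries


def resolve_with_restart(entries, is_conflict):
    i = 0
    while i < len(entries):
        if is_conflict(entries[i]):
            fix_conflict(entries, i)
            i = 0  # the map changed: scan again from the beginning
            continue
        i += 1
    return entries
```

With two adjacent conflicting entries, the first variant leaves the
second one in place, while the restarting variant resolves both.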
Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: xen-devel@lists.xenproject.org
Link: https://patch.msgid.link/20260505080653.197775-1-jgross@suse.com
|
|
The kernel's softirq API does not permit re-enabling softirqs while IRQs
are disabled. The reason for this is that local_bh_enable() will not
only re-enable delivery of softirqs over the back of IRQs, it will also
handle any pending softirqs immediately, regardless of whether IRQs are
enabled at that point.
For this reason, commit
d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs")
disables softirqs only when IRQs are enabled: doing so with IRQs
disabled is not permitted, and it is also unnecessary, given that
asynchronous softirq delivery never happens to begin with while IRQs
are disabled.
However, this does mean that entering a kernel mode FPU section with
IRQs enabled and leaving it with IRQs disabled leads to problems, as
identified by Sashiko [0]: the EFI page fault handler is called from
page_fault_oops() with IRQs disabled, and thus ends the kernel mode FPU
section with IRQs disabled as well, regardless of whether IRQs were
enabled when it was started. This may result in schedule() being called
with a non-zero preempt_count, causing a BUG().
So take care to re-enable IRQs when handling any EFI page faults if they
were taken with IRQs enabled.
[0] https://sashiko.dev/#/patchset/20260430074107.27051-1-ivan.hu%40canonical.com
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ivan Hu <ivan.hu@canonical.com>
Cc: x86@kernel.org
Cc: <stable@vger.kernel.org>
Fixes: d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs")
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
Since commit d02198550423 ("x86/fpu: Improve crypto performance by
making kernel-mode FPU reliably usable in softirqs"), kernel_fpu_begin()
calls fpregs_lock() which uses local_bh_disable() instead of the
previous preempt_disable(). This sets SOFTIRQ_OFFSET in preempt_count
during the entire EFI runtime service call, causing in_interrupt() to
return true in normal task context.
The graceful page fault handler efi_crash_gracefully_on_page_fault()
uses in_interrupt() to bail out for faults in real interrupt context.
With SOFTIRQ_OFFSET now set, the handler always bails out, leaving EFI
firmware page faults unhandled. This escalates to die() which also sees
in_interrupt() as true and calls panic("Fatal exception in interrupt"),
resulting in a hard system freeze. On systems with buggy firmware that
triggers page faults during EFI runtime calls (e.g., accessing unmapped
memory in GetTime()), this causes an unrecoverable hang instead of the
expected graceful EFI_ABORTED recovery.
Fix by replacing in_interrupt() with !in_task(). This preserves the
original intent of bailing for interrupts or NMI faults, while no longer
falsely triggering from the FPU code path's local_bh_disable().
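The difference between the two checks can be modelled in Python. The
sketch below loosely follows the preempt_count layout of non-PREEMPT_RT
kernels; the bit widths are assumptions, and it further assumes that
local_bh_disable() adds SOFTIRQ_DISABLE_OFFSET while actual softirq
processing adds SOFTIRQ_OFFSET.

```python
# Assumed preempt_count layout: 8 preempt bits, 8 softirq bits,
# 4 hardirq bits, 4 NMI bits.
SOFTIRQ_SHIFT, HARDIRQ_SHIFT, NMI_SHIFT = 8, 16, 20
SOFTIRQ_OFFSET = 1 << SOFTIRQ_SHIFT
SOFTIRQ_DISABLE_OFFSET = 2 * SOFTIRQ_OFFSET
HARDIRQ_OFFSET = 1 << HARDIRQ_SHIFT
SOFTIRQ_MASK = 0xFF << SOFTIRQ_SHIFT
HARDIRQ_MASK = 0xF << HARDIRQ_SHIFT
NMI_MASK = 0xF << NMI_SHIFT


def in_interrupt(count):
    # True for any softirq, hardirq or NMI count, including "BH disabled".
    return bool(count & (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK))


def in_serving_softirq(count):
    return bool(count & SOFTIRQ_MASK & SOFTIRQ_OFFSET)


def in_task(count):
    # False only in NMI, hardirq, or while actually serving a softirq.
    return not (count & (NMI_MASK | HARDIRQ_MASK) or in_serving_softirq(count))
```

In this model, a task that has merely disabled bottom halves is reported
as "in interrupt", yet in_task() still returns True for it, so a check
based on !in_task() no longer bails out in that situation.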
Fixes: d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ivan Hu <ivan.hu@canonical.com>
[ardb: Sashiko spotted that using 'in_hardirq() || in_nmi()' leaves a
window where a softirq may be taken before fpregs_lock() is
called, but after efi_rts_work.efi_rts_id has been assigned,
and any page faults occurring in that window will then be
misidentified as having been caused by the firmware. Instead,
use !in_task(), which incorporates in_serving_softirq(). ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
The shadow MMU computes GFNs for direct shadow pages using sp->gfn plus
the SPTE index. This assumption breaks for shadow paging if the guest
page tables are modified between VM entries (similar to commit
aad885e77496, "KVM: x86/mmu: Drop/zap existing present SPTE even
when creating an MMIO SPTE", 2026-03-27). The flow is as follows:
1. a PDE is installed for a 2MB mapping, and a page in that area is
accessed. KVM creates a kvm_mmu_page consisting of 512 4KB pages;
the kvm_mmu_page is marked by FNAME(fetch) as direct-mapped because
the guest's mapping is a huge page (and thus contiguous).
2. the PDE mapping is changed from outside the guest.
3. the guest accesses another page in the same 2MB area. KVM installs
a new leaf SPTE and rmap entry; the SPTE uses the "correct" GFN
(i.e. based on the new mapping, as changed in the previous step) but
that GFN is outside of the [sp->gfn, sp->gfn + 511] range; therefore
the rmap entry cannot be found and removed when the kvm_mmu_page
is zapped.
4. the memslot that covers the first 2MB mapping is deleted, and the
kvm_mmu_page for the now-invalid GPA is zapped. However, rmap_remove()
only looks at the [sp->gfn, sp->gfn + 511] range established in step 1,
and fails to find the rmap entry that was recorded by step 3.
5. any operation that causes an rmap walk for the same page accessed
by step 3 then walks a stale rmap and dereferences a freed kvm_mmu_page.
This includes dirty logging or MMU notifier invalidations (e.g., from
MADV_DONTNEED).
The underlying issue is that KVM's walking of shadow PTEs assumes that
if a SPTE is present when KVM wants to install a non-leaf SPTE, then the
existing kvm_mmu_page must be for the correct gfn. The reasoning is that
the only way for the gfn to be wrong is if KVM messed up and failed to
zap a SPTE, which shouldn't happen. In practice, however, the zap only
happens in response to a guest write.
That bug dates back literally forever, as even the first version of KVM
assumes that the GFN matches and walks into the "wrong" shadow page.
However, that was only an imprecision until 2032a93d66fa ("KVM: MMU:
Don't allocate gfns page for direct mmu pages") came along.
Fix it by checking for a target gfn mismatch and zapping the existing
SPTE. That way the old SP and rmap entries are gone, KVM installs
the rmap in the right location, and everyone is happy.
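The failure and the fix can be illustrated with a heavily simplified
Python model. Everything here is hypothetical scaffolding (a dict as the
rmap, a tiny ShadowPage class, a check_gfn flag); it only mirrors the
idea that a direct shadow page derives the GFN as sp.gfn plus the SPTE
index when removing rmap entries.

```python
class ShadowPage:
    def __init__(self, gfn):
        self.gfn = gfn          # base GFN of the direct shadow page
        self.indices = set()    # SPTE indices that are currently present


def rmap_remove(rmap, sp, index):
    gfn = sp.gfn + index        # direct pages derive the GFN from the index
    rmap.get(gfn, set()).discard((sp, index))


def zap(rmap, sp):
    for index in list(sp.indices):
        rmap_remove(rmap, sp, index)
    sp.indices.clear()


def map_page(rmap, sp, base_gfn, index, check_gfn):
    if check_gfn and sp.gfn != base_gfn:
        zap(rmap, sp)                   # the fix: zap on a gfn mismatch
        sp = ShadowPage(base_gfn)
    rmap.setdefault(base_gfn + index, set()).add((sp, index))
    sp.indices.add(index)
    return sp


def stale_entries_after_zap(check_gfn):
    rmap = {}
    sp = ShadowPage(0x1000)
    sp = map_page(rmap, sp, 0x1000, 0, check_gfn)  # first access
    sp = map_page(rmap, sp, 0x5000, 7, check_gfn)  # after the PDE changed
    zap(rmap, sp)
    return sum(len(entries) for entries in rmap.values())
```

Without the check, one rmap entry survives the final zap because it was
recorded under a GFN outside the range derived from the old sp.gfn; with
the check, no entry is left behind.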
Fixes: 2032a93d66fa ("KVM: MMU: Don't allocate gfns page for direct mmu pages")
Fixes: 6aa8b732ca01 ("kvm: userspace interface")
Reported-by: Alexander Bulekov <bkov@amazon.com>
Reported-by: Fred Griffoul <fgriffo@amazon.co.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260503201029.106481-1-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to "max_irr_is_from_pir", as neither of the
existing names is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR. I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.
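A minimal Python model of these return-value semantics, using plain
integers as 256-bit vector bitmaps. The function name and the tuple
return value are assumptions made for illustration only.

```python
def update_irr_from_pir(irr, pir):
    """Merge PIR into IRR; return (new_irr, max_irr, max_irr_is_from_pir)."""
    new_irr = irr | pir
    newly_pending = new_irr & ~irr           # vectors not in the IRR before
    max_irr = new_irr.bit_length() - 1       # -1 when nothing is pending
    max_new = newly_pending.bit_length() - 1
    max_irr_is_from_pir = max_new != -1 and max_new == max_irr
    return new_irr, max_irr, max_irr_is_from_pir
```

If vector 200 is already pending and the PIR only carries vector 50, the
IRR is updated but the flag is False, because the highest priority
pending vector did not come from the PIR.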
Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fall back to apic_find_highest_vector() when PID.ON is set but PIR
turns out to be empty, to correctly report the highest pending interrupt
from the existing IRR.
In a nested VM stress test, the following WARNING fires in
vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
vmx_sync_pir_to_irr() again) returns -1:
WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
Call Trace:
kvm_check_and_inject_events
vcpu_enter_guest.constprop.0
vcpu_run
kvm_arch_vcpu_ioctl_run
kvm_vcpu_ioctl
__x64_sys_ioctl
do_syscall_64
entry_SYSCALL_64_after_hwframe
The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
and __vmx_deliver_posted_interrupt() on a sender vCPU. The sender
performs two individually-atomic operations that are not a single
transaction:
1. pi_test_and_set_pir(vector) -- sets the PIR bit
2. pi_test_and_set_on() -- sets PID.ON
The following interleaving triggers the bug:
Sender vCPU (IPI): Target vCPU (1st sync_pir_to_irr):
B1: set PIR[vector]
A1: pi_clear_on()
A2: pi_harvest_pir() -> sees B1 bit
A3: xchg() -> consumes bit, PIR=0
(1st sync returns correct max_irr)
B2: set PID.ON = 1
Target vCPU (2nd sync_pir_to_irr):
C1: pi_test_on() -> TRUE (from B2)
C2: pi_clear_on() -> ON=0
C3: pi_harvest_pir() -> PIR empty
C4: *max_irr = -1, early return
IRR NOT SCANNED
The interrupt is not lost (it resides in the IRR from the first sync and
is recovered on the next vcpu_enter_guest() iteration), but the incorrect
max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
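The interleaving can be replayed deterministically in a small Python
model. It is a sketch only: the class, the "fallback" flag and the
assumption that PID.ON was already set by an earlier posting before the
first sync are all illustrative, not taken from the kernel sources.

```python
class Target:
    def __init__(self):
        self.pir = 0      # posted-interrupt requests (vector bitmap)
        self.on = False   # PID.ON
        self.irr = 0      # virtual APIC IRR (vector bitmap)


def highest(bits):
    return bits.bit_length() - 1  # -1 when empty


def sync_pir_to_irr(t, fallback):
    if not t.on:
        return highest(t.irr)
    t.on = False                  # pi_clear_on()
    pir, t.pir = t.pir, 0         # harvest and consume the PIR
    if pir == 0:
        # PID.ON was set but the PIR is empty: without the fallback the
        # existing IRR is never scanned.
        return highest(t.irr) if fallback else -1
    t.irr |= pir
    return highest(t.irr)


def replay(fallback, vector=0x41):
    t = Target()
    t.on = True                             # assumed earlier posting
    t.pir |= 1 << vector                    # B1: sender sets PIR[vector]
    first = sync_pir_to_irr(t, fallback)    # A1-A3: first sync
    t.on = True                             # B2: sender sets PID.ON
    second = sync_pir_to_irr(t, fallback)   # C1-C4: second sync
    return first, second
```

Without the fallback the second sync reports -1 although the vector is
still pending in the IRR; with the fallback both calls report the vector.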
Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
Reported-by: Farrah Chen <farrah.chen@intel.com>
Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/
Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Checking is_guest_mode(vcpu) is incorrect, because translate_nested_gpa()
is only valid if an L2 guest is running *with nested EPT/NPT enabled*.
Instead use the same condition as translate_nested_gpa() itself.
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Fixes: aee738236dca ("KVM: x86: Prepare kvm_hv_flush_tlb() to handle L2's GPAs", 2022-11-18)
Link: https://patch.msgid.link/20260503200905.106077-1-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Pull x86 fixes from Ingo Molnar:
- Prevent deadlock during shstk sigreturn (Rick Edgecombe)
- Disable FRED when PTI is forced on (Dave Hansen)
- Revert a CPA INVLPGB optimization that did not properly handle
discontiguous virtual addresses (Dave Hansen)
* tag 'x86-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Revert INVLPGB optimization for set_memory code
x86/cpu: Disable FRED when PTI is forced on
x86/shstk: Prevent deadlock during shstk sigreturn
|
|
tl;dr: Revert an INVLPGB optimization that did not properly handle
discontiguous virtual addresses.
Full story:
I got a report from some graphics (i915) folks that bisected a
regression in their test suite to 86e6815b316e ("x86/mm: Change
cpa_flush() to call flush_kernel_range() directly"). There was a bit
of flip-flopping on the exact bisect, but the code here does seem
wrong to me. The i915 folks were calling set_pages_array_wc(), so
using the CPA_PAGES_ARRAY mode.
Basically, the 'struct cpa_data' can wrap up all kinds of page table
changes. Some of these are virtually contiguous, but some are very
much not, which is one reason why there are ->vaddr and ->pages arrays.
86e6815b316e made the mistake of assuming that the virtual addresses
in the cpa_data are always contiguous. It got things right when neither
CPA_ARRAY nor CPA_PAGES_ARRAY is used, but is theoretically wrong when
either of those is used.
In the i915 case, it probably failed to flush some WB TLB entries and
install WC ones, leaving some data in the caches and not flushing it
out to where the device could see it. That eventually caused graphics
problems.
Revert the INVLPGB optimization. It can be reintroduced later, but it
will need to be a bit careful about the array modes.
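A small Python model of the contiguity assumption. It assumes, for
illustration only, that the reverted code flushed one range starting at
the first address and spanning the number of pages; the function names
are hypothetical.

```python
PAGE_SIZE = 4096


def covered_by_single_range(addrs):
    """Addresses hit when flushing [first, first + npages * PAGE_SIZE)."""
    start = addrs[0]
    end = start + len(addrs) * PAGE_SIZE
    return {addr for addr in addrs if start <= addr < end}


def missed_addresses(addrs):
    """Addresses that a single contiguous range flush would not reach."""
    return sorted(set(addrs) - covered_by_single_range(addrs))
```

For a contiguous run nothing is missed, but for an array of scattered
pages every address beyond the computed range is left unflushed.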
Fixes: 86e6815b316ec ("x86/mm: Change cpa_flush() to call flush_kernel_range()")
Reported-by: Cui, Ling <ling.cui@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20260421151909.6B3281C6@davehans-spike.ostc.intel.com
|
|
Pull PCMCIA updates from Dominik Brodowski:
"A number of minor PCMCIA bugfixes and cleanups, and a patch removing
obsolete host controller drivers"
* tag 'pcmcia-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
pcmcia: remove obsolete host controller drivers
pcmcia: Convert to use less arguments in pci_bus_for_each_resource()
PCMCIA: Fix garbled log messages for KERN_CONT
|
|
Pull kgdb update from Daniel Thompson:
"Only a very small update for kgdb this cycle: a single patch from
Kexin Sun that fixes some outdated comments"
* tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kgdb: update outdated references to kgdb_wait()
|
|
Pull s390 updates from Vasily Gorbik:
- Add support for CONFIG_PAGE_TABLE_CHECK and enable it in
debug_defconfig. s390 can only tell user from kernel PTEs via the mm,
so mm_struct is now passed into pxx_user_accessible_page() callbacks
- Expose the PCI function UID as an arch-specific slot attribute in
sysfs so a function can be identified by its user-defined id while
still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in
drivers/pci/slot.c
- Refresh s390 PCI documentation to reflect current behavior and cover
previously undocumented sysfs attributes
- zcrypt device driver cleanup series: consistent field types, clearer
variable naming, a kernel-doc warning fix, and a comment explaining
the intentional synchronize_rcu() in pkey_handler_register()
- Provide an s390 arch_raw_cpu_ptr() that avoids the detour via
get_lowcore() using alternatives, shrinking defconfig by ~27 kB
- Guard identity-base randomization with kaslr_enabled() so nokaslr
keeps the identity mapping at 0 even with RANDOMIZE_IDENTITY_BASE=y
- Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT &&
m, since built-in would not exercise module loading
- Remove the permanently commented-out HMCDRV_DEV_CLASS create_class()
code in the hmcdrv driver
- Drop stale ident_map_size extern conflicting with asm/page.h
* tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Fix warning about wrong kernel doc comment
PCI: s390: Expose the UID as an arch specific PCI slot attribute
docs: s390/pci: Improve and update PCI documentation
s390/pkey: Add comment about synchronize_rcu() to pkey base
s390/hmcdrv: Remove commented out code
s390/zcrypt: Slight rework on the agent_id field
s390/zcrypt: Explicitly use a card variable in _zcrypt_send_cprb
s390/zcrypt: Rework MKVP fields and handling
s390/zcrypt: Make apfs a real unsigned int field
s390/zcrypt: Rework domain processing within zcrypt device driver
s390/zcrypt: Move inline function rng_type6cprb_msgx from header to code
s390/percpu: Provide arch_raw_cpu_ptr()
s390: Enable page table check for debug_defconfig
s390/pgtable: Add s390 support for page table check
s390/pgtable: Use set_pmd_bit() to invalidate PMD entry
mm/page_table_check: Pass mm_struct to pxx_user_accessible_page()
s390/boot: Respect kaslr_enabled() for identity randomization
s390/Kconfig: Make modules sanity test a module-only option
s390/setup: Drop stale ident_map_size declaration
|
|
Pull Hyper-V updates from Wei Liu:
- Fix cross-compilation for hv tools (Aditya Garg)
- Fix vmemmap_shift exceeding MAX_FOLIO_ORDER in mshv_vtl (Naman Jain)
- Limit channel interrupt scan to relid high water mark (Michael
Kelley)
- Export hv_vmbus_exists() and use it in pci-hyperv (Dexuan Cui)
- Fix cleanup and shutdown issues for MSHV (Jork Loeser)
- Introduce more tracing support for MSHV (Stanislav Kinsburskii)
* tag 'hyperv-next-signed-20260421' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Skip LP/VP creation on kexec
x86/hyperv: move stimer cleanup to hv_machine_shutdown()
Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
mshv: Add tracepoint for GPA intercept handling
mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
tools: hv: Fix cross-compilation
Drivers: hv: vmbus: Export hv_vmbus_exists() and use it in pci-hyperv
mshv: Introduce tracing support
Drivers: hv: vmbus: Limit channel interrupt scan to relid high water mark
|
|
After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.
Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.
Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
FRED and PTI were never intended to work together. No FRED hardware is
vulnerable to Meltdown and all of it should have LASS anyway.
Nevertheless, if you boot a system with pti=on and fred=on, the kernel
tries to do what is asked of it and dies a horrible death on the first
attempt to run userspace (since it never switches to the user page
tables).
Disable FRED when PTI is forced on, and print a warning about it.
A quick brain dump about what a FRED+PTI implementation would look like
is below. I'm not sure it would make any sense to do it, but never say
never. All I know is that it's way too complicated to be worth it today.
<brain dump>
The SWITCH_TO_USER/KERNEL_CR3 bits are simple to fix (or at least we
have the assembly tools to do it already), as is sticking the FRED entry
text in .entry.text (it's not in there today).
The nasty part is the stacks. Today, the CPU pops into the kernel on
MSR_IA32_FRED_RSP0 which is normal old kernel memory and not mapped to
userspace. The hardware pushes gunk on to MSR_IA32_FRED_RSP0, which is
currently the task stacks. MSR_IA32_FRED_RSP0 would need to point
elsewhere, probably cpu_entry_stack(). Then, start playing games with
stacks on entry/exit, including copying gunk to and from the task stack.
While I'd *like* to have PTI everywhere, I'm not sure it's worth mucking
up the FRED code with PTI kludges. If a user wants fast entry/exit, they
use FRED. If you want PTI (and sekuritay), you certainly don't care
about fast entry and FRED isn't going to help you *all* that much, so
you can just stay with the IDT.
Plus, FRED hardware should have LASS which gives you a similar security
profile to PTI without the CR3 munging.
</brain dump>
Reported-by: Gayatri Kammela <Gayatri.Kammela@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260421163136.E7C6788A@davehans-spike.ostc.intel.com
|