aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2026-05-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-6/+8
Pull kvm fixes from Paolo Bonzini: "arm64: - Fix ITS EventID sanitisation when restoring an interrupt translation table. - Fix PPI memory leak when failing to initialise a vcpu. - Correctly return an error when the validation of a hypervisor trace descriptor fails, and limit this validation to protected mode only. RISC-V: - Fix invalid HVA warning in steal-time recording - Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and pmu_snapshot_set_shmem() - Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler - Fix sign extension of value for MMIO loads s390: - Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by the page table rewrite. x86: - Apply erratum #1235 workaround (disable AVIC IPI virtualization) on Hygon Family 18h, just like on AMD Family 17h. - When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency instead of the default. This is less confusing (read: not wrong) and makes it easier to fill in CPUID information that communicates the APIC bus frequency to the guest. Selftests: - Do not include glibc-internal <bits/endian.h>; it worked by chance and broke building KVM selftests with musl" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235) KVM: selftests: Verify that KVM returns the configured APIC cycle length KVM: x86: Return the VM's configured APIC bus frequency when queried KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h> KVM: s390: Properly reset zero bit in PGSTE KVM: s390: vsie: Fix redundant rmap entries KVM: s390: vsie: Fix unshadowing logic KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors KVM: s390: vsie: Fix memory leak when unshadowing KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc KVM: arm64: vgic: Free private_irqs when init fails after allocation KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits RISC-V: KVM: Fix sign extension for MMIO loads RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM RISC-V: KVM: Fix invalid HVA warning in steal-time recording
2026-05-24Merge tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds14-69/+134
Pull x86 fixes from Ingo Molnar: - On SEV guests, handle set_memory_{encrypted,decrypted}() failures more conservatively by assuming that all affected pages are unencrypted (Carlos López) - Disable broadcast TLB flush when PCID is disabled (Tom Lendacky) - Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra) - Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a KVM x2apic fix (Peter Zijlstra) - Fix incorrect munmap() size on map_vdso() failure (Guilherme Giacomo Simoes) * tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: sev-guest: Explicitly leak pages in unknown state x86/mm: Disable broadcast TLB flush when PCID is disabled x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred() x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
2026-05-23KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)Tina Zhang1-5/+7
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and share the same erratum #1235: hardware may read a stale IsRunning=1 bit during ICR write emulation and silently fail to generate an AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU. The absence of the VM-Exit causes KVM to miss the required wakeup of blocking target vCPUs, leading to hung vCPUs and unbounded delays in guest execution. Extend the existing AMD Family 17h erratum #1235 workaround to also cover Hygon Family 18h. With IPI virtualization disabled, KVM never sets IsRunning=1 in the Physical ID table, so every non-self IPI generates a VM-Exit and is correctly emulated. Fixes: 8de4a1c8164e ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235") Cc: <stable@vger.kernel.org> Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net> Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
2026-05-23KVM: x86: Return the VM's configured APIC bus frequency when queriedSean Christopherson1-1/+1
When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency, not KVM's default. Aside from the fact that returning the default frequency is blatantly wrong if userspace has changed the frequency, returning the configured frequency means userspace can blindly trust the result, e.g. when filling PV CPUID information that communicates the APIC bus frequency to the guest. Fixes: 6fef518594bc ("KVM: x86: Add a capability to configure bus frequency for APIC timer") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260522173526.3539407-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-21ring-buffer: Flush and stop persistent ring buffer on panicMasami Hiramatsu (Google)1-0/+1
On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-20x86/mm: Disable broadcast TLB flush when PCID is disabledTom Lendacky1-0/+1
Booting with "nopcid" clears X86_FEATURE_PCID and keeps CR4.PCIDE from being set to one. On AMD CPUs that support INVLPGB, broadcast TLB flushing remains enabled. There are two checks that decide whether the global ASID code runs, mm_global_asid() and consider_global_asid(), that key off of the X86_FEATURE_INVLPGB feature. Once an mm becomes active on more than three CPUs, consider_global_asid() assigns it a global ASID, after which flush_tlb_mm_range() takes the broadcast_tlb_flush() path using a non-zero PCID. Issuing an INVLPGB with a non-zero PCID while CR4.PCIDE is not set results in a #GP: Oops: general protection fault, kernel NULL pointer dereference 0x1: 0000 [#1] SMP NOPTI CPU: 158 UID: 0 PID: 3119 Comm: snap Not tainted 7.1.0-rc3 #1 PREEMPT(full) Hardware name: ... RIP: 0010:broadcast_tlb_flush Code: ... 89 da 48 83 c8 07 <0f> 01 fe eb 08 cc cc cc ... Call Trace: <TASK> flush_tlb_mm_range ptep_clear_flush wp_page_copy ? _raw_spin_unlock __handle_mm_fault handle_mm_fault do_user_addr_fault exc_page_fault asm_exc_page_fault All processors that support broadcast TLB invalidation also have PCID support, so it is only the "nopcid" scenario that is of concern. In this situation just disable the broadcast TLB support using the CPUID dependency support by making X86_FEATURE_INVLPGB dependent on X86_FEATURE_PCID. [ bp: Massage commit message. ] Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes") Suggested-by: Dave Hansen <dave.hansen@intel.com> Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Rik van Riel <riel@surriel.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/b915acfd63e8b2a094fdeb8dc608738072518764.1779296450.git.thomas.lendacky@amd.com
2026-05-19x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()Peter Zijlstra1-0/+13
Vishal reported that KVM unit test 'x2apic' started failing after commit 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming"). The reason is that KVM/VMX is injecting interrupts while it has interrupts disabled, for a context that will enable interrupts, this means that regs->flags.X86_EFLAGS_IF == 0 and irqentry_exit() will not do the right thing. Notably, irqentry_exit() must not call hrtimer_rearm_deferred() when the return context does not have IF set, because this will cause problems vs NMIs. Therefore, fix up the state after the injection. Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming") Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Suggested-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260423155936.957351833@infradead.org Closes: https://lore.kernel.org/r/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel%40intel.com
2026-05-19x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 corePeter Zijlstra12-68/+119
Move the VMX interrupt dispatch magic into the x86 core code. This isolates KVM from the FRED/IDT decisions and reduces the amount of EXPORT_SYMBOL_FOR_KVM(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
2026-05-19x86/vdso: Fix incorrect size in munmap() on map_vdso() failureGuilherme Giacomo Simoes1-1/+1
In map_vdso(), if a failure occurs during the installation of the VVAR mappings, the error path attempts to clean up previously allocated mappings using do_munmap(). However, the cleanup for the VVAR mapping is incorrectly using image->size (the size of the vDSO text) instead of the actual size allocated for the VVAR area. Replace the incorrect do_munmap() image->size parameter with the constant VDSO_NR_PAGES * PAGE_SIZE. Ensure the unmap size exactly matches the size used during the vdso_install_vvar_mapping() phase to provide a symmetrical and complete teardown of the memory region. Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping") Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20260503191609.551817-1-trintaeoitogc@gmail.com
2026-05-17Merge tag 'x86-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+8
Pull x86 fix from Ingo Molnar: - Fix x86 boot crash for non-kjump kexecs (David Woodhouse) * tag 'x86-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kexec: Push kjump return address even for non-kjump kexec
2026-05-17Merge tag 'ras-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-28/+5
Pull MCE fix from Ingo Molnar: - Fix an MCE polling interval adjustment regression (Borislav Petkov) * tag 'ras-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Restore MCA polling interval halving
2026-05-15Merge tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tipLinus Torvalds2-3/+7
Pull xen fixes from Juergen Gross: - one simple cleanup - a fix for a corner case when running as Xen PV dom0 - a fix of a regression for Xen PV guests, introduced in 7.0 * tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving x86/xen: Fix xen_e820_swap_entry_with_ram() xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h
2026-05-14Merge tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-3/+3
Pull ACPI support fixes from Rafael Wysocki: "These fix several platform drivers that use the ACPI companion of the given platform device without checking its presence, which may lead to a NULL pointer dereference or other kind of malfunction if the driver is forced to match a device without an ACPI companion via driver override, and restore debug log level for some messages in the ACPI CPPC library: - Check ACPI_COMPANION() against NULL during probe in several core ACPI device drivers (Rafael Wysocki) - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello)" * tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PAD: xen: Check ACPI_COMPANION() against NULL ACPI: driver: Check ACPI_COMPANION() against NULL during probe Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
2026-05-14Merge branch 'acpi-cppc'Rafael J. Wysocki1-3/+3
Merge a revert of an ACPI CPPC commit that increased the log level of some debug messages which turned out to be a bad idea: - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello) * acpi-cppc: Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
2026-05-14x86/xen: Tolerate nested XEN_LAZY_MMU entering/leavingJuergen Gross1-2/+6
With the support of nested lazy mmu sections it can happen that arch_enter_lazy_mmu_mode() is being called twice without a call of arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers are not disabling preemption when checking for nested lazy mmu sections. This is a problem when running as a Xen PV guest, as xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this case. Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order not to hurt all other lazy mmu mode users. Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching") Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260508143933.493013-1-jgross@suse.com>
2026-05-14x86/xen: Fix xen_e820_swap_entry_with_ram()Juergen Gross1-1/+1
When swapping a not page-aligned E820 map entry with RAM, the start address of the modified entry is calculated wrong (the offset into the page is subtracted instead of being added to the page address). Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Reported-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260505102417.208138-1-jgross@suse.com>
2026-05-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-37/+66
Pull kvm fixes from Paolo Bonzini: "arm64: - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0 - Correctly deal with permission faults in guest_memfd memslots - Fix the steal-time selftest after the infrastructure was reworked - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390 - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events - Handle the current vcpu being NULL on EL2 panic - Fix the selftest_vcpu memcache being empty at the point of donation or sharing - Check that the memcache has enough capacity before engaging on the share/donate path - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context s390: - Fix array overrun with large amounts of PCI devices x86: - Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention - Fix emulation of MOVNTDQA - Fix typo in Xen hypercall tracepoint - Add back an optimization that was left behind when recently fixing a bug - Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default Generic: - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: - Update stale links Selftests: - Fix guest_memfd_test with host page size > guest page size" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) KVM: VMX: introduce module parameter to disable CET KVM: x86: Swap the dst and src operand for MOVNTDQA KVM: x86: use again the flush argument of __link_shadow_page() KVM: selftests: Ensure gmem file sizes are multiple of host page size Documentation: kvm: update links in the references section of AMD Memory Encryption KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running KVM: x86: Fix Xen hypercall tracepoint argument assignment KVM: Reject wrapped offset in kvm_reset_dirty_gfn() KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path KVM: arm64: Make EL2 exception entry and exit context-synchronization events MAINTAINERS: Add Steffen as reviewer for KVM/arm64 KVM: arm64: Remove potential UB on nvhe tracing clock update KVM: selftests: arm64: Fix steal_time test after UAPI refactoring KVM: arm64: Handle permission faults with guest_memfd KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests ...
2026-05-13x86/mce: Restore MCA polling interval halvingBorislav Petkov (AMD)1-28/+5
RongQing reported that the MCA polling interval doesn't halve when an error gets logged. It was traced down to the commit in Fixes:, because: mce_timer_fn() |-> mce_poll_banks() |-> machine_check_poll() |-> mce_log() which will queue the work and return. Now, back in mce_timer_fn(): /* * Alert userspace if needed. If we logged an MCE, reduce the polling * interval, otherwise increase the polling interval. */ if (mce_notify_irq()) <--- here we haven't ran the notifier chain yet so mce_need_notify is not set yet so this won't hit and we won't halve the interval iv. Now the notifier chain runs. mce_early_notifier() sets the bit, does mce_notify_irq(), that clears the bit and then the notifier chain a little later logs the error. So this is a silly timing issue. But, that's all unnecessary. All it needs to happen here is, the "should we notify of a logged MCE" mce_notify_irq() asks, should be simply a question to the mce gen pool: "Are you empty?" And that then turns into a simple yes or no answer and it all JustWorks(tm). So do that and also distribute the functionality where it belongs: - Print that MCE events have been logged in mce_log() - Trigger the mcelog tool specific work in the first notifier As a result, mce_notify_irq() can go now. Fixes: 011d82611172 ("RAS: Add a Corrected Errors Collector") Reported-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20260112082747.2842-1-lirongqing@baidu.com
2026-05-13KVM: VMX: introduce module parameter to disable CETPaolo Bonzini2-2/+16
There have been reports of host hangs caused by CET virtualization. Until these are analyzed further, introduce a module parameter that makes it possible to easily disable it. Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/ Cc: David Riley <d.riley@proxmox.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12KVM: x86: Swap the dst and src operand for MOVNTDQASean Christopherson1-1/+1
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores to registers, while MOVNTDQ loads from registers and stores to memory. Per the SDM: MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal hint. MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type. Reported-by: Josh Eads <josheads@google.com> Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260506213514.2781948-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12KVM: x86: use again the flush argument of __link_shadow_page()Paolo Bonzini1-2/+21
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte() clears the SPTE but leaves the invalid_list empty. In this case, using kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill. Avoid flushing the entirety of the remote TLBs unless the invalid_list was populated: instead, use a more efficient gfn-targeting flush (if available) and skip it altogether if the caller guarantees that a TLB flush is not necessary. Based-on: <20260503201029.106481-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260503210917.121840-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is runningSean Christopherson2-31/+27
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running, and instead always configure vmcb02 according to L1's exact capabilities and desires. The purpose of intercepting PAUSE after N attempts is to detect when the vCPU may be stuck waiting on a lock, so that KVM can schedule in a different vCPU that may be holding said lock. Barring a very interesting setup, L1 and L2 do not share locks, and it's extremely unlikely that an L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU to make forward progress, and vice versa. While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is doable, in all likelihood it would do more harm than good for most setups. KVM has limited visibility into which L2 "vCPUs" belong to the same VM, and thus share a locking domain. And even if L2 vCPUs are in the same VM, KVM has no visilibity into L2 vCPU's that are scheduled out by the L1 hypervisor. Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as nested_svm_intercept() gives priority to the vmcb12 intercept. As such, overriding the count/threshold fields in vmcb02 with vmcb01's values is nonsensical, as doing so clobbers all the training/learning that has been done in L1. Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE exits, then KVM will adjust the PLE knobs based on L2 behavior, which could very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE training with bad data. And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even less sense, because again, the purpose of PLE is to detect spinning vCPUs. Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit has no relevance as to the behavior of the vCPU when it executes in L1. The only scenarios where any of this actually works is if at least one of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original changelog, those were the only scenarios considered to be supported. Disabling KVM's use of PLE makes it so the VM is always in a "supported" mode. Last, but certainly not least, using KVM's count/threshold instead of the values provided by L1 is a blatant violation of the SVM architecture. Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12KVM: x86: Fix Xen hypercall tracepoint argument assignmentQiang Ma1-1/+1
TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of __entry->a5. That overwrites the recorded a4 argument and leaves a5 unset in the trace entry. Fix the typo so both arguments are captured correctly. Signed-off-by: Qiang Ma <maqianga@uniontech.com> Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/ Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-11x86/CPU/AMD: Prevent improper isolation of shared resources in Zen2's op cachePrathyushi Nangia2-1/+5
Make sure resources are not improperly shared in the op cache and cause instruction corruption this way. Signed-off-by: Prathyushi Nangia <prathyushi.nangia@amd.com> Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-08Merge tag 'x86-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-5/+14
Pull x86 fixes from Ingo Molnar: - Fix memory map enumeration bug in the Xen e820 parsing code (Juergen Gross) - Re-enable e820 BIOS fallback if e820 table is empty (David Gow) * tag 'x86-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/e820: Re-enable BIOS fallback if e820 table is empty x86/xen: Fix a potential problem in xen_e820_resolve_conflicts()
2026-05-08Merge tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-16/+57
Pull perf events fixes from Ingo Molnar: - Fix deadlock in the perf_mmap() failure path (Peter Zijlstra) - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi): - Fix validation and configuration of ACR masks - Fix ACR rescheduling bug causing stale masks - Disable the PMI on ACR-enabled hardware - Enable ACR on Panther Cover uarch too * tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Enable auto counter reload for DMR perf/x86/intel: Disable PMI for self-reloaded ACR events perf/x86/intel: Always reprogram ACR events to prevent stale masks perf/x86/intel: Improve validation and configuration of ACR masks perf/core: Fix deadlock in perf_mmap() failure path
2026-05-08Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"Mario Limonciello1-3/+3
Some older systems don't support CPPC in the firmware and this just makes noise for them when booting. Drop back to debug. This reverts commit 21fb59ab4b9767085f4fe1edbdbe3177fbb9ec97. Fixes: 21fb59ab4b976 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn") Suggested-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Tested-by: Kim Phillips <kim.phillips@amd.com> Cc: All applicable <stable@vger.kernel.org> Link: https://patch.msgid.link/20260504230141.484743-2-mario.limonciello@amd.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-05-08x86/kexec: Push kjump return address even for non-kjump kexecDavid Woodhouse1-0/+8
The version of purgatory code shipped by kexec-tools attempts to look above the top of its stack to find a return address for a kjump, even in a non-kjump kexec. After the commit in Fixes: the word above the stack might not be there, leading to a fault (which is at least now caught by my exception-handling code in kexec). That commit fixed things for the actual kjump path, but no longer "gratuitously" pushes the unused return address to the stack in the non-kjump path. Put that *back* in the non-kjump path, to prevent purgatory from crashing when trying to access it. Fixes: 2cacf7f23a02 ("x86/kexec: Fix stack and handling of re-entry point for ::preserve_context") Reported-by: Rohan Kakulawaram <rohanka@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Rohan Kakulawaram <rohanka@google.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/32d627134143ffd957891cb697138e839c623211.camel@infradead.org
2026-05-07x86/boot/e820: Re-enable BIOS fallback if e820 table is emptyDavid Gow1-1/+5
In commit: 157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables") the check on the number of entries in the e820 table was removed. The intention was to support single-entry maps, but by removing the check entirely, we also skip the fallback (to, e.g., the BIOS 88h function). This means that if no E820 map is passed in from the bootloader (which is the case on some bootloaders, like linld), we end up with an empty memory map, and the kernel fails to boot (either by deadlocking on OOM, or by failing to allocate the real mode trampoline, or similar). Re-instate the check in append_e820_table(), but only check that nr_entries is non-zero. This allows e820__memory_setup_default() to fall back to other memory size sources, and doesn't affect e820__memory_setup_extended(), as the latter ignores the return value from append_e820_table(). In doing so, we also update the return values to be proper error codes, with -ENOENT for this case (there are no entries), and -EINVAL for the case where an entry appears invalid. Given none of the callers check the actual value -- just whether it's nonzero -- this is largely aesthetic in practice. Tested against linld, and the kernel boots again fine. [ mingo: Readability edits to the comment and the changelog. ] Fixes: 157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables") Signed-off-by: David Gow <david@davidgow.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: stable@vger.kernel.org Cc: Arnd Bergmann <arnd@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://patch.msgid.link/20260416065746.1896647-1-david@davidgow.net
2026-05-06Merge tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efiLinus Torvalds3-4/+14
Pull EFI fixes from Ard Biesheuvel: - Fix issues in EFI graceful recovery on x86 introduced by changes to the kernel mode FPU APIs - I-cache coherency fixes for the LoongArch EFI stub - Locking fix for EFI pstore - Code tweak for efivarfs * tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efi: Restore IRQ state in EFI page fault handler x86/efi: Fix graceful fault handling after FPU softirq changes efi/libstub: Synchronize instruction cache after kernel relocation efi/loongarch: Implement efi_cache_sync_image() efi/libstub: Move efi_relocate_kernel() into its only remaining user efi: pstore: Drop efivar lock when efi_pstore_open() returns with an error efivarfs: use QSTR() in efivarfs_alloc_dentry
2026-05-05perf/x86/intel: Enable auto counter reload for DMRDapeng Mi1-0/+1
Panther cove µarch starts to support auto counter reload (ACR), but the static_call intel_pmu_enable_acr_event() is not updated for the Panther Cove µarch used by DMR. It leads to the auto counter reload is not really enabled on DMR. Update static_call intel_pmu_enable_acr_event() in intel_pmu_init_pnc(). Fixes: d345b6bb8860 ("perf/x86/intel: Add core PMU support for DMR") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-5-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Disable PMI for self-reloaded ACR eventsDapeng Mi2-4/+23
On platforms with Auto Counter Reload (ACR) support, such as NVL, a "NMI received for unknown reason 30" warning is observed when running multiple events in a group with ACR enabled: $ perf record -e '{instructions/period=20000,acr_mask=0x2/u,\ cycles/period=40000,acr_mask=0x3/u}' ./test The warning occurs because the Performance Monitoring Interrupt (PMI) is enabled for the self-reloaded event (the cycles event in this case). According to the Intel SDM, the overflow bit (IA32_PERF_GLOBAL_STATUS.PMCn_OVF) is never set for self-reloaded events. Since the bit is not set, the perf NMI handler cannot identify the source of the interrupt, leading to the "unknown reason" message. Furthermore, enabling PMI for self-reloaded events is unnecessary and can lead to extraneous records that pollute the user's requested data. Disable the interrupt bit for all events configured with ACR self-reload. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-4-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Always reprogram ACR events to prevent stale masksDapeng Mi1-5/+8
Members of an ACR group are logically linked via a bitmask of their hardware counter indices. If some members of the group are assigned new hardware counters during rescheduling, even events that keep their original counter index must be updated with a new mask. Without this, an event will continue to use a stale acr_mask that references the old indices of its group peers. Ensure all ACR events are reprogrammed during the scheduling path to maintain consistency across the group. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-3-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Improve validation and configuration of ACR masksDapeng Mi1-7/+25
Currently there are several issues on the user space ACR mask validation and configuration. - The validation for user space ACR mask (attr.config2) is incomplete, e.g., the ACR mask could include the index which belongs to another ACR events group, but it's not validated. - An early return on an invalid ACR mask caused all subsequent ACR groups to be skipped. - The stale hardware ACR mask (hw.config1) is not cleared before setting new hardware ACR mask. The following changes address all of the above issues. - Figure out the event index group of an ACR group. Any bits in the user-space mask not present in the index group are now dropped. - Instead of an early return on invalid bits, drop only the invalid portions and continue iterating through all ACR events to ensure full configuration. - Explicitly clear the stale hardware ACR mask for each event prior to writing the new configuration. Besides, a non-leader event member of ACR group could be disabled in theory. This could cause bit-shifting errors in the acr_mask of remaining group members. But since ACR sampling requires all events to be active, this should not be a big concern in real use case. Add a "FIXME" comment to notice this risk. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-2-dapeng1.mi@linux.intel.com
2026-05-05x86/xen: Fix a potential problem in xen_e820_resolve_conflicts()Juergen Gross1-4/+9
When fixing a conflict in xen_e820_resolve_conflicts(), the loop over the E820 map entries needs to be restarted, as the E820 map will have been modified by the fix. Otherwise entries might be skipped by accident. Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://patch.msgid.link/20260505080653.197775-1-jgross@suse.com
2026-05-05x86/efi: Restore IRQ state in EFI page fault handlerArd Biesheuvel3-3/+13
The kernel's softirq API does not permit re-enabling softirqs while IRQs are disabled. The reason for this is that local_bh_enable() will not only re-enable delivery of softirqs over the back of IRQs, it will also handle any pending softirqs immediately, regardless of whether IRQs are enabled at that point. For this reason, commit d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs") disables softirqs only when IRQs are enabled, as it is not permitted otherwise, but also unnecessary, given that asynchronous softirq delivery never happens to begin with while IRQs are disabled. However, this does mean that entering a kernel mode FPU section with IRQs enabled and leaving it with IRQs disabled leads to problems, as identified by Sashiko [0]: the EFI page fault handler is called from page_fault_oops() with IRQs disabled, and thus ends the kernel mode FPU section with IRQs disabled as well, regardless of whether IRQs were enabled when it was started. This may result in schedule() being called with a non-zero preempt_count, causing a BUG(). So take care to re-enable IRQs when handling any EFI page faults if they were taken with IRQs enabled. [0] https://sashiko.dev/#/patchset/20260430074107.27051-1-ivan.hu%40canonical.com Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ivan Hu <ivan.hu@canonical.com> Cc: x86@kernel.org Cc: <stable@vger.kernel.org> Fixes: d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs") Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-05-04x86/efi: Fix graceful fault handling after FPU softirq changesIvan Hu1-1/+1
Since commit d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs"), kernel_fpu_begin() calls fpregs_lock() which uses local_bh_disable() instead of the previous preempt_disable(). This sets SOFTIRQ_OFFSET in preempt_count during the entire EFI runtime service call, causing in_interrupt() to return true in normal task context. The graceful page fault handler efi_crash_gracefully_on_page_fault() uses in_interrupt() to bail out for faults in real interrupt context. With SOFTIRQ_OFFSET now set, the handler always bails out, leaving EFI firmware page faults unhandled. This escalates to die() which also sees in_interrupt() as true and calls panic("Fatal exception in interrupt"), resulting in a hard system freeze. On systems with buggy firmware that triggers page faults during EFI runtime calls (e.g., accessing unmapped memory in GetTime()), this causes an unrecoverable hang instead of the expected graceful EFI_ABORTED recovery. Fix by replacing in_interrupt() with !in_task(). This preserves the original intent of bailing for interrupts or NMI faults, while no longer falsely triggering from the FPU code path's local_bh_disable(). Fixes: d02198550423 ("x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs") Cc: <stable@vger.kernel.org> Signed-off-by: Ivan Hu <ivan.hu@canonical.com> [ardb: Sashiko spotted that using 'in_hardirq() || in_nmi()' leaves a window where a softirq may be taken before fpregs_lock() is called, but after efi_rts_work.efi_rts_id has been assigned, and any page faults occurring in that window will then be misidentified as having been caused by the firmware. Instead, use !in_task(), which incorporates in_serving_softirq(). ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-05-03KVM: x86: Fix shadow paging use-after-free due to unexpected GFNSean Christopherson1-21/+14
The shadow MMU computes GFNs for direct shadow pages using sp->gfn plus the SPTE index. This assumption breaks for shadow paging if the guest page tables are modified between VM entries (similar to commit aad885e77496, "KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE", 2026-03-27). The flow is as follows: - a PDE is installed for a 2MB mapping, and a page in that area is accessed. KVM creates a kvm_mmu_page consisting of 512 4KB pages; the kvm_mmu_page is marked by FNAME(fetch) as direct-mapped because the guest's mapping is a huge page (and thus contiguous). - the PDE mapping is changed from outside the guest. - the guest accesses another page in the same 2MB area. KVM installs a new leaf SPTE and rmap entry; the SPTE uses the "correct" GFN (i.e. based on the new mapping, as changed in the previous step) but that GFN is outside of the [sp->gfn, sp->gfn + 511] range; therefore the rmap entry cannot be found and removed when the kvm_mmu_page is zapped. - the memslot that covers the first 2MB mapping is deleted, and the kvm_mmu_page for the now-invalid GPA is zapped. However, rmap_remove() only looks at the [sp->gfn, sp->gfn + 511] range established in step 1, and fails to find the rmap entry that was recorded by step 3. - any operation that causes an rmap walk for the same page accessed by step 3 then walks a stale rmap and dereferences a freed kvm_mmu_page. This includes dirty logging or MMU notifier invalidations (e.g., from MADV_DONTNEED). The underlying issue is that KVM's walking of shadow PTEs assumes that if a SPTE is present when KVM wants to install a non-leaf SPTE, then the existing kvm_mmu_page must be for the correct gfn. Because the only way for the gfn to be wrong is if KVM messed up and failed to zap a SPTE... which shouldn't happen, but *actually* only happens in response to a guest write. That bug dates back literally forever, as even the first version of KVM assumes that the GFN matches and walks into the "wrong" shadow page. However, that was only an imprecision until 2032a93d66fa ("KVM: MMU: Don't allocate gfns page for direct mmu pages") came along. Fix it by checking for a target gfn mismatch and zapping the existing SPTE. That way the old SP and rmap entries are gone, KVM installs the rmap in the right location, and everyone is happy. Fixes: 2032a93d66fa ("KVM: MMU: Don't allocate gfns page for direct mmu pages") Fixes: 6aa8b732ca01 ("kvm: userspace interface") Reported-by: Alexander Bulekov <bkov@amazon.com> Reported-by: Fred Griffoul <fgriffo@amazon.co.uk> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201029.106481-1-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-03KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flowSean Christopherson2-16/+40
Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s "got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither "irr_updated" nor "got_posted_interrupt" is accurate. __kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return true if and only if the highest priority IRQ, i.e. max_irr, is a "new" pending IRQ from the PIR. I.e. it's possible for the IRR to be updated, i.e. for a posted IRQ to be "got", *without* the APIs returning true. Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's safe to do so only if a new IRQ is also the highest priority pending IRQ. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-03KVM: x86: Do IRR scan in __kvm_apic_update_irr even if PIR is emptyPaolo Bonzini1-3/+5
Fall back to apic_find_highest_vector() when PID.ON is set but PIR turns out to be empty, to correctly report the highest pending interrupt from the existing IRR. In a nested VM stress test, the following WARNING fires in vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending interrupt but the subsequent kvm_apic_has_interrupt() (which invokes vmx_sync_pir_to_irr() again) returns -1: WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel] Call Trace: kvm_check_and_inject_events vcpu_enter_guest.constprop.0 vcpu_run kvm_arch_vcpu_ioctl_run kvm_vcpu_ioctl __x64_sys_ioctl do_syscall_64 entry_SYSCALL_64_after_hwframe The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU and __vmx_deliver_posted_interrupt() on a sender vCPU. The sender performs two individually-atomic operations that are not a single transaction: 1. pi_test_and_set_pir(vector) -- sets the PIR bit 2. pi_test_and_set_on() -- sets PID.ON The following interleaving triggers the bug: Sender vCPU (IPI): Target vCPU (1st sync_pir_to_irr): B1: set PIR[vector] A1: pi_clear_on() A2: pi_harvest_pir() -> sees B1 bit A3: xchg() -> consumes bit, PIR=0 (1st sync returns correct max_irr) B2: set PID.ON = 1 Target vCPU (2nd sync_pir_to_irr): C1: pi_test_on() -> TRUE (from B2) C2: pi_clear_on() -> ON=0 C3: pi_harvest_pir() -> PIR empty C4: *max_irr = -1, early return IRR NOT SCANNED The interrupt is not lost (it resides in the IRR from the first sync and is recovered on the next vcpu_enter_guest() iteration), but the incorrect max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle. Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR") Reported-by: Farrah Chen <farrah.chen@intel.com> Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/ Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-03KVM: x86: check for nEPT/nNPT in slow flush hypercallsPaolo Bonzini1-1/+1
Checking is_guest_mode(vcpu) is incorrect, because translate_nested_gpa() is only valid if an L2 guest is running *with nested EPT/NPT enabled*. Instead use the same condition as translate_nested_gpa() itself. Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Fixes: aee738236dca ("KVM: x86: Prepare kvm_hv_flush_tlb() to handle L2's GPAs", 2022-11-18) Link: https://patch.msgid.link/20260503200905.106077-1-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-04-24Merge tag 'x86-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-26/+42
Pull x86 fixes from Ingo Molnar: - Prevent deadlock during shstk sigreturn (Rick Edgecombe) - Disable FRED when PTI is forced on (Dave Hansen) - Revert a CPA INVLPGB optimization that did not properly handle discontiguous virtual addresses (Dave Hansen) * tag 'x86-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Revert INVLPGB optimization for set_memory code x86/cpu: Disable FRED when PTI is forced on x86/shstk: Prevent deadlock during shstk sigreturn
2026-04-24x86/mm: Revert INVLPGB optimization for set_memory codeDave Hansen1-7/+13
tl;dr: Revert an INVLPGB optimization that did not properly handle discontiguous virtual addresses. Full story: I got a report from some graphics (i915) folks that bisected a regression in their test suite to 86e6815b316e ("x86/mm: Change cpa_flush() to call flush_kernel_range() directly"). There was a bit of flip-flopping on the exact bisect, but the code here does seem wrong to me. The i915 folks were calling set_pages_array_wc(), so using the CPA_PAGES_ARRAY mode. Basically, the 'struct cpa_data' can wrap up all kinds of page table changes. Some of these are virtually contiguous, but some are very much not which is one reason why there are ->vaddr and ->pages arrays. 86e6815b316e made the mistake of assuming that the virtual addresses in the cpa_data are always contiguous. It got things right when neither CPA_ARRAY/CPA_PAGES_ARRAY is used, but theoretically wrong when either of those is used. In the i915 case, it probably failed to flush some WB TLB entries and install WC ones, leaving some data in the caches and not flushing it out to where the device could see it. That eventually caused graphics problems. Revert the INVLPGB optimization. It can be reintroduced later, but it will need to be a bit careful about the array modes. Fixes: 86e6815b316ec ("x86/mm: Change cpa_flush() to call flush_kernel_range()") Reported-by: Cui, Ling <ling.cui@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260421151909.6B3281C6@davehans-spike.ostc.intel.com
2026-04-23Merge tag 'pcmcia-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linuxLinus Torvalds1-1/+1
Pull PCMCIA updates from Dominik Brodowski: "A number of minor PCMCIA bugfixes and cleanups, and a patch removing obsolete host controller drivers" * tag 'pcmcia-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: pcmcia: remove obsolete host controller drivers pcmcia: Convert to use less arguments in pci_bus_for_each_resource() PCMCIA: Fix garbled log messages for KERN_CONT
2026-04-22Merge tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linuxLinus Torvalds1-4/+5
Pull kgdb update from Daniel Thompson: "Only a very small update for kgdb this cycle: a single patch from Kexin Sun that fixes some outdated comments" * tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kgdb: update outdated references to kgdb_wait()
2026-04-22Merge tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linuxLinus Torvalds1-3/+3
Pull s390 updates from Vasily Gorbik: - Add support for CONFIG_PAGE_TABLE_CHECK and enable it in debug_defconfig. s390 can only tell user from kernel PTEs via the mm, so mm_struct is now passed into pxx_user_accessible_page() callbacks - Expose the PCI function UID as an arch-specific slot attribute in sysfs so a function can be identified by its user-defined id while still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in drivers/pci/slot.c - Refresh s390 PCI documentation to reflect current behavior and cover previously undocumented sysfs attributes - zcrypt device driver cleanup series: consistent field types, clearer variable naming, a kernel-doc warning fix, and a comment explaining the intentional synchronize_rcu() in pkey_handler_register() - Provide an s390 arch_raw_cpu_ptr() that avoids the detour via get_lowcore() using alternatives, shrinking defconfig by ~27 kB - Guard identity-base randomization with kaslr_enabled() so nokaslr keeps the identity mapping at 0 even with RANDOMIZE_IDENTITY_BASE=y - Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT && m, since built-in would not exercise module loading - Remove the permanently commented-out HMCDRV_DEV_CLASS create_class() code in the hmcdrv driver - Drop stale ident_map_size extern conflicting with asm/page.h * tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: Fix warning about wrong kernel doc comment PCI: s390: Expose the UID as an arch specific PCI slot attribute docs: s390/pci: Improve and update PCI documentation s390/pkey: Add comment about synchronize_rcu() to pkey base s390/hmcdrv: Remove commented out code s390/zcrypt: Slight rework on the agent_id field s390/zcrypt: Explicitly use a card variable in _zcrypt_send_cprb s390/zcrypt: Rework MKVP fields and handling s390/zcrypt: Make apfs a real unsigned int field s390/zcrypt: Rework domain processing within zcrypt device driver s390/zcrypt: Move inline function rng_type6cprb_msgx from header to code s390/percpu: Provide arch_raw_cpu_ptr() s390: Enable page table check for debug_defconfig s390/pgtable: Add s390 support for page table check s390/pgtable: Use set_pmd_bit() to invalidate PMD entry mm/page_table_check: Pass mm_struct to pxx_user_accessible_page() s390/boot: Respect kaslr_enabled() for identity randomization s390/Kconfig: Make modules sanity test a module-only option s390/setup: Drop stale ident_map_size declaration
2026-04-22Merge tag 'hyperv-next-signed-20260421' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linuxLinus Torvalds1-2/+13
Pull Hyper-V updates from Wei Liu: - Fix cross-compilation for hv tools (Aditya Garg) - Fix vmemmap_shift exceeding MAX_FOLIO_ORDER in mshv_vtl (Naman Jain) - Limit channel interrupt scan to relid high water mark (Michael Kelley) - Export hv_vmbus_exists() and use it in pci-hyperv (Dexuan Cui) - Fix cleanup and shutdown issues for MSHV (Jork Loeser) - Introduce more tracing support for MSHV (Stanislav Kinsburskii) * tag 'hyperv-next-signed-20260421' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Skip LP/VP creation on kexec x86/hyperv: move stimer cleanup to hv_machine_shutdown() Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing mshv: Add tracepoint for GPA intercept handling mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER tools: hv: Fix cross-compilation Drivers: hv: vmbus: Export hv_vmbus_exists() and use it in pci-hyperv mshv: Introduce tracing support Drivers: hv: vmbus: Limit channel interrupt scan to relid high water mark
2026-04-22x86/hyperv: Skip LP/VP creation on kexecJork Loeser1-0/+7
After a kexec the logical processors and virtual processors already exist in the hypervisor because they were created by the previous kernel. Attempting to add them again causes either a BUG_ON or corrupted VP state leading to MCEs in the new kernel. Add hv_lp_exists() to probe whether an LP is already present by calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the LP exists and we skip the add-LP and create-VP loops entirely. Also add hv_call_notify_all_processors_started() which informs the hypervisor that all processors are online. This is required after adding LPs (fresh boot) and is a no-op on kexec since we skip that path. Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com> Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-04-22x86/hyperv: move stimer cleanup to hv_machine_shutdown()Jork Loeser1-2/+6
Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to hv_machine_shutdown() in the platform code. This ensures stimer cleanup happens before the vmbus unload, which is required for root partition kexec to work correctly. Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-04-21x86/cpu: Disable FRED when PTI is forced onDave Hansen1-0/+5
FRED and PTI were never intended to work together. No FRED hardware is vulnerable to Meltdown and all of it should have LASS anyway. Nevertheless, if you boot a system with pti=on and fred=on, the kernel tries to do what is asked of it and dies a horrible death on the first attempt to run userspace (since it never switches to the user page tables). Disable FRED when PTI is forced on, and print a warning about it. A quick brain dump about what a FRED+PTI implementation would look like is below. I'm not sure it would make any sense to do it, but never say never. All I know is that it's way too complicated to be worth it today. <brain dump> The SWITCH_TO_USER/KERNEL_CR3 bits are simple to fix (or at least we have the assembly tools to do it already), as is sticking the FRED entry text in .entry.text (it's not in there today). The nasty part is the stacks. Today, the CPU pops into the kernel on MSR_IA32_FRED_RSP0 which is normal old kernel memory and not mapped to userspace. The hardware pushes gunk on to MSR_IA32_FRED_RSP0, which is currently the task stacks. MSR_IA32_FRED_RSP0 would need to point elsewhere, probably cpu_entry_stack(). Then, start playing games with stacks on entry/exit, including copying gunk to and from the task stack. While I'd *like* to have PTI everywhere, I'm not sure it's worth mucking up the FRED code with PTI kludges. If a user wants fast entry/exit, they use FRED. If you want PTI (and sekuritay), you certainly don't care about fast entry and FRED isn't going to help you *all* that much, so you can just stay with the IDT. Plus, FRED hardware should have LASS which gives you a similar security profile to PTI without the CR3 munging. </brain dump> Reported-by: Gayatri Kammela <Gayatri.Kammela@amd.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Cc:stable@vger.kernel.org Link: https://patch.msgid.link/20260421163136.E7C6788A@davehans-spike.ostc.intel.com