aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/arch/x86/kvm/x86.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2024-07-16Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-7/+10
KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs.
2024-07-16Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-12/+12
KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-41/+71
KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-19/+17
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-12KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()Paolo Bonzini1-0/+3
Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest memory. It can be called right after KVM_CREATE_VCPU creates a vCPU, since at that point kvm_mmu_create() and kvm_init_mmu() are called and the vCPU is ready to invoke the KVM page fault handler. The helper function kvm_tdp_map_page() takes care of the logic to process RET_PF_* return values and convert them to success or errno. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-28KVM: X86: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variablesPeng Hao1-5/+4
Some variables allocated in kvm_arch_vcpu_ioctl are released when the function exits, so there is no need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20240624012016.46133-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max numberDapeng Mi1-6/+9
Refine the macros which define maximum General Purpose (GP) and fixed counter numbers. Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the maximum supported General Purpose (GP) counter number ambiguously across Intel and AMD platforms. This would cause issues if AMD begins to support more GP counters than Intel. Thus a bunch of new macros including vendor specific and vendor independent are introduced to replace the old macros. The vendor independent macros are used in x86 common code to hide vendor difference and eliminate the ambiguity. No logic changes are introduced in this patch. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet injectSean Christopherson1-1/+4
WARN if a blocking vCPU is awakened by a valid wake event that KVM can't inject, e.g. because KVM needs to complete a nested VM-enter, or needs to re-inject an exception. For the nested VM-Enter case, KVM is supposed to clear "nested_run_pending" if L1 puts L2 into HLT, i.e. entering HLT "completes" the nested VM-Enter. And for already-injected exceptions, it should be impossible for the vCPU to be in a blocking state if a VM-Exit occurred while an exception was being vectored. Link: https://lore.kernel.org/r/20240607172609.3205077-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()Sean Christopherson1-9/+1
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28KVM: nVMX: Request immediate exit iff pending nested event needs injectionSean Christopherson1-2/+2
When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-20Merge branch 'kvm-6.10-fixes' into HEADPaolo Bonzini1-5/+4
2024-06-20KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routesSean Christopherson1-5/+4
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-18KVM: Introduce vcpu->wants_to_runDavid Matlack1-2/+2
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: x86: Prevent excluding the BSP on setting max_vcpu_idsSean Christopherson1-1/+3
If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_IDMathias Krause1-0/+3
Do not accept IDs which are definitely invalid by limit checking the passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was already set. This ensures invalid values, especially on 64-bit systems, don't go unnoticed and lead to a valid id by chance when truncated by the final assignment. Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run()Sean Christopherson1-1/+0
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU loadSean Christopherson1-6/+5
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: Delete the now unused kvm_arch_sched_in()Sean Christopherson1-5/+0
Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()Sean Christopherson1-7/+10
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIPYi Wang1-3/+0
Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <foxywang@tencent.com> Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flagThomas Prescher1-0/+3
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bitSean Christopherson1-0/+6
Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in 32-bit Protected Mode (including Compatibility Mode) should #UD or succeed. The existing code already does the exact equivalent of guest_cpuid_is_intel_compatible(), just in a rather roundabout way. No functional change intended. Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUsSean Christopherson1-8/+6
Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that are Intel compatible, not just strictly vCPUs with vendor==Intel. The behavior is explicitly called out in the SDM, and thus architectural, i.e. applies to all CPUs that implement Intel's architecture, and isn't a quirk that is unique to CPUs manufactured by Intel: However, if an instruction breakpoint is placed on an instruction located immediately after a POP SS/MOV SS instruction, the breakpoint will be suppressed as if EFLAGS.RF were 1. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit baf67ca8e545 ("KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active"). Link: https://lore.kernel.org/r/20240405235603.1173076-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: x86: Apply Intel's TSC_AUX reserved-bit behavior to Intel compat vCPUsSean Christopherson1-4/+4
Extend Intel's check on MSR_TSC_AUX[63:32] to all vCPU models that are Intel compatible, i.e. aren't AMD or Hygon in KVM's world, as the behavior is architectural, i.e. applies to any CPU that is compatible with Intel's architecture. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit 61a05d444d2c ("KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model"). Link: https://lore.kernel.org/r/20240405235603.1173076-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-07KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit pathYan Zhao1-0/+6
Ensure a full memory barrier is emitted in the VM-Exit path, as a full barrier is required on Intel CPUs to evict WC buffers. This will allow unconditionally honoring guest PAT on Intel CPUs that support self-snoop. As srcu_read_lock() is always called in the VM-Exit path and it internally has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a second fence and make sure smp_mb() is called without dependency on implementation details of srcu_read_lock(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: massage changelog] Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1Sean Christopherson1-6/+0
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization"), and followed by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not the necessitate by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Remove VMX support for virtualizing guest MTRR memtypesSean Christopherson1-8/+8
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte |= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | VMX_EPT_IGMT_BIT); which came in via: commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. * Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasonsAlejandro Jimenez1-1/+1
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to APICV_INHIBIT_REASON_DISABLED. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Print names of apicv inhibit reasons in tracesAlejandro Jimenez1-0/+4
Use the tracing infrastructure helper __print_flags() for printing flag bitfields, to enhance the trace output by displaying a string describing each of the inhibit reasons set. The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap value, requiring the user to consult the source file where the inhibit reasons are defined to decode the trace output. Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Add a capability to configure bus frequency for APIC timerIsaku Yamahata1-0/+27
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC bus clock frequency for APIC timer emulation. Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the frequency in nanoseconds. When using this capability, the user space VMM should configure CPUID leaf 0x15 to advertise the frequency. Vishal reported that the TDX guest kernel expects a 25MHz APIC bus frequency but ends up getting interrupts at a significantly higher rate. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is enumerated, the guest kernel uses it as the APIC bus frequency. If not, the guest kernel measures the frequency based on other known timers like the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest kernel expects a 25MHz timer frequency but gets timer interrupt more frequently due to the 1GHz frequency used by KVM. To ensure that the guest doesn't have a conflicting view of the APIC bus frequency, allow the userspace to tell KVM to use the same frequency that TDX mandates instead of the default 1Ghz. Reported-by: Vishal Annapurve <vannapurve@google.com> Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Make nanoseconds per APIC bus cycle a VM variableIsaku Yamahata1-0/+1
Introduce the VM variable "nanoseconds per APIC bus cycle" in preparation to make the APIC bus frequency configurable. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare for allowing userspace to tell KVM to use the frequency that TDX mandates instead of the default 1Ghz. Doing so ensures that the guest doesn't have a conflicting view of the APIC bus frequency. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> [reinette: rework changelog] Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03Merge branch 'kvm-6.11-sev-snp' into HEADPaolo Bonzini1-0/+30
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.
2024-06-03KVM: x86: Drop support for hand tuning APIC timer advancement from userspaceSean Christopherson1-10/+1
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: x86: Remove IA32_PERF_GLOBAL_OVF_CTRL from KVM_GET_MSR_INDEX_LISTJim Mattson1-1/+1
This MSR reads as 0, and any host-initiated writes are ignored, so there's no reason to enumerate it in KVM_GET_MSR_INDEX_LIST. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231113184854.2344416-1-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...Sean Christopherson1-23/+15
Add "struct kvm_host_values kvm_host" to hold the various host values that KVM snapshots during initialization. Bundling the host values into a single struct simplifies adding new MSRs and other features with host state/values that KVM cares about, and provides a one-stop shop. E.g. adding a new value requires one line, whereas tracking each value individual often requires three: declaration, definition, and export. No functional change intended. Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-12KVM: SEV: Implement gmem hook for initializing private pagesMichael Roth1-0/+5
This will handle the RMP table updates needed to put a page into a private state before mapping it into an SEV-SNP guest. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501085210.2213060-14-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12KVM: SEV: Support SEV-SNP AP Creation NAE eventTom Lendacky1-0/+11
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP guests to alter the register state of the APs on their own. This allows the guest a way of simulating INIT-SIPI. A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used so as to avoid updating the VMSA pointer while the vCPU is running. For CREATE The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID. The GPA is saved in the svm struct of the target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the vCPU and then the vCPU is kicked. For CREATE_ON_INIT: The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID the next time an INIT is performed. The GPA is saved in the svm struct of the target vCPU. For DESTROY: The guest indicates it wishes to stop the vCPU. The GPA is cleared from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to vCPU and then the vCPU is kicked. The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as a result of the event or as a result of an INIT. If a new VMSA is to be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_HALTED to prevent it from being run. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-13-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12KVM: SEV: Add support to handle RMP nested page faultsBrijesh Singh1-0/+1
When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters RMP check failure caused by the guest memory access it raises the #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-12-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12Merge branch 'kvm-coco-hooks' into HEADPaolo Bonzini1-0/+13
Common patches for the target-independent functionality and hooks that are needed by SEV-SNP and TDX.
2024-05-12Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-17/+11
KVM x86 misc changes for 6.10: - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is unused by hardware, so that KVM can communicate its inability to map GPAs that set bits 51:48 due to lack of 5-level paging. Guest firmware is expected to use the information to safely remap BARs in the uppermost GPA space, i.e to avoid placing a BAR at a legal, but unmappable, GPA. - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Don't completely ignore same-value writes to immutable feature MSRs, as doing so results in KVM failing to reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled KVM-wide to avoid confusing debuggers (KVM will never bother clearing the ABSENT inhibit, even if userspace enables in-kernel local APIC).
2024-05-10Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEADPaolo Bonzini1-2/+2
LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-05-10KVM: guest_memfd: Add hook for invalidating memoryMichael Roth1-0/+7
In some cases, like with SEV-SNP, guest memory needs to be updated in a platform-specific manner before it can be safely freed back to the host. Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to allow for special handling of this sort when freeing memory in response to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go ahead and define an arch-specific hook for x86 since it will be needed for handling memory used for SEV-SNP guests. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-6-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-10KVM: guest_memfd: Add hook for initializing memoryPaolo Bonzini1-0/+6
guest_memfd pages are generally expected to be in some arch-defined initial state prior to using them for guest memory. For SEV-SNP this initial state is 'private', or 'guest-owned', and requires additional operations to move these pages into a 'private' state by updating the corresponding entries the RMP table. Allow for an arch-defined hook to handle updates of this sort, and go ahead and implement one for x86 so KVM implementations like AMD SVM can register a kvm_x86_ops callback to handle these updates for SEV-SNP guests. The preparation callback is always called when allocating/grabbing folios via gmem, and it is up to the architecture to keep track of whether or not the pages are already in the expected state (e.g. the RMP table in the case of SEV-SNP). In some cases, it is necessary to defer the preparation of the pages to handle things like in-place encryption of initial guest memory payloads before marking these pages as 'private'/'guest-owned'. Add an argument (always true for now) to kvm_gmem_get_folio() that allows for the preparation callback to be bypassed. To detect possible issues in the way userspace initializes memory, it is only possible to add an unprepared page if it is not already included in the filemap. Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/ Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Explicitly zero kvm_caps during vendor module loadSean Christopherson1-0/+7
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_mce_cap on vendor module loadSean Christopherson1-3/+2
Effectively reset supported_mce_cap on vendor module load to ensure that capabilities aren't unintentionally preserved across module reload, e.g. if kvm-intel.ko added a module param to control LMCE support, or if someone somehow managed to load a vendor module that doesn't support LMCE after loading and unloading kvm-intel.ko. Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have a module param for LMCE, and there is no system in the world that supports both kvm-intel.ko and kvm-amd.ko. Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_vm_types on vendor module loadSean Christopherson1-1/+2
Recompute the entire set of supported VM types when a vendor module is loaded, as preserving supported_vm_types across vendor module unload and reload can result in VM types being incorrectly treated as supported. E.g. if a vendor module is loaded with TDP enabled, unloaded, and then reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly retained. Ditto for SEV_VM and SEV_ES_VM and their respective module params in kvm-amd.ko. Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-02KVM: x86: Only set APICV_INHIBIT_REASON_ABSENT if APICv is enabledAlejandro Jimenez1-7/+4
Use the APICv enablement status to determine if APICV_INHIBIT_REASON_ABSENT needs to be set, instead of unconditionally setting the reason during initialization. Specifically, in cases where AVIC is disabled via module parameter or lack of hardware support, unconditionally setting an inhibit reason due to the absence of an in-kernel local APIC can lead to a scenario where the reason incorrectly remains set after a local APIC has been created by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. This is because the helpers in charge of removing the inhibit return early if enable_apicv is not true, and therefore the bit remains set. This leads to confusion as to the cause why APICv is not active, since an incorrect reason will be reported by tracepoints and/or a debugging tool that examines the currently set inhibit reasons. Fixes: ef8b4b720368 ("KVM: ensure APICv is considered inactive if there is no APIC") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240418021823.1275276-2-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-02KVM: x86: Allow, don't ignore, same-value writes to immutable MSRsSean Christopherson1-7/+4
When handling userspace writes to immutable feature MSRs for a vCPU that has already run, fall through into the normal code to set the MSR instead of immediately returning '0'. I.e. allow such writes, instead of ignoring such writes. This fixes a bug where KVM incorrectly allows writes to the VMX MSRs that enumerate which CR{0,4} can be set, but only if the vCPU has already run. The intent of returning '0' and thus ignoring the write, was to avoid any side effects, e.g. refreshing the PMU and thus doing weird things with perf events while the vCPU is running. That approach sounds nice in theory, but in practice it makes it all but impossible to maintain a sane ABI, e.g. all VMX MSRs return -EBUSY if the CPU is post-VMXON, and the VMX MSRs for fixed-1 CR bits are never writable, etc. As for refreshing the PMU, kvm_set_msr_common() explicitly skips the PMU refresh if MSR_IA32_PERF_CAPABILITIES is being written with the current value, specifically to avoid unwanted side effects. And if necessary, adding similar logic for other MSRs is not difficult. Fixes: 0094f62c7eaa ("KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN") Reported-by: Jim Mattson <jmattson@google.com> Cc: Raghavendra Rao Ananta <rananta@google.com> Link: https://lore.kernel.org/r/20240408231500.1388122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull kvm fixes from Paolo Bonzini: "This is a bit on the large side, mostly due to two changes: - Changes to disable some broken PMU virtualization (see below for details under "x86 PMU") - Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. Everything else is small bugfixes and selftest changes: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: - Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. - Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks perf/x86/intel: Expose existence of callback support to KVM KVM: VMX: Snapshot LBR capabilities during module initialization KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding KVM: SVM: Remove a useless zeroing of allocated memory ...
2024-04-12KVM: x86: Split core of hypercall emulation to helper functionSean Christopherson1-18/+38
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>