path: root/arch/x86/kvm
Age  Commit message  (Author, Files, Lines)
2022-09-30  KVM: x86: Hide IA32_PLATFORM_DCA_CAP[31:0] from the guest  (Jim Mattson, 1 file, -2/+0)
The only thing reported by CPUID.9 is the value of IA32_PLATFORM_DCA_CAP[31:0] in EAX. This MSR doesn't even exist in the guest, since CPUID.1:ECX.DCA[bit 18] is clear in the guest. Clear CPUID.9 in KVM_GET_SUPPORTED_CPUID. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220922231854.249383-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22  KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled  (Sean Christopherson, 2 files, -0/+4)
Inject #UD when emulating XSETBV if CR4.OSXSAVE is not set. This also covers the "XSAVE not supported" check, as setting CR4.OSXSAVE=1 #GPs if XSAVE is not supported (and userspace gets to keep the pieces if it forces incoherent vCPU state). Add a comment to kvm_emulate_xsetbv() to call out that the CPU checks CR4.OSXSAVE before checking for intercepts. AMD's APM implies that #UD has priority (says that intercepts are checked before #GP exceptions), while Intel's SDM says nothing about interception priority. However, testing on hardware shows that both AMD and Intel CPUs prioritize the #UD over interception. Fixes: 02d4160fbd76 ("x86: KVM: add xsetbv to the emulator") Cc: stable@vger.kernel.org Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220824033057.3576315-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22  KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES  (Dr. David Alan Gilbert, 1 file, -1/+7)
Allow FP and SSE state to be saved and restored via KVM_{G,S}ET_XSAVE on XSAVE-capable hosts even if their bits are not exposed to the guest via XCR0. Failing to allow FP+SSE first showed up as a QEMU live migration failure, where migrating a VM from a pre-XSAVE host, e.g. Nehalem, to an XSAVE host failed due to KVM rejecting KVM_SET_XSAVE. However, the bug also causes problems even when migrating between XSAVE-capable hosts, as KVM_GET_XSAVE won't set any bits in user_xfeatures if XSAVE isn't exposed to the guest, i.e. KVM will fail to actually migrate FP+SSE. Because KVM_{G,S}ET_XSAVE are designed to allow migrating between hosts with and without XSAVE, KVM_GET_XSAVE on a non-XSAVE host (by way of fpu_copy_guest_fpstate_to_uabi()) always sets the FP+SSE bits in the header so that KVM_SET_XSAVE will work even if the new host supports XSAVE. Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") bz: https://bugzilla.redhat.com/show_bug.cgi?id=2079311 Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> [sean: add comment, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220824033057.3576315-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22  KVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0  (Sean Christopherson, 2 files, -10/+4)
Reinstate the per-vCPU guest_supported_xcr0 by partially reverting commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is always the same as guest_fpu.fpstate->user_xfeatures was incorrect. kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures, as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset(). guest_supported_xcr0 on the other hand is zero-allocated. If userspace never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the allowed user XFEATURES will be non-zero. Practically speaking, the edge case likely doesn't matter as no sane userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The primary motivation is to prepare for KVM intentionally and explicitly setting bits in user_xfeatures that are not set in guest_supported_xcr0. Because KVM_{G,S}ET_XSAVE can be used to save/restore FP+SSE state even if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0 isn't exposed to the guest. At that point, the simplest fix is to track the two things separately (allowed save/restore vs. allowed XCR0). Fixes: 988896bb6182 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0") Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220824033057.3576315-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
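[Editorial aside] To illustrate the split these two changelogs describe, here is a compressed user-space sketch. The field names mirror the changelog; the constants and the function name are invented for illustration and are not KVM code:

    #include <stdint.h>

    #define XFEATURE_MASK_FP   (1ULL << 0)
    #define XFEATURE_MASK_SSE  (1ULL << 1)

    struct vcpu_xstate_sketch {
            uint64_t guest_supported_xcr0;   /* what the guest may set in XCR0 */
            uint64_t user_xfeatures;         /* what KVM_{G,S}ET_XSAVE may save/restore */
    };

    /* Hypothetical stand-in for kvm_vcpu_after_set_cpuid(): both masks derive
     * from the same CPUID data but are deliberately not identical. */
    static void after_set_cpuid(struct vcpu_xstate_sketch *v, uint64_t cpuid_xcr0)
    {
            v->guest_supported_xcr0 = cpuid_xcr0;
            /* FP+SSE must always be save/restorable, even when XSAVE/XCR0 is
             * hidden from the guest, so user_xfeatures is a superset. */
            v->user_xfeatures = cpuid_xcr0 | XFEATURE_MASK_FP | XFEATURE_MASK_SSE;
    }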
2022-09-22  KVM: x86/mmu: add missing update to max_mmu_rmap_size  (Miaohe Lin, 1 file, -0/+2)
The update to the max_mmu_rmap_size statistic was unintentionally removed by commit 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte"). Add the missing update, otherwise max_mmu_rmap_size will always be a nonsensical 0. Fixes: 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <20220907080657.42898-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01  Merge tag 'kvm-s390-master-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD  (Paolo Bonzini, 1 file, -1/+1)
PCI interpretation compile fixes
2022-09-01  KVM: x86: check validity of argument to KVM_SET_MP_STATE  (Paolo Bonzini, 1 file, -3/+17)
An invalid argument to KVM_SET_MP_STATE has no effect other than making the vCPU fail to run at the next KVM_RUN. Since it is extremely unlikely that any userspace is relying on it, fail with -EINVAL just like for other architectures. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01  KVM: x86: fix memory leak in kvm_arch_vcpu_create()  (Miaohe Lin, 1 file, -2/+1)
When allocating memory for mci_ctl2_banks fails, KVM doesn't release mce_banks, leading to a memory leak. Fix this issue by calling kfree() on it when kcalloc() fails. Fixes: 281b52780b57 ("KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs.") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <20220901122300.22298-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01  KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES  (Jim Mattson, 1 file, -4/+21)
KVM should not claim to virtualize unknown IA32_ARCH_CAPABILITIES bits. When kvm_get_arch_capabilities() was originally written, there were only a few bits defined in this MSR, and KVM could virtualize all of them. However, over the years, several bits have been defined that KVM cannot just blindly pass through to the guest without additional work (such as virtualizing an MSR promised by the IA32_ARCH_CAPABILITIES feature bit). Define a mask of supported IA32_ARCH_CAPABILITIES bits, and mask off any other bits that are set in the hardware MSR. Cc: Paolo Bonzini <pbonzini@redhat.com> Fixes: 5b76a3cff011 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20220830174947.2182144-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
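[Editorial aside] A rough sketch of the masking approach described above. The bit names below are a real subset of IA32_ARCH_CAPABILITIES bits, but KVM_SUPPORTED_ARCH_CAP here is illustrative and not KVM's actual whitelist:

    #include <stdint.h>

    #define ARCH_CAP_RDCL_NO               (1ULL << 0)
    #define ARCH_CAP_IBRS_ALL              (1ULL << 1)
    #define ARCH_CAP_SKIP_L1DFL_VMENTRY    (1ULL << 3)

    /* Everything this sketch "knows how to virtualize"; unknown bits are never
     * passed through to the guest. */
    #define KVM_SUPPORTED_ARCH_CAP \
            (ARCH_CAP_RDCL_NO | ARCH_CAP_IBRS_ALL | ARCH_CAP_SKIP_L1DFL_VMENTRY)

    static uint64_t guest_arch_capabilities(uint64_t host_msr_value)
    {
            return host_msr_value & KVM_SUPPORTED_ARCH_CAP;
    }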
2022-08-21  asm goto: eradicate CC_HAS_ASM_GOTO  (Nick Desaulniers, 1 file, -1/+1)
GCC has supported asm goto since 4.5, and Clang has since version 9.0.0. The minimum supported versions of these tools for the build according to Documentation/process/changes.rst are 5.1 and 11.0.0 respectively. Remove the feature detection script, Kconfig option, and clean up some fallback code that is no longer supported. The removed script was also testing for a GCC specific bug that was fixed in the 4.7 release. Also remove workarounds for bpftrace using clang older than 9.0.0, since other BPF backend fixes are required at this point. Link: https://lore.kernel.org/lkml/CAK7LNATSr=BXKfkdW8f-H5VT_w=xBpT2ZQcZ7rm6JfkdE+QnmA@mail.gmail.com/ Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637 Acked-by: Borislav Petkov <bp@suse.de> Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-19  KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()  (Jim Mattson, 1 file, -2/+1)
Regardless of the 'msr' argument passed to the VMX version of msr_write_intercepted(), the function always checks to see if a specific MSR (IA32_SPEC_CTRL) is intercepted for write. This behavior seems unintentional and unexpected. Modify the function so that it checks to see if the provided 'msr' index is intercepted for write. Fixes: 67f4b9969c30 ("KVM: nVMX: Handle dynamic MSR intercept toggling") Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220810213050.2655000-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  kvm: x86: mmu: Always flush TLBs when enabling dirty logging  (Junaid Shahid, 3 files, -42/+61)
When A/D bits are not available, KVM uses a software access tracking mechanism, which involves making the SPTEs inaccessible. However, the clear_young() MMU notifier does not flush TLBs. So it is possible that there may still be stale, potentially writable, TLB entries. This is usually fine, but can be problematic when enabling dirty logging, because it currently only does a TLB flush if any SPTEs were modified. But if all SPTEs are in access-tracked state, then there won't be a TLB flush, which means that the guest could still possibly write to memory and not have it reflected in the dirty bitmap. So just unconditionally flush the TLBs when enabling dirty logging. As an alternative, KVM could explicitly check the MMU-Writable bit when write-protecting SPTEs to decide if a flush is needed (instead of checking the Writable bit), but given that a flush almost always happens anyway, just making it unconditional seems simpler. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20220810224939.2611160-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  kvm: x86: mmu: Drop the need_remote_flush() function  (Junaid Shahid, 1 file, -14/+1)
This is only used by kvm_mmu_pte_write(), which no longer actually creates the new SPTE and instead just clears the old SPTE. So we just need to check if the old SPTE was shadow-present instead of calling need_remote_flush(). Hence we can drop this function. It was incomplete anyway as it didn't take access-tracking into account. This patch should not result in any functional change. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220723024316.2725328-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  x86/kvm: Fix "missing ENDBR" BUG for fastop functions  (Josh Poimboeuf, 1 file, -1/+2)
The following BUG was reported: traps: Missing ENDBR: andw_ax_dx+0x0/0x10 [kvm] ------------[ cut here ]------------ kernel BUG at arch/x86/kernel/traps.c:253! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI <TASK> asm_exc_control_protection+0x2b/0x30 RIP: 0010:andw_ax_dx+0x0/0x10 [kvm] Code: c3 cc cc cc cc 0f 1f 44 00 00 66 0f 1f 00 48 19 d0 c3 cc cc cc cc 0f 1f 40 00 f3 0f 1e fa 20 d0 c3 cc cc cc cc 0f 1f 44 00 00 <66> 0f 1f 00 66 21 d0 c3 cc cc cc cc 0f 1f 40 00 66 0f 1f 00 21 d0 ? andb_al_dl+0x10/0x10 [kvm] ? fastop+0x5d/0xa0 [kvm] x86_emulate_insn+0x822/0x1060 [kvm] x86_emulate_instruction+0x46f/0x750 [kvm] complete_emulated_mmio+0x216/0x2c0 [kvm] kvm_arch_vcpu_ioctl_run+0x604/0x650 [kvm] kvm_vcpu_ioctl+0x2f4/0x6b0 [kvm] ? wake_up_q+0xa0/0xa0 The BUG occurred because the ENDBR in the andw_ax_dx() fastop function had been incorrectly "sealed" (converted to a NOP) by apply_ibt_endbr(). Objtool marked it to be sealed because KVM has no compile-time references to the function. Instead KVM calculates its address at runtime. Prevent objtool from annotating fastop functions as sealable by creating throwaway dummy compile-time references to the functions. Fixes: 6649fa876da4 ("x86/ibt,kvm: Add ENDBR to fastops") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Debugged-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Message-Id: <0d4116f90e9d0c1b754bb90c585e6f0415a1c508.1660837839.git.jpoimboe@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  x86/kvm: Simplify FOP_SETCC()  (Josh Poimboeuf, 1 file, -19/+4)
SETCC_ALIGN and FOP_ALIGN are both 16. Remove the special casing for FOP_SETCC() and just make it a normal fastop. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Message-Id: <7c13d94d1a775156f7e36eed30509b274a229140.1660837839.git.jpoimboe@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19  KVM: Rename mmu_notifier_* to mmu_invalidate_*  (Chao Peng, 2 files, -9/+9)
The motivation of this renaming is to make these variables and related helper functions less mmu_notifier-bound, so that they can also be used for non-mmu_notifier-based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for. - mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end. - mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}. - mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count. - While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count. No functional change intended. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-11  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 12 files, -49/+124)
Pull more kvm updates from Paolo Bonzini: - Xen timer fixes - Documentation formatting fixes - Make rseq selftest compatible with glibc-2.35 - Fix handling of illegal LEA reg, reg - Cleanup creation of debugfs entries - Fix steal time cache handling bug - Fixes for MMIO caching - Optimize computation of number of LBRs - Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits) KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test KVM: selftests: Make rseq compatible with glibc-2.35 KVM: Actually create debugfs in kvm_create_vm() KVM: Pass the name of the VM fd to kvm_create_vm_debugfs() KVM: Get an fd before creating the VM KVM: Shove vcpu stats_id init into kvm_vcpu_init() KVM: Shove vm stats_id init into kvm_create_vm() KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen KVM: x86/mmu: rename trace function name for asynchronous page fault KVM: x86/xen: Stop Xen timer before changing IRQ KVM: x86/xen: Initialize Xen timer only once KVM: SVM: Disable SEV-ES support if MMIO caching is disable KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change KVM: x86: Tag kvm_mmu_x86_module_init() with __init ...
2022-08-10  KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh  (Sean Christopherson, 2 files, -11/+8)
Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written by host userspace, zero out the number of LBR records for a vCPU during PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of handling the check at run-time. guest_cpuid_has() is expensive due to the linear search of guest CPUID entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_ simply enumerating the same "Model" as the host causes KVM to set the number of LBR records to a non-zero value. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers  (Sean Christopherson, 1 file, -9/+17)
Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order to provide type safety, to document exactly what they return, and to allow consuming the helpers in vmx.h. Move the definitions as necessary (the macros "reference" to_vmx() before its definition). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES  (Sean Christopherson, 1 file, -2/+2)
Refresh the PMU if userspace modifies MSR_IA32_PERF_CAPABILITIES. KVM consumes the vCPU's PERF_CAPABILITIES when enumerating PEBS support, but relies on CPUID updates to refresh the PMU. I.e. KVM will do the wrong thing if userspace stuffs PERF_CAPABILITIES _after_ setting guest CPUID. Opportunistically fix a curly-brace indentation. Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Cc: Like Xu <like.xu.linux@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen  (Sean Christopherson, 2 files, -0/+22)
Add compile-time and init-time sanity checks to ensure that the MMIO SPTE mask doesn't overlap the MMIO SPTE generation or the MMU-present bit. The generation currently avoids using bit 63, but that's as much coincidence as it is strictly necessary. That will change in the future, as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask. Explicitly carve out the bits that are allowed in the mask so that any future shuffling of SPTE bits doesn't silently break MMIO caching (KVM has broken MMIO caching more than once due to overlapping the generation with other things). Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220805194133.86299-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
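[Editorial aside] The compile-time half of such a check boils down to asserting that the masks are disjoint. A stand-alone sketch with made-up mask values (the real layout lives in arch/x86/kvm/mmu/spte.h):

    #include <assert.h>
    #include <stdint.h>

    /* Illustrative placeholders, not KVM's real SPTE layout. */
    #define MMIO_SPTE_GEN_MASK    (0xffULL << 3)   /* bits carrying the MMIO generation */
    #define SPTE_MMU_PRESENT_BIT  (1ULL << 11)
    #define MMIO_SPTE_VALUE_MASK  (0x3ULL << 52)   /* bits the MMIO value may use */

    /* Fail the build if the MMIO value could clobber the generation or the
     * MMU-present bit, i.e. the class of bug the sanity check guards against. */
    static_assert((MMIO_SPTE_VALUE_MASK & (MMIO_SPTE_GEN_MASK | SPTE_MMU_PRESENT_BIT)) == 0,
                  "MMIO SPTE value overlaps generation/MMU-present bits");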
2022-08-10  KVM: x86/mmu: rename trace function name for asynchronous page fault  (Mingwei Zhang, 1 file, -1/+1)
Rename the tracepoint function from trace_kvm_async_pf_doublefault() to trace_kvm_async_pf_repeated_fault() to make it clear, since double fault has nothing to do with this trace function. Asynchronous Page Fault (APF) is an artifact generated by KVM when it cannot find a physical page to satisfy an EPT violation. KVM uses APF to tell the guest OS to do something else such as scheduling other guest processes to make forward progress. However, when another guest process also touches a previously APFed page, KVM halts the vCPU instead of generating a repeated APF to avoid wasting cycles. Double fault (#DF) clearly has a different meaning and a different consequence when triggered. #DF requires two nested contributory exceptions instead of two page faults faulting at the same address. A previous bug on APF indicates that it may trigger a double fault in the guest [1] and clearly this trace function has nothing to do with it. So renaming this function is a valid choice. No functional change intended. [1] https://www.spinics.net/lists/kvm/msg214957.html Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220807052141.69186-1-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86/xen: Stop Xen timer before changing IRQ  (Coleman Dietsch, 1 file, -18/+17)
Stop the Xen timer (if it's running) prior to changing the IRQ vector and potentially (re)starting the timer. Changing the IRQ vector while the timer is still running can result in KVM injecting a garbage event, e.g. kvm_xen_inject_timer_irqs() could see a non-zero xen.timer_pending from a previous timer but inject the new xen.timer_virq. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Cc: stable@vger.kernel.org Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42 Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com Signed-off-by: Coleman Dietsch <dietschc@csp.edu> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220808190607.323899-3-dietschc@csp.edu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86/xen: Initialize Xen timer only once  (Coleman Dietsch, 1 file, -1/+3)
Add a check for existing xen timers before initializing a new one. Currently kvm_xen_init_timer() is called on every KVM_XEN_VCPU_ATTR_TYPE_TIMER, which is causing the following ODEBUG crash when vcpu->arch.xen.timer is already set. ODEBUG: init active (active state 0) object type: hrtimer hint: xen_timer_callbac0 RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:502 Call Trace: __debug_object_init debug_hrtimer_init debug_init hrtimer_init kvm_xen_init_timer kvm_xen_vcpu_set_attr kvm_arch_vcpu_ioctl kvm_vcpu_ioctl vfs_ioctl Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Cc: stable@vger.kernel.org Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42 Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com Signed-off-by: Coleman Dietsch <dietschc@csp.edu> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220808190607.323899-2-dietschc@csp.edu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: SVM: Disable SEV-ES support if MMIO caching is disabled  (Sean Christopherson, 5 files, -5/+19)
Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs generating #NPF(RSVD), which are reflected by the CPU into the guest as a #VC. With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access to the guest instruction stream or register state and so can't directly emulate in response to a #NPF on an emulated MMIO GPA. Disabling MMIO caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT), and those flavors of #NPF cause automatic VM-Exits, not #VC. Adjust KVM's MMIO masks to account for the C-bit location prior to doing SEV(-ES) setup, and document that dependency between adjusting the MMIO SPTE mask and SEV(-ES) setup. Fixes: b09763da4dd8 ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)") Reported-by: Michael Roth <michael.roth@amd.com> Tested-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220803224957.1285926-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change  (Sean Christopherson, 3 files, -0/+24)
Fully re-evaluate whether or not MMIO caching can be enabled when SPTE masks change; simply clearing enable_mmio_caching when a configuration isn't compatible with caching fails to handle the scenario where the masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit location, and toggle compatibility from false=>true. Snapshot the original module param so that re-evaluating MMIO caching preserves userspace's desire to allow caching. Use a snapshot approach so that enable_mmio_caching still reflects KVM's actual behavior. Fixes: 8b9e74bfbf8c ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled") Reported-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Tested-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220803224957.1285926-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
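[Editorial aside] A minimal sketch of the snapshot pattern described above; the variable and function names are invented for illustration, not KVM's:

    #include <stdbool.h>

    static bool enable_mmio_caching = true;   /* working copy, what the code acts on */
    static bool mmio_caching_requested;       /* snapshot of the original module param */

    static void snapshot_module_param(void)
    {
            mmio_caching_requested = enable_mmio_caching;   /* taken once at load time */
    }

    /* Called whenever the SPTE masks change (EPT setup, SEV C-bit, ...). */
    static void reevaluate_mmio_caching(bool masks_allow_caching)
    {
            /* Derive the effective value from the original request, not from the
             * possibly already-cleared working copy, so that compatibility toggling
             * from false to true can re-enable caching. */
            enable_mmio_caching = mmio_caching_requested && masks_allow_caching;
    }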
2022-08-10  KVM: x86: Tag kvm_mmu_x86_module_init() with __init  (Sean Christopherson, 1 file, -1/+1)
Mark kvm_mmu_x86_module_init() with __init, the entire reason it exists is to initialize variables when kvm.ko is loaded, i.e. it must never be called after module initialization. Fixes: 1d0e84806047 ("KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220803224957.1285926-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86: emulator: Fix illegal LEA handling  (Michal Luczaj, 1 file, -1/+5)
The emulator mishandles LEA with a register source operand. Even though such an LEA is illegal, it can be encoded and fed to the CPU, in which case real hardware throws #UD. The emulator, instead, returns the address of x86_emulate_ctxt._regs. This info leak hurts the host's kASLR. Tell the decoder that illegal LEA is not to be emulated. Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20220729134801.1120-1-mhal@rbox.co> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: X86: avoid uninitialized 'fault.async_page_fault' from fixed-up #PF  (Yu Zhang, 1 file, -0/+1)
kvm_fixup_and_inject_pf_error() was introduced to fix up the error code (e.g., to add the RSVD flag) and inject the #PF to the guest, when guest MAXPHYADDR is smaller than the host one. When it comes to nested, L0 is expected to intercept and fix up the #PF and then inject to L2 directly if - L2.MAXPHYADDR < L0.MAXPHYADDR and - L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the same MAXPHYADDR value && L1 is using EPT for L2), instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK and PFEC_MATCH both set to 0 in vmcs02, the interception and injection may happen on all L2 #PFs. However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error() may leave fault.async_page_fault not zeroed, and later the #PF is treated as a nested async page fault and injected to L1. Instead of zeroing 'fault' at the beginning of this function, we manually set the value of 'fault.async_page_fault', because false is the value we really expect. Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178 Reported-by: Yang Lixiao <lixiao.yang@intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220718074756.53788-1-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86: Bug the VM if an accelerated x2APIC trap occurs on a "bad" reg  (Sean Christopherson, 1 file, -3/+5)
Bug the VM if retrieving the x2APIC MSR/register while processing an accelerated vAPIC trap VM-Exit fails. In theory it's impossible for the lookup to fail as hardware has already validated the register, but bugs happen, and not checking the result of kvm_lapic_msr_read() would result in consuming the uninitialized "val" if a KVM or hardware bug occurs. Fixes: 1bd9dfec9fd4 ("KVM: x86: Do not block APIC write for non ICR registers") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220804235028.1766253-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86: do not report preemption if the steal time cache is stale  (Paolo Bonzini, 1 file, -0/+2)
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status", 2021-11-11) open coded the previous call to kvm_map_gfn, but in doing so it dropped the comparison between the cached guest physical address and the one in the MSR. This causes an incorrect cache hit if the guest modifies the steal time address while the memslots remain the same. This can happen with kexec, in which case the preempted bit is written at the address used by the old kernel instead of the new one. Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10  KVM: x86: revalidate steal time cache if MSR value changes  (Paolo Bonzini, 1 file, -3/+3)
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status", 2021-11-11) open coded the previous call to kvm_map_gfn, but in doing so it dropped the comparison between the cached guest physical address and the one in the MSR. This causes an incorrect cache hit if the guest modifies the steal time address while the memslots remain the same. This can happen with kexec, in which case the steal time data is written at the address used by the old kernel instead of the new one. While at it, rename the variable from gfn to gpa since it is a plain physical address and not a right-shifted one. Reported-by: Dave Young <ruyang@redhat.com> Reported-by: Xiaoying Yan <yiyan@redhat.com> Analyzed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
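[Editorial aside] The essence of both steal-time fixes above is re-adding the dropped comparison. A compressed sketch follows; the struct and field names are illustrative, not KVM's actual gfn_to_hva_cache:

    #include <stdbool.h>
    #include <stdint.h>

    struct steal_time_cache_sketch {
            uint64_t gpa;          /* guest physical address the cache was built for */
            uint64_t generation;   /* memslot generation when the cache was built */
    };

    static bool steal_time_cache_valid(const struct steal_time_cache_sketch *cache,
                                       uint64_t msr_gpa, uint64_t slots_generation)
    {
            /* Both a memslot change and a guest-visible address change (e.g. the
             * MSR being rewritten after kexec) must invalidate the cache; checking
             * only the generation is what produced the stale-cache bug above. */
            return cache->generation == slots_generation && cache->gpa == msr_gpa;
    }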
2022-08-09  Merge tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -3/+5)
Pull x86 eIBRS fixes from Borislav Petkov: "More from the CPU vulnerability nightmares front: Intel eIBRS machines do not sufficiently mitigate against RET mispredictions when doing a VM Exit therefore an additional RSB, one-entry stuffing is needed" * tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Add LFENCE to RSB fill sequence x86/speculation: Add RSB VM Exit protections
2022-08-06  Merge tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu  (Linus Torvalds, 1 file, -1/+0)
Pull iommu updates from Joerg Roedel: - The most intrusive patch is small and changes the default allocation policy for DMA addresses. Before the change the allocator tried its best to find an address in the first 4GB. But that led to performance problems when that space got exhausted, and since most devices are capable of 64-bit DMA these days, we changed it to search in the full DMA-mask range from the beginning. This change has the potential to uncover bugs elsewhere, in the kernel or the hardware. There is a Kconfig option and a command line option to restore the old behavior, but none of them is enabled by default. - Add Robin Murphy as reviewer of IOMMU code and maintainer for the dma-iommu and iova code - Changing IOVA magazine size from 1032 to 1024 bytes to save memory - Some core code cleanups and dead-code removal - Support for ACPI IORT RMR node - Support for multiple PCI domains in the AMD-Vi driver - ARM SMMU changes from Will Deacon: - Add even more Qualcomm device-tree compatible strings - Support dumping of IMP DEF Qualcomm registers on TLB sync timeout - Fix reference count leak on device tree node in Qualcomm driver - Intel VT-d driver updates from Lu Baolu: - Make intel-iommu.h private - Optimize the use of two locks - Extend the driver to support large-scale platforms - Cleanup some dead code - MediaTek IOMMU refactoring and support for TTBR up to 35bit - Basic support for Exynos SysMMU v7 - VirtIO IOMMU driver gets a map/unmap_pages() implementation - Other smaller cleanups and fixes * tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (116 commits) iommu/amd: Fix compile warning in init code iommu/amd: Add support for AVIC when SNP is enabled iommu/amd: Simplify and Consolidate Virtual APIC (AVIC) Enablement ACPI/IORT: Fix build error implicit-function-declaration drivers: iommu: fix clang -wformat warning iommu/arm-smmu: qcom_iommu: Add of_node_put() when breaking out of loop iommu/arm-smmu-qcom: Add SM6375 SMMU compatible dt-bindings: arm-smmu: Add compatible for Qualcomm SM6375 MAINTAINERS: Add Robin Murphy as IOMMU SUBSYTEM reviewer iommu/amd: Do not support IOMMUv2 APIs when SNP is enabled iommu/amd: Do not support IOMMU_DOMAIN_IDENTITY after SNP is enabled iommu/amd: Set translation valid bit only when IO page tables are in use iommu/amd: Introduce function to check and enable SNP iommu/amd: Globally detect SNP support iommu/amd: Process all IVHDs before enabling IOMMU features iommu/amd: Introduce global variable for storing common EFR and EFR2 iommu/amd: Introduce Support for Extended Feature 2 Register iommu/amd: Change macro for IOMMU control register bit shift to decimal value iommu/exynos: Enable default VM instance on SysMMU v7 iommu/exynos: Add SysMMU v7 register set ...
2022-08-05  Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds, 1 file, -1/+1)
Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-03  x86/speculation: Add RSB VM Exit protections  (Daniel Sneddon, 1 file, -3/+5)
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE. == Background == Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change. To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests. == Problem == Here's a simplification of how guests are run on Linux' KVM: void run_kvm_guest(void) { // Prepare to run guest VMRESUME(); // Clean up after guest runs } The execution flow for that would look something like this to the processor: 1. Host-side: call run_kvm_guest() 2. Host-side: VMRESUME 3. Guest runs, does "CALL guest_function" 4. VM exit, host runs again 5. Host might make some "cleanup" function calls 6. Host-side: RET from run_kvm_guest() Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code: * on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing. * on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand". IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction. However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address. Balanced CALL/RET instruction pairs such as in step #5 are not affected. == Solution == The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly. However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation. Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT. The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE. In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. Close the window with the LFENCE. 
There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO. [ bp: Massage, incorporate review comments from Andy Cooper. ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
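[Editorial aside] For readers trying to picture the lighter-weight sequence, here is a stand-alone x86-64 GCC/Clang approximation; it is only a sketch of the idea, not the kernel's actual RSB-stuffing macro (the real code lives in arch/x86/include/asm/nospec-branch.h):

    /* One-entry RSB stuff: the CALL pushes a return address whose location (the
     * INT3) is a speculation dead end, the ADD discards that return address so
     * real control flow stays balanced, and the LFENCE ensures the CALL retires
     * before any further instructions execute. */
    static inline void rsb_stuff_one_entry(void)
    {
            asm volatile("call 1f\n\t"
                         "int3\n\t"              /* never reached architecturally */
                         "1: add $8, %%rsp\n\t"  /* drop the pushed return address */
                         "lfence"
                         ::: "memory");
    }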
2022-08-01  KVM: x86/mmu: remove unused variable  (Paolo Bonzini, 1 file, -2/+0)
The last use of 'pfn' went away with the same-named argument to host_pfn_mapping_level; now that the hugepage level is obtained exclusively from the host page tables, kvm_mmu_zap_collapsible_spte does not need to know host pfns at all. Fixes: a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-01  Merge remote-tracking branch 'kvm/next' into kvm-next-5.20  (Paolo Bonzini, 44 files, -1440/+2994)
KVM/s390, KVM/x86 and common infrastructure changes for 5.20 x86: * Permit guests to ignore single-bit ECC errors * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit (detect microarchitectural hangs) for Intel * Cleanups for MCE MSR emulation s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to use TAP interface * enable interpretive execution of zPCI instructions (for PCI passthrough) * First part of deferred teardown * CPU Topology * PV attestation * Minor fixes Generic: * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple x86: * Use try_cmpxchg64 instead of cmpxchg64 * Bugfixes * Ignore benign host accesses to PMU MSRs when PMU is disabled * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis * Port eager page splitting to shadow MMU as well * Enable CMCI capability by default and handle injected UCNA errors * Expose pid of vcpu threads in debugfs * x2AVIC support for AMD * cleanup PIO emulation * Fixes for LLDT/LTR emulation * Don't require refcounted "struct page" to create huge SPTEs x86 cleanups: * Use separate namespaces for guest PTEs and shadow PTEs bitmasks * PIO emulation * Reorganize rmap API, mostly around rmap destruction * Do not workaround very old KVM bugs for L0 that runs with nesting enabled * new selftests API for CPUID
2022-07-29  Merge branches 'arm/exynos', 'arm/mediatek', 'arm/msm', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd' and 'core' into next  (Joerg Roedel, 1 file, -1/+0)
2022-07-28  KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()  (Kai Huang, 1 file, -7/+3)
Now kvm_tdp_mmu_zap_leafs() only zaps leaf SPTEs but not any non-root pages within that GFN range anymore, so the comment around it isn't right. Fix it by shifting the comment from tdp_mmu_zap_leafs() instead of duplicating it, as tdp_mmu_zap_leafs() is static and is only called by kvm_tdp_mmu_zap_leafs(). Opportunistically tweak the blurb about SPTEs being cleared to (a) say "zapped" instead of "cleared" because "cleared" will be wrong if/when KVM allows a non-zero value for non-present SPTE (i.e. for Intel TDX), and (b) to clarify that a flush is needed if and only if a SPTE has been zapped since MMU lock was last acquired. Fixes: f47e5bbbc92f ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap") Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220728030452.484261-1-kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog  (Jarkko Sakkinen, 1 file, -0/+3)
As Virtual Machine Save Area (VMSA) is essential in troubleshooting attestation, dump it to the klog with the KERN_DEBUG level of priority. Cc: Jarkko Sakkinen <jarkko@kernel.org> Suggested-by: Harald Hoyer <harald@profian.com> Signed-off-by: Jarkko Sakkinen <jarkko@profian.com> Message-Id: <20220728050919.24113-1-jarkko@profian.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT  (Sean Christopherson, 1 file, -1/+1)
Treat the NX bit as valid when using NPT, as KVM will set the NX bit when the NX huge page mitigation is enabled (mindblowing) and trigger the WARN that fires on reserved SPTE bits being set. KVM has required NX support for SVM since commit b26a71a1a5b9 ("KVM: SVM: Refuse to load kvm_amd if NX support is not available") for exactly this reason, but apparently it never occurred to anyone to actually test NPT with the mitigation enabled. ------------[ cut here ]------------ spte = 0x800000018a600ee7, level = 2, rsvd bits = 0x800f0000001fe000 WARNING: CPU: 152 PID: 15966 at arch/x86/kvm/mmu/spte.c:215 make_spte+0x327/0x340 [kvm] Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022 RIP: 0010:make_spte+0x327/0x340 [kvm] Call Trace: <TASK> tdp_mmu_map_handle_target_level+0xc3/0x230 [kvm] kvm_tdp_mmu_map+0x343/0x3b0 [kvm] direct_page_fault+0x1ae/0x2a0 [kvm] kvm_tdp_page_fault+0x7d/0x90 [kvm] kvm_mmu_page_fault+0xfb/0x2e0 [kvm] npf_interception+0x55/0x90 [kvm_amd] svm_invoke_exit_handler+0x31/0xf0 [kvm_amd] svm_handle_exit+0xf6/0x1d0 [kvm_amd] vcpu_enter_guest+0xb6d/0xee0 [kvm] ? kvm_pmu_trigger_event+0x6d/0x230 [kvm] vcpu_run+0x65/0x2c0 [kvm] kvm_arch_vcpu_ioctl_run+0x355/0x610 [kvm] kvm_vcpu_ioctl+0x551/0x610 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220723013029.1753623-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
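[Editorial aside] Conceptually, the WARN shown above is a reserved-bits test on freshly built SPTEs. A toy version, with invented masks (the real ones come from the MMU's shadow_zero_check):

    #include <stdint.h>
    #include <stdio.h>

    #define SPTE_NX_BIT  (1ULL << 63)   /* execute-disable; legitimately set under NPT/EPT */

    static void warn_on_rsvd_bits(uint64_t spte, uint64_t rsvd_mask, int nx_is_valid)
    {
            /* With NPT, the NX huge page mitigation legitimately sets bit 63, so it
             * must not be part of the reserved mask or every such SPTE would WARN. */
            if (nx_is_valid)
                    rsvd_mask &= ~SPTE_NX_BIT;
            if (spte & rsvd_mask)
                    fprintf(stderr, "spte = %#llx, rsvd bits = %#llx\n",
                            (unsigned long long)spte,
                            (unsigned long long)(spte & rsvd_mask));
    }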
2022-07-28  KVM: x86: Do not block APIC write for non ICR registers  (Suravee Suthikulpanit, 1 file, -11/+11)
Commit 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") introduced logic in kvm_apic_write_nodecode() to prevent APIC writes for offsets other than ICR. This breaks x2AVIC support, which requires KVM to trap and emulate x2APIC MSR writes. Therefore, remove the warning and modify the logic to allow the MSR write. Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") Cc: Zeng Guang <guang.zeng@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20220725053356.4275-1-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: SVM: Do not virtualize MSR accesses for APIC LVTT register  (Suravee Suthikulpanit, 1 file, -1/+8)
AMD does not support the APIC TSC-deadline timer mode. AVIC hardware will generate a #GP fault when the guest kernel writes 1 to bit 18 of the APIC LVTT register (offset 0x32) to set the timer mode. (Note: bit 18 is reserved on AMD systems.) Therefore, always intercept and let KVM emulate the MSR accesses. Fixes: f3d7c8aa6882 ("KVM: SVM: Fix x2APIC MSRs interception") Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20220725033428.3699-1-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nVMX: Set UMIP bit CR4_FIXED1 MSR when emulating UMIP  (Sean Christopherson, 1 file, -0/+3)
Make UMIP an "allowed-1" bit CR4_FIXED1 MSR when KVM is emulating UMIP. KVM emulates UMIP for both L1 and L2, and so should enumerate that L2 is allowed to have CR4.UMIP=1. Not setting the bit doesn't immediately break nVMX, as KVM does set/clear the bit in CR4_FIXED1 in response to a guest CPUID update, i.e. KVM will correctly (dis)allow nested VM-Entry based on whether or not UMIP is exposed to L1. That said, KVM should enumerate the bit as being allowed from time zero, e.g. userspace will see the wrong value if the MSR is read before CPUID is written. Fixes: 0367f205a3b7 ("KVM: vmx: add support for emulating UMIP") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"Paolo Bonzini3-27/+0
This reverts commit 03a8871add95213827e2bea84c12133ae5df952e. Since commit 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"), KVM has taken ownership of the "load IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear otherwise. This was a misguided attempt at mimicking what commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled", 2018-10-01) did for MPX. However, that commit was a workaround for another KVM bug and not something that should be imitated. Mucking with the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must ensure that any internal changes, e.g. to how KVM handles _any_ guest CPUID changes, yield the same functional result. Therefore, KVM's policy is to let userspace have full control of the guest vCPU model so long as the host kernel is not at risk. Now that KVM really truly ensures kvm_set_msr() will succeed by loading PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and roundabout behavior. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: make it a pure revert] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it exists  (Sean Christopherson, 1 file, -1/+3)
Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and only if the MSR exists (according to the guest vCPU model). KVM has very misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and attempts to force the nVMX MSR settings to match the vPMU model, i.e. to hide/expose the control based on whether or not the MSR exists from the guest's perspective. KVM's modifications fail to handle the scenario where the vPMU is hidden from the guest _after_ being exposed to the guest, e.g. by userspace doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any KVM_RUN. nested_vmx_pmu_refresh() is called if and only if there's a recognized vPMU, i.e. KVM will leave the bits in the allow state and then ultimately reject the MSR load and WARN. KVM should not force the VMX MSRs in the first place. KVM taking control of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled", 2018-10-01) did for MPX. However, the MPX commit was a workaround for another KVM bug and not something that should be imitated (and it should never have been done in the first place). In other words, KVM's ABI _should_ be that userspace has full control over the MSRs, at which point triggering the WARN that loading the MSR must not fail is trivial. The intent of the WARN is still valid; KVM has consistency checks to ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid. The problem is that '0' must be considered a valid value at all times, and so the simple/obvious solution is to just not actually load the MSR when it does not exist. It is userspace's responsibility to provide a sane vCPU model, i.e. KVM is well within its ABI and Intel's VMX architecture to skip the loads if the MSR does not exist. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL  (Sean Christopherson, 2 files, -2/+14)
Add a helper to check if the guest PMU has PERF_GLOBAL_CTRL, which is unintuitive _and_ diverges from Intel's architecturally defined behavior. Even worse, KVM currently implements the check using two different (but equivalent) checks, _and_ there has been at least one attempt to add a _third_ flavor. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28  KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMU  (Sean Christopherson, 1 file, -0/+2)
Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits as reserved if there is no guest vPMU. The nVMX VM-Entry consistency checks do not check for a valid vPMU prior to consuming the masks via kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the actual MSR load will be rejected by intel_is_valid_msr()). Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"Paolo Bonzini1-20/+1
Since commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"), KVM has taken ownership of the "load IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports MPX, and clear otherwise. The intent of the patch was to apply it to L0 in order to work around L1 kernels that lack the fix in commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST, and the L1 bug is neutralized even in the absence of commit 691bd4340bef. This was perhaps a sensible kludge at the time, but a horrible idea in the long term and in fact it has not been extended to other CPUID bits like these: X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE, VMX_MISC_SAVE_EFER_LMA X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING, SECONDARY_EXEC_TSC_SCALING X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA, VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES These days it's sort of common knowledge that any MSR in KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR to a default value, so it is unlikely that something like commit 5f76f6f5ff96 will be needed again. So revert it, at the potential cost of breaking L1s with a 6-year-old kernel. While in principle the L0 owner doesn't control what runs on L1, such an old hypervisor would probably have many other bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>