aboutsummaryrefslogtreecommitdiffstats
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2024-05-09KVM: selftests: arm64: Test vCPU-scoped feature ID registersOliver Upton1-1/+52
Test that CLIDR_EL1 and MPIDR_EL1 are modifiable from userspace and that the values are preserved across a vCPU reset like the other feature ID registers. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-8-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: selftests: arm64: Test that feature ID regs survive a resetOliver Upton1-8/+33
One of the expectations with feature ID registers is that their values survive a vCPU reset. Start testing that. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-7-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: selftests: arm64: Store expected register value in set_id_regsOliver Upton1-9/+18
Rather than comparing against what is returned by the ioctl, store expected values for the feature ID registers in a table and compare with that instead. This will prove useful for subsequent tests involving vCPU reset. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-6-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scopeOliver Upton1-2/+2
Prepare for a later change that'll cram in per-vCPU feature ID test cases by renaming the current test case. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-5-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: arm64: Only reset vCPU-scoped feature ID regs onceOliver Upton3-13/+26
The general expecation with feature ID registers is that they're 'reset' exactly once by KVM for the lifetime of a vCPU/VM, such that any userspace changes to the CPU features / identity are honored after a vCPU gets reset (e.g. PSCI_ON). KVM handles what it calls VM-scoped feature ID registers correctly, but feature ID registers local to a vCPU (CLIDR_EL1, MPIDR_EL1) get wiped after every reset. What's especially concerning is that a potentially-changing MPIDR_EL1 breaks MPIDR compression for indexing mpidr_data, as the mask of useful bits to build the index could change. This is absolutely no good. Avoid resetting vCPU feature ID registers more than once. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()Oliver Upton1-17/+10
A subsequent change to KVM will expand the range of feature ID registers that get special treatment at reset. Fold the existing ones back in to kvm_reset_sys_regs() to avoid the need for an additional table walk. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-09KVM: arm64: Rename is_id_reg() to imply VM scopeOliver Upton1-5/+6
The naming of some of the feature ID checks is ambiguous. Rephrase the is_id_reg() helper to make its purpose slightly clearer. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240502233529.1958459-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-08KVM: arm64: Destroy mpidr_data for 'late' vCPU creationOliver Upton1-9/+41
A particularly annoying userspace could create a vCPU after KVM has computed mpidr_data for the VM, either by racing against VGIC initialization or having a userspace irqchip. In any case, this means mpidr_data no longer fully describes the VM, and attempts to find the new vCPU with kvm_mpidr_to_vcpu() will fail. The fix is to discard mpidr_data altogether, as it is only a performance optimization and not required for correctness. In all likelihood KVM will recompute the mappings when KVM_RUN is called on the new vCPU. Note that reads of mpidr_data are not guarded by a lock; promote to RCU to cope with the possibility of mpidr_data being invalidated at runtime. Fixes: 54a8006d0b49 ("KVM: arm64: Fast-track kvm_mpidr_to_vcpu() when mpidr_data is available") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240508071952.2035422-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-08KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE supportWill Deacon1-1/+1
The early command line parsing treats "kvm-arm.mode=protected" as an alias for "id_aa64mmfr1.vh=0", forcing the use of nVHE so that the host kernel runs at EL1 with the pKVM hypervisor at EL2. With the introduction of hVHE support in ad744e8cb346 ("arm64: Allow arm64_sw.hvhe on command line"), the hypervisor can run using the EL2+0 translation regime. This is interesting for unusual CPUs that have VH stuck to 1, but also because it opens the possibility of a hypervisor "userspace" in the distant future which could be used to isolate vCPU contexts in the hypervisor (see Marc's talk from KVM Forum 2022 [1]). Repaint the "kvm-arm.mode=protected" alias to map to "arm64_sw.hvhe=1", which will use hVHE on CPUs that support it and remain with nVHE otherwise. [1] https://www.youtube.com/watch?v=1F_Mf2j9eIo Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240501163400.15838-3-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-08KVM: arm64: Fix hvhe/nvhe early alias parsingWill Deacon1-1/+1
Booting a kernel with "arm64_sw.hvhe=1 kvm-arm.mode=nvhe" on the command-line results in KVM initialising using hVHE, whereas one might expect the latter option to override the former. Fix this by adding "arm64_sw.hvhe=0" to the alias expansion for "kvm-arm.mode=nvhe". Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240501163400.15838-2-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-07KVM: SEV: Allow per-guest configuration of GHCB protocol versionMichael Roth4-6/+42
The GHCB protocol version may be different from one guest to the next. Add a field to track it for each KVM instance and extend KVM_SEV_INIT2 to allow it to be configured by userspace. Now that all SEV-ES support for GHCB protocol version 2 is in place, go ahead and default to it when creating SEV-ES guests through the new KVM_SEV_INIT2 interface. Keep the older KVM_SEV_ES_INIT interface restricted to GHCB protocol version 1. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add GHCB handling for termination requestsMichael Roth1-0/+9
GHCB version 2 adds support for a GHCB-based termination request that a guest can issue when it reaches an error state and wishes to inform the hypervisor that it should be terminated. Implement support for that similarly to GHCB MSR-based termination requests that are already available to SEV-ES guests via earlier versions of the GHCB protocol. See 'Termination Request' in the 'Invoking VMGEXIT' section of the GHCB specification for more details. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add GHCB handling for Hypervisor Feature Support requestsBrijesh Singh2-0/+16
Version 2 of the GHCB specification introduced advertisement of features that are supported by the Hypervisor. Now that KVM supports version 2 of the GHCB specification, bump the maximum supported protocol version. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-3-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add support to handle AP reset MSR protocolTom Lendacky3-10/+53
Add support for AP Reset Hold being invoked using the GHCB MSR protocol, available in version 2 of the GHCB specification. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-2-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Explicitly zero kvm_caps during vendor module loadSean Christopherson1-0/+7
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_mce_cap on vendor module loadSean Christopherson1-3/+2
Effectively reset supported_mce_cap on vendor module load to ensure that capabilities aren't unintentionally preserved across module reload, e.g. if kvm-intel.ko added a module param to control LMCE support, or if someone somehow managed to load a vendor module that doesn't support LMCE after loading and unloading kvm-intel.ko. Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have a module param for LMCE, and there is no system in the world that supports both kvm-intel.ko and kvm-amd.ko. Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_vm_types on vendor module loadSean Christopherson1-1/+2
Recompute the entire set of supported VM types when a vendor module is loaded, as preserving supported_vm_types across vendor module unload and reload can result in VM types being incorrectly treated as supported. E.g. if a vendor module is loaded with TDP enabled, unloaded, and then reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly retained. Ditto for SEV_VM and SEV_ES_VM and their respective module params in kvm-amd.ko. Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfnsSean Christopherson1-1/+1
WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully handle the unexpected behavior instead of continuing on with dangerous state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot, and so could install a bogus PFN into the guest. The existing code is functionally ok, because kvm_faultin_pfn() pre-checks all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set. And checking only fault->slot would fall apart if KVM ever added a flag or condition that forced emulation, similar to how KVM handles writes to read-only memslots. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error valuesSean Christopherson1-0/+3
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault() to harden KVM against using "uninitialized" values. In quotes because the fields are actually zero-initialized, and zero is a legal value for both page frame numbers and virtual addresses. E.g. failure to set "pfn" prior to creating an SPTE could result in KVM pointing at physical address '0', which is far less desirable than KVM generating a SPTE with reserved PA bits set and thus effectively killing the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faultsSean Christopherson1-0/+1
Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot" fault to ensure that KVM doesn't use a bogus virtual address, e.g. if there *was* a slot but it's unusable (APIC access page), or if there really was no slot, in which case fault->hva will be '0' (which is a legal address for x86). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()Sean Christopherson1-12/+17
Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just after the private versus shared check, so that there's no need to repeatedly query whether or not a slot exists. This also makes it more obvious that, except for private vs. shared attributes, the process of faulting in a pfn simply doesn't apply to gfns without a slot. Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so that it doesn't need to be duplicated in all paths that invoke kvm_handle_noslot_fault(), and to minimize the probability of not stuffing the right fields. Leave the existing handle behind, but convert it to a WARN, to guard against __kvm_faultin_pfn() unexpectedly nullifying fault->slot. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()Sean Christopherson1-43/+44
Move the checks related to the validity of an access to a memslot from the inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This allows emulating accesses to the APIC access page, which don't need to resolve a pfn, even if there is a relevant in-progress mmu_notifier invalidation. Ditto for accesses to KVM internal memslots from L2, which KVM also treats as emulated MMIO. More importantly, this will allow for future cleanup by having the "no memslot" case bail from kvm_faultin_pfn() very early on. Go to rather extreme and gross lengths to make the change a glorified nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the related code is very subtle. E.g. fault->slot can be nullified if it points at the APIC access page, some flows in KVM x86 expect fault->pfn to be KVM_PFN_NOSLOT, while others check only fault->slot, etc. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIOSean Christopherson1-0/+5
Explicitly detect and disallow private accesses to emulated MMIO in kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private() to perform the check. This will allow the page fault path to go straight to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC internal slotsSean Christopherson1-4/+13
Allow mapping KVM's internal memslots used for EPT without unrestricted guest into L2, i.e. allow mapping the hidden TSS and the identity mapped page tables into L2. Unlike the APIC access page, there is no correctness issue with letting L2 access the "hidden" memory. Allowing these memslots to be mapped into L2 fixes a largely theoretical bug where KVM could incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures consistent KVM behavior for L2. If KVM is using TDP, but L1 is using shadow paging for L2, then routing through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode ensure KVM uses different roots for L1 vs. L2. But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to incorrectly treat an L1 access to the hidden TSS or identity mapped page tables as MMIO. Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't actually prevent exposing KVM's internal memslots to L2, it simply forces KVM to emulate the access. In most cases, that will trigger MMIO, amusingly due to filling vcpu->arch.mmio_gfn, but also because vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC accesses are ok. But the hidden TSS and identity mapped page tables could go either way (MMIO or access the private memslot's backing memory). Alternatively, the inconsistent emulator behavior could be addressed by forcing MMIO emulation for L2 access to all internal memslots, not just to the APIC. But that's arguably less correct than letting L2 access the hidden TSS and identity mapped page tables, not to mention that it's *extremely* unlikely anyone cares what KVM does in this case. From L1's perspective there is R/W memory at those memslots, the memory just happens to be initialized with non-zero data. Making the memory disappear when it is accessed by L2 is far more magical and arbitrary than the memory existing in the first place. The APIC access page is special because KVM _must_ emulate the access to do the right thing (emulate an APIC access instead of reading/writing the APIC access page). And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private memslots to L2") said, it's not just necessary when L1 is accelerating L2's virtual APIC, it's just as important (likely *more* imporant for correctness when L1 is passing through its own APIC to L2. Fixes: 3a2936dedd20 ("kvm: mmu: Don't expose private memslots to L2") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Move private vs. shared check above slot validity checksSean Christopherson1-5/+15
Prioritize private vs. shared gfn attribute checks above slot validity checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to userspace if there is no memslot, but emulate accesses to the APIC access page even if the attributes mismatch. Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faultsSean Christopherson1-0/+3
WARN and skip the emulated MMIO fastpath if a private, reserved page fault is encountered, as private+reserved should be an impossible combination (KVM should never create an MMIO SPTE for a private access). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: check for invalid async page faults involving private memoryPaolo Bonzini2-7/+12
Right now the error code is not used when an async page fault is completed. This is not a problem in the current code, but it is untidy. For protected VMs, we will also need to check that the page attributes match the current state of the page, because asynchronous page faults can only occur on shared pages (private pages go through kvm_faultin_pfn_private() instead of __gfn_to_pfn_memslot()). Start by piping the error code from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the architecture-specific async page fault data. For now, it can be used to assert that there are no async page faults on private memory. Extracted from a patch by Isaku Yamahata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Use synthetic page fault error code to indicate private faultsSean Christopherson3-2/+21
Add and use a synthetic, KVM-defined page fault error code to indicate whether a fault is to private vs. shared memory. TDX and SNP have different mechanisms for reporting private vs. shared, and KVM's software-protected VMs have no mechanism at all. Usurp an error code flag to avoid having to plumb another parameter to kvm_mmu_page_fault() and friends. Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it for TDX and software-protected VMs as appropriate, but that would require *clearing* the flag for SEV and SEV-ES VMs, which support encrypted memory at the hardware layer, but don't utilize private memory at the KVM layer. Opportunistically add a comment to call out that the logic for software- protected VMs is (and was before this commit) broken for nested MMUs, i.e. for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with nested MMUs as there is a _lot_ of functionality that simply doesn't work for software-protected VMs, e.g. all of the paths where KVM accesses guest memory need to be updated to be aware of private vs. shared memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240228024147.41573-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zeroSean Christopherson1-0/+7
WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF, as the error code for #PF is limited to 32 bits (and in practice, 16 bits on Intel CPUS). This behavior is architectural, is part of KVM's ABI (see kvm_vcpu_events.error_code), and is explicitly documented as being preserved for intecerpted #PF in both the APM: The error code saved in EXITINFO1 is the same as would be pushed onto the stack by a non-intercepted #PF exception in protected mode. and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a 32-bit field. Simply drop the upper bits if hardware provides garbage, as spurious information should do no harm (though in all likelihood hardware is buggy and the kernel is doomed). Handling all upper 32 bits in the #PF path will allow moving the sanity check on synthetic checks from kvm_mmu_page_fault() to npf_interception(), which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's PFERR_GUEST_ENC_MASK without running afoul of the sanity check. Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs) and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest bit minimizes the probability of a collision with the "other" vendor, without needing to plumb more bits through microcode. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Pass full 64-bit error code when handling page faultsIsaku Yamahata3-5/+4
Plumb the full 64-bit error code throughout the page fault handling code so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK will be used to determine whether or not a fault is private vs. shared. Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change the behavior of permission_fault() when invoked in the page fault path, as KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault(). Continue passing '0' from the async #PF worker, as guest_memfd and thus private memory doesn't support async page faults. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [mdr: drop references/changes on rebase, update commit message] Signed-off-by: Michael Roth <michael.roth@amd.com> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240228024147.41573-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handlerSean Christopherson3-11/+18
Move the sanity check that hardware never sets bits that collide with KVM- define synthetic bits from kvm_mmu_page_fault() to npf_interception(), i.e. make the sanity check #NPF specific. The legacy #PF path already WARNs if _any_ of bits 63:32 are set, and the error code that comes from VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's EXIT_QUALIFICATION into error code flags). Add a compile-time assert in the legacy #PF handler to make sure that KVM- define flags are covered by its existing sanity check on the upper bits. Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we are removing the comment that defined it. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240228024147.41573-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Define more SEV+ page fault error bits/flags for #NPFSean Christopherson1-0/+4
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP) guests, as specified by the APM: * Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a VMPL check failure, 0 otherwise. * Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise. * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between PVALIDATE or RMPADJUST and the RMP, 0 otherwise. * Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission check failure, 0 otherwise. Note, the APM is *extremely* misleading, and strongly implies that the above flags can _only_ be set for #NPF exits from SNP guests. That is a lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_ flavor of SEV guest on SNP capable hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Remove separate "bit" defines for page fault error code masksSean Christopherson2-25/+12
Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240228024147.41573-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulationSean Christopherson2-8/+19
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault triggers emulation of any kind, as KVM doesn't currently support emulating access to guest private memory. Practically speaking, private faults and emulation are already mutually exclusive, but there are many flow that can result in KVM returning RET_PF_EMULATE, and adding one last check to harden against weird, unexpected combinations and/or KVM bugs is inexpensive. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-06LoongArch: KVM: Add mmio trace events supportBibo Mao2-6/+22
Add mmio trace events support, currently generic mmio events KVM_TRACE_MMIO_WRITE/xxx_READ/xx_READ_UNSATISFIED are added here. Also vcpu id field is added for all kvm trace events, since perf KVM tool parses vcpu id information for kvm entry event. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add software breakpoint supportBibo Mao7-3/+40
When VM runs in kvm mode, system will not exit to host mode when executing a general software breakpoint instruction such as INSN_BREAK, trap exception happens in guest mode rather than host mode. In order to debug guest kernel on host side, one mechanism should be used to let VM exit to host mode. Here a hypercall instruction with a special code is used for software breakpoint usage. VM exits to host mode and kvm hypervisor identifies the special hypercall code and sets exit_reason with KVM_EXIT_DEBUG. And then let qemu handle it. Idea comes from ppc kvm, one api KVM_REG_LOONGARCH_DEBUG_INST is added to get the hypercall code. VMM needs get sw breakpoint instruction with this api and set the corresponding sw break point for guest kernel. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add PV IPI support on guest sideBibo Mao8-2/+197
PARAVIRT config option and PV IPI is added for the guest side, function pv_ipi_init() is used to add IPI sending and IPI receiving hooks. This function firstly checks whether system runs in VM mode, and if kernel runs in VM mode, it will call function kvm_para_available() to detect the current hypervirsor type (now only KVM type detection is supported). The paravirt functions can work only if current hypervisor type is KVM, since there is only KVM supported on LoongArch now. PV IPI uses virtual IPI sender and virtual IPI receiver functions. With virtual IPI sender, IPI message is stored in memory rather than emulated HW. IPI multicast is also supported, and 128 vcpus can received IPIs at the same time like X86 KVM method. Hypercall method is used for IPI sending. With virtual IPI receiver, HW SWI0 is used rather than real IPI HW. Since VCPU has separate HW SWI0 like HW timer, there is no trap in IPI interrupt acknowledge. Since IPI message is stored in memory, there is no trap in getting IPI message. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add PV IPI support on host sideBibo Mao6-2/+211
On LoongArch system, IPI hw uses iocsr registers. There are one iocsr register access on IPI sending, and two iocsr access on IPI receiving for the IPI interrupt handler. In VM mode all iocsr accessing will cause VM to trap into hypervisor. So with one IPI hw notification there will be three times of trap. In this patch PV IPI is added for VM, hypercall instruction is used for IPI sender, and hypervisor will inject an SWI to the destination vcpu. During the SWI interrupt handler, only CSR.ESTAT register is written to clear irq. CSR.ESTAT register access will not trap into hypervisor, so with PV IPI supported, there is one trap with IPI sender, and no trap with IPI receiver, there is only one trap with IPI notification. Also this patch adds IPI multicast support, the method is similar with x86. With IPI multicast support, IPI notification can be sent to at most 128 vcpus at one time. It greatly reduces the times of trapping into hypervisor. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add vcpu mapping from physical cpuidBibo Mao4-0/+129
Physical CPUID is used for interrupt routing for irqchips such as ipi, msgint and eiointc interrupt controllers. Physical CPUID is stored at the CSR register LOONGARCH_CSR_CPUID, it can not be changed once vcpu is created and the physical CPUIDs of two vcpus cannot be the same. Different irqchips have different size declaration about physical CPUID, the max CPUID value for CSR LOONGARCH_CSR_CPUID on Loongson-3A5000 is 512, the max CPUID supported by IPI hardware is 1024, while for eiointc irqchip is 256, and for msgint irqchip is 65536. The smallest value from all interrupt controllers is selected now, and the max cpuid size is defines as 256 by KVM which comes from the eiointc irqchip. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add cpucfg area for kvm hypervisorBibo Mao3-17/+50
Instruction cpucfg can be used to get processor features. And there is a trap exception when it is executed in VM mode, and also it can be used to provide cpu features to VM. On real hardware cpucfg area 0 - 20 is used by now. Here one specified area 0x40000000 -- 0x400000ff is used for KVM hypervisor to provide PV features, and the area can be extended for other hypervisors in future. This area will never be used for real HW, it is only used by software. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add hypercall instruction emulationBibo Mao3-1/+38
On LoongArch system, there is a hypercall instruction special for virtualization. When system executes this instruction on host side, there is an illegal instruction exception reported, however it will trap into host when it is executed in VM mode. When hypercall is emulated, A0 register is set with value KVM_HCALL_INVALID_CODE, rather than inject EXCCODE_INE invalid instruction exception. So VM can continue to executing the next code. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch/smp: Refine some ipi functions on LoongArch platformBibo Mao7-71/+63
Refine the ipi handling on LoongArch platform, there are three modifications: 1. Add generic function get_percpu_irq(), replacing some percpu irq functions such as get_ipi_irq()/get_pmc_irq()/get_timer_irq() with get_percpu_irq(). 2. Change definition about parameter action called by function loongson_send_ipi_single() and loongson_send_ipi_mask(), and it is defined as decimal encoding format at ipi sender side. Normal decimal encoding is used rather than binary bitmap encoding for ipi action, ipi hw sender uses decimal encoding code, and ipi receiver will get binary bitmap encoding, the ipi hw will convert it into bitmap in ipi message buffer. 3. Add a structure smp_ops on LoongArch platform so that pv ipi can be used later. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-05Linux 6.9-rc7Linus Torvalds1-1/+1
2024-05-05epoll: be better about file lifetimesLinus Torvalds1-1/+37
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-04eventfs: Have "events" directory get permissions from its parentSteven Rostedt (Google)1-6/+24
The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r----- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 18:55 current_tracer -rw-r----- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w------- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r----- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] -rw-r----- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r----- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:16 current_tracer -rw-r----- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w------- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r----- 1 root lkp 0 May 1 21:16 set_event Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.161887248@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Do not treat events directory different than other directoriesSteven Rostedt (Google)1-15/+1
Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.002923579@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Do not differentiate the toplevel events directorySteven Rostedt (Google)2-25/+11
The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.846448710@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04tracefs: Still use mount point as default permissions for instancesSteven Rostedt (Google)1-2/+25
If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04tracefs: Reset permissions on remount if permissions are optionsSteven Rostedt (Google)3-2/+99
There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r----- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root root 0 May 1 21:25 dynamic_events -r--r----- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r----- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root tracing 0 May 1 21:25 dynamic_events -r--r----- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Free all of the eventfs_inode after RCUSteven Rostedt (Google)1-9/+16
The freeing of eventfs_inode via a kfree_rcu() callback. But the content of the eventfs_inode was being freed after the last kref. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop. Instead of using kfree_rcu() use call_rcu() that calls a function to do all the freeing of the eventfs_inode after a RCU grace period has expired. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>