path: root/virt/kvm/kvm_main.c

2022-11-06  Merge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 1 file, -0/+3)

* Fix the pKVM stage-1 walker erroneously using the stage-2 accessor
* Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration
* Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
* Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
* Document the boot requirements for FGT when entering the kernel at EL1

2022-10-31  KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them  (Gavin Shan, 1 file, -0/+3)

There are two capabilities related to ring-based dirty page tracking: KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. x86 supports both, while arm64 advertises only KVM_CAP_DIRTY_LOG_RING_ACQ_REL. However, userspace is not restricted to enabling only the capabilities that were advertised: KVM_CAP_DIRTY_LOG_RING can currently be enabled on arm64, which is wrong. Fix it by double-checking that the capability has been advertised prior to enabling it; if it hasn't, the enable request is rejected.

Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com

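As a rough sketch, the check lands in the enable-cap path roughly like this (simplified; helper names follow kvm_main.c):

    case KVM_CAP_DIRTY_LOG_RING:
    case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
        /* Reject the request if the capability was never advertised. */
        if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
            return -EINVAL;
        return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
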
2022-10-27  KVM: debugfs: Return retval of simple_attr_open() if it fails  (Hou Wenlong, 1 file, -7/+6)

Although simple_attr_open() fails only with -ENOMEM in the current code base, it is nicer to return its retval directly from kvm_debugfs_open(). No functional change intended.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
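A sketch of the resulting helper, assuming the existing kvm_get_kvm_safe()/kvm_put_kvm() reference handling around the open:

    static int kvm_debugfs_open(struct inode *inode, struct file *file,
                                int (*get)(void *, u64 *), int (*set)(void *, u64),
                                const char *fmt)
    {
        struct kvm_stat_data *stat_data = inode->i_private;
        int ret;

        /* The debugfs files hold a reference to the kvm struct which
         * is still valid when kvm_destroy_vm is called. */
        if (!kvm_get_kvm_safe(stat_data->kvm))
            return -ENOENT;

        ret = simple_attr_open(inode, file, get,
                               kvm_stats_debugfs_mode(stat_data->desc) & 0222
                               ? set : NULL, fmt);
        if (ret)
            kvm_put_kvm(stat_data->kvm);    /* drop the ref; return the retval */

        return ret;
    }
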
2022-10-22  kvm: Add support for arch compat vm ioctls  (Alexander Graf, 1 file, -0/+11)

We will introduce the first architecture-specific compat vm ioctl in the next patch. Add all the necessary boilerplate to allow architectures to override compat vm ioctls when necessary.

Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
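The boilerplate amounts to a weak default plus an early dispatch in the compat ioctl path; a simplified sketch:

    /* Generic fallback; architectures that need a compat path override it. */
    long __weak kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
                                         unsigned long arg)
    {
        return -ENOTTY;
    }

    /* In kvm_vm_compat_ioctl(), give the architecture first crack: */
    r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
    if (r != -ENOTTY)
        return r;
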
2022-10-03  Merge tag 'kvmarm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 1 file, -1/+8)

KVM/arm64 updates for v6.1

- Fixes for single-stepping in the presence of an async exception as well as the preservation of PSTATE.SS
- Better handling of AArch32 ID registers on AArch64-only systems
- Fixes for the dirty-ring API, allowing it to work on architectures with relaxed memory ordering
- Advertise the new kvmarm mailing list
- Various minor cleanups and spelling fixes

2022-09-29  KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option  (Marc Zyngier, 1 file, -1/+8)

In order to differentiate between architectures that require no extra synchronisation when accessing the dirty ring and those that do, add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identifies the latter sort. TSO architectures can obviously advertise both, while relaxed architectures must only advertise the ACQ_REL version.

This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING being only indirectly selected by two top-level config symbols:

- HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86)
- HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64)

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
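In kvm_main.c, the capability advertisement then keys off the two new symbols; a sketch of the shape:

    #ifdef CONFIG_HAVE_KVM_DIRTY_RING_TSO
        case KVM_CAP_DIRTY_LOG_RING:
            return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
    #endif
    #ifdef CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL
        case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
            return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
    #endif
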
2022-09-26  KVM: remove KVM_REQ_UNHALT  (Paolo Bonzini, 1 file, -3/+1)

KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20220921003201.1441511-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-09-26  KVM: fix memory leak in kvm_init()  (Miaohe Lin, 1 file, -3/+2)

When alloc_cpumask_var_node() fails for a certain cpu, there might already be some allocated cpumasks for the percpu cpu_kick_mask. We should free these cpumasks, or a memory leak will occur.

Fixes: baff59ccdc65 ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
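A sketch of the fix; the unwind label name is illustrative, the exact label in kvm_init() differs:

    for_each_possible_cpu(cpu) {
        if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
                                    GFP_KERNEL, cpu_to_node(cpu))) {
            r = -ENOMEM;
            goto out_free_cpumasks;
        }
    }
    ...
    out_free_cpumasks:
        /* The fix: also free the cpumasks allocated before the failure. */
        for_each_possible_cpu(cpu)
            free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
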
2022-08-19  KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()  (Li kunyu, 1 file, -1/+1)

The variable is initialized at declaration, but that value is never read: the variable is only used after being reassigned.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-19  KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow()  (Li kunyu, 1 file, -1/+1)

The variable is initialized at declaration, but that value is never read: the variable is only used after being reassigned.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-19  KVM: Rename mmu_notifier_* to mmu_invalidate_*  (Chao Peng, 1 file, -25/+27)

The motivation for this renaming is to make these variables and related helper functions less bound to the mmu_notifier, so that they can also be used for page invalidation that is not mmu_notifier based. mmu_invalidate_* was chosen to better describe the purpose these variables serve: 'invalidating' a page.

- mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end.
- The mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}.
- mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count.
- While here, also rename kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-19  KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()  (Sean Christopherson, 1 file, -5/+6)

Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating and initializing coalesced MMIO objects is separate from registering any associated devices. Moving coalesced MMIO cleans up the last oddity where KVM does VM creation/initialization after kvm_create_vm(), and more importantly after kvm_arch_post_init_vm() is called and the VM is added to the global vm_list, i.e. after the VM is fully created as far as KVM is concerned.

Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but the original implementation was completely devoid of error handling. Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling") fixed the various bugs, and in doing so rightly moved the call to after kvm_create_vm() because kvm_coalesced_mmio_init() also registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make coalesced mmio use a device per zone") cleaned up that mess by having each zone register a separate device, i.e. moved device registration to its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result, kvm_coalesced_mmio_init() is now a "pure" initialization helper and can be safely called from kvm_create_vm().

Opportunistically drop the #ifdef; KVM provides stubs for kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220816053937.2477106-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-19  KVM: Unconditionally get a ref to /dev/kvm module when creating a VM  (Sean Christopherson, 1 file, -10/+4)

Unconditionally get a reference to the /dev/kvm module when creating a VM instead of using try_get_module(), which will fail if the module is in the process of being forcefully unloaded. The error handling when try_get_module() fails doesn't properly unwind all that has been done, e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM from the global list. Not removing VMs from the global list tends to be fatal, e.g. leads to use-after-free explosions.

The obvious alternative would be to add proper unwinding, but the justification for using try_get_module(), "rmmod --wait", is completely bogus, as support for "rmmod --wait", i.e. delete_module() without O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod --wait option.") nearly a decade ago.

It's still possible for try_get_module() to fail due to the module dying (more like being killed), as the module will be tagged MODULE_STATE_GOING by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice with forced unloading is an exercise in futility and gives a false sense of security. Using try_get_module() only prevents acquiring _new_ references, it doesn't magically put the references held by other VMs, and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but guaranteed to cause spectacular fireworks; the window where KVM will fail try_get_module() is tiny compared to the window where KVM is building and running the VM with an elevated module refcount.

Addressing KVM's inability to play nice with "rmmod --force" is firmly out-of-scope. Forcefully unloading any module taints the kernel (for obvious reasons) _and_ requires the kernel to be built with CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the amusing disclaimer that it's "mainly for kernel developers and desperate users". In other words, KVM is free to scoff at bug reports due to using "rmmod --force" while VMs may be running.

Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220816053937.2477106-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
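The change boils down to replacing the fallible acquisition with an unconditional one in kvm_create_vm(); a simplified sketch:

    /* Before: VM creation could fail if the module was being unloaded. */
    if (!try_module_get(kvm_chardev_ops.owner)) {
        r = -ENODEV;
        goto out_err;
    }

    /* After: unconditionally pin the module; kvm_destroy_vm() puts it. */
    __module_get(kvm_chardev_ops.owner);
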
2022-08-19  KVM: Properly unwind VM creation if creating debugfs fails  (Sean Christopherson, 1 file, -8/+8)

Properly unwind VM creation if kvm_create_vm_debugfs() fails. A recent change to invoke kvm_create_vm_debugfs() in kvm_create_vm() was led astray by the buggy try_get_module() handling added by commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed"). The debugfs error path effectively inherits the bad error path of try_module_get(), e.g. KVM leaves the to-be-freed VM on vm_list even though KVM appears to do the right thing by calling module_put() and falling through.

Opportunistically hoist kvm_create_vm_debugfs() above the call to kvm_arch_post_init_vm() so that the "post-init" arch hook is actually invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init() for the moment). x86 is the only non-nop implementation of the post-init hook, and it doesn't allocate/initialize any objects that are reachable via debugfs code (it spawns a kthread worker for the NX huge page mitigation).

Leave the buggy try_get_module() alone for now, it will be fixed in a separate commit.

Fixes: b74ed7a68ec1 ("KVM: Actually create debugfs in kvm_create_vm()")
Reported-by: syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20220816053937.2477106-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-10  KVM: Actually create debugfs in kvm_create_vm()  (Oliver Upton, 1 file, -17/+19)

Doing debugfs creation after VM creation leaves things in a quasi-initialized state for a while. This is further complicated by the fact that we tear down debugfs from kvm_destroy_vm(). Align debugfs and stats init/destroy with the VM init/destroy pattern to avoid any headaches.

Note that a fix for a benign mistake in error handling for calls to kvm_arch_create_vm_debugfs() is rolled in. Since all implementations of the function return 0 unconditionally, it isn't actually a bug at the moment.

Lastly, tear down debugfs/stats data in the kvm_create_vm_debugfs() error path. Previously it was safe to assume that kvm_destroy_vm() would take out the garbage; that is no longer the case.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-10  KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()  (Oliver Upton, 1 file, -3/+6)

At the time the VM fd is used in kvm_create_vm_debugfs(), the fd has been allocated but not yet installed. It is only really useful as an identifier in strings for the VM (such as debugfs). Treat it exactly as such by passing the string name of the fd to kvm_create_vm_debugfs(), future-proofing against possible misuse of the VM fd.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-5-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-10  KVM: Get an fd before creating the VM  (Oliver Upton, 1 file, -13/+17)

Allocate a VM's fd at the very beginning of kvm_dev_ioctl_create_vm() so that KVM can use the fd value to generate strings, e.g. for debugfs, when creating and initializing the VM.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-10  KVM: Shove vcpu stats_id init into kvm_vcpu_init()  (Oliver Upton, 1 file, -4/+4)

Initialize stats_id alongside other kvm_vcpu fields to make it more difficult to unintentionally access stats_id before it's set. No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-3-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-10  KVM: Shove vm stats_id init into kvm_create_vm()  (Oliver Upton, 1 file, -3/+3)

Initialize stats_id alongside other struct kvm fields to make it more difficult to unintentionally access stats_id before it's set. While at it, move the format string to the first line of the call and fix the indentation of the second line. No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-2-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-08-01  Merge remote-tracking branch 'kvm/next' into kvm-next-5.20  (Paolo Bonzini, 1 file, -57/+162)

KVM/s390, KVM/x86 and common infrastructure changes for 5.20

x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation

s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes

Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple

x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs

x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID

2022-07-29  KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache  (Anup Patel, 1 file, -1/+3)

kvm_mmu_topup_memory_cache() always uses GFP_KERNEL_ACCOUNT for memory allocation, which prevents its use in atomic context. To address this limitation, add a gfp_custom flag to struct kvm_mmu_memory_cache. When gfp_custom is set to some GFP_xyz flags, kvm_mmu_topup_memory_cache() will use that instead of GFP_KERNEL_ACCOUNT.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
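The gist of the change in the topup path, as a one-line sketch:

    /* In __kvm_mmu_topup_memory_cache(): fall back to the default only
     * when the caller didn't request custom GFP flags. */
    gfp_t gfp = mc->gfp_custom ? mc->gfp_custom : GFP_KERNEL_ACCOUNT;
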
2022-06-24  KVM: debugfs: expose pid of vcpu threads  (Vineeth Pillai, 1 file, -2/+13)

Add a new debugfs file to expose the pid of each vcpu thread. This is very helpful for userland tools to get the vcpu pids without worrying about the thread naming conventions of the VMM.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Message-Id: <20220523190327.2658-1-vineeth@bitbyteword.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
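A simplified sketch of the new attribute (details such as the RCU accessor may differ in the upstream version):

    static int vcpu_get_pid(void *data, u64 *val)
    {
        struct kvm_vcpu *vcpu = data;

        rcu_read_lock();
        *val = pid_nr(rcu_dereference(vcpu->pid));
        rcu_read_unlock();
        return 0;
    }

    DEFINE_SIMPLE_ATTRIBUTE(vcpu_get_pid_fops, vcpu_get_pid, NULL, "%llu\n");

    /* Hooked up from the vcpu debugfs creation path: */
    debugfs_create_file("pid", 0444, vcpu->debugfs_dentry, vcpu,
                        &vcpu_get_pid_fops);
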
2022-06-24  KVM: Allow for different capacities in kvm_mmu_memory_cache structs  (David Matlack, 1 file, -3/+30)

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at declaration time rather than being fixed for all declarations. This will be used in a follow-up commit to declare a cache in x86 with a capacity of 512+ objects without having to increase the capacity of all caches in KVM.

This change requires that each cache now specify its capacity at runtime, since the cache struct itself no longer has a fixed capacity known at compile time. To protect against someone accidentally defining a kvm_mmu_memory_cache struct directly (without the extra storage), this commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().

In order to support different capacities, this commit changes the objects pointer array to be dynamically allocated the first time the cache is topped up.

While here, opportunistically clean up the stack-allocated kvm_mmu_memory_cache structs in riscv and arm64 to use designated initializers.

No functional change intended.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
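A sketch of how a stack-declared cache looks after the change (the capacity-defaulting behavior described here is an assumption based on the commit message):

    /* Designated initializer; .capacity left at 0 picks the arch default
     * (KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE) when the cache is first topped up. */
    struct kvm_mmu_memory_cache cache = {
        .gfp_zero = __GFP_ZERO,
    };

    /* First topup dynamically allocates the objects array. */
    r = kvm_mmu_topup_memory_cache(&cache, min_pages);
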
2022-06-20  KVM: Do not zero initialize 'pfn' in hva_to_pfn()  (Sean Christopherson, 1 file, -1/+1)

Drop the unnecessary initialization of the local 'pfn' variable in hva_to_pfn(). First and foremost, '0' is not an invalid pfn, it's a perfectly valid pfn on most architectures. I.e. if hva_to_pfn() were to return an "uninitialized" pfn, it would actually be interpreted as a legal pfn by most callers.

Second, hva_to_pfn() can't return an uninitialized pfn, as hva_to_pfn() explicitly sets pfn to an error value (or returns an error value directly) if a helper returns failure, and all helpers set the pfn on success.

The zeroing of 'pfn' was introduced by commit 2fc843117d64 ("KVM: reorganize hva_to_pfn"), probably to avoid "uninitialized variable" warnings on statements that return pfn. However, no compiler seems to produce them, making the initialization unnecessary.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()  (Sean Christopherson, 1 file, -14/+52)

Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() to better reflect what KVM is actually checking, and to eliminate extra pfn_to_page() lookups. The kvm_release_pfn_*() and kvm_try_get_pfn() helpers in particular benefit from "refcounted" nomenclature, as it's not all that obvious why KVM needs to get/put refcounts for some PG_reserved pages (ZERO_PAGE and ZONE_DEVICE).

Add a comment to call out that the list of exceptions to PG_reserved is all but guaranteed to be incomplete. The list has mostly been compiled by people throwing noodles at KVM and finding out they stick a little too well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages didn't get freed.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page()  (Sean Christopherson, 1 file, -4/+4)

Operate on a 'struct page' instead of a pfn when checking if a page is a ZONE_DEVICE page, and rename the helper accordingly. Generally speaking, KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do anything special for ZONE_DEVICE memory. Rather, KVM wants to treat ZONE_DEVICE memory like regular memory, and the need to identify ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In other words, KVM should only ever check for ZONE_DEVICE memory after KVM has already verified that there is a struct page associated with the pfn.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()  (Sean Christopherson, 1 file, -20/+11)

Drop helpers to convert a gfn/gpa to a 'struct page' in the context of a vCPU. KVM doesn't require that guests be backed by 'struct page' memory, thus any use of helpers that assume 'struct page' is bound to be flawed, as was the case for the recently removed last user in x86's nested VMX.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Don't WARN if kvm_pfn_to_page() encounters a "reserved" pfn  (Sean Christopherson, 1 file, -3/+1)

Drop a WARN_ON() if kvm_pfn_to_page() encounters a "reserved" pfn, which in this context means a struct page that has PG_reserved but is not a/the ZERO_PAGE and is not a ZONE_DEVICE page. The usage, via gfn_to_page(), in x86 is safe, as gfn_to_page() is used only to retrieve a page from a KVM-controlled memslot, but the usage in PPC and s390 operates on arbitrary gfns and thus memslots that can be backed by incompatible memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Avoid pfn_to_page() and vice versa when releasing pages  (Sean Christopherson, 1 file, -21/+43)

Invert the order of KVM's page/pfn release helpers so that the "inner" helper operates on a page instead of a pfn. As pointed out by Linus[*], converting between struct page and a pfn isn't necessarily cheap, and that's not even counting the overhead of is_error_noslot_pfn() and kvm_is_reserved_pfn(). Even if the checks were dirt cheap, there's no reason to convert from a page to a pfn and back to a page, just to mark the page dirty/accessed or to put a reference to the page.

Opportunistically drop a stale declaration of kvm_set_page_accessed() from kvm_host.h (there was no implementation).

No functional change intended.

[*] https://lore.kernel.org/all/CAHk-=wifQimj2d6npq-wCi5onYPjzQg4vyO4tFcPJJZr268cRw@mail.gmail.com

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
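A sketch of the inverted helpers (simplified; using the kvm_is_reserved_pfn() check as it existed at this point in the series):

    /* The "inner" helper now takes a page... */
    void kvm_release_page_clean(struct page *page)
    {
        WARN_ON(is_error_page(page));
        kvm_set_page_accessed(page);    /* now a page-based, file-local helper */
        put_page(page);
    }

    /* ...and the pfn variant converts exactly once, instead of doing
     * page -> pfn -> page round trips. */
    void kvm_release_pfn_clean(kvm_pfn_t pfn)
    {
        if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
            kvm_release_page_clean(pfn_to_page(pfn));
    }
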
2022-06-20  KVM: Don't set Accessed/Dirty bits for ZERO_PAGE  (Sean Christopherson, 1 file, -2/+14)

Don't set Accessed/Dirty bits for a struct page with PG_reserved set, i.e. don't set A/D bits for the ZERO_PAGE. The ZERO_PAGE (or pages depending on the architecture) should obviously never be written, and similarly there's no point in marking it accessed as the page will never be swapped out or reclaimed. The comment in page-flags.h is quite clear that PG_reserved pages should be managed only by their owner, and strictly following that mandate also simplifies KVM's logic.

Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-20  KVM: Drop bogus "pfn != 0" guard from kvm_release_pfn()  (Sean Christopherson, 1 file, -3/+0)

Remove a check from kvm_release_pfn() to bail if the provided @pfn is zero. Zero is a perfectly valid pfn on most architectures, and should not be used to indicate an error or an invalid pfn. The bogus check was added by commit 917248144db5 ("x86/kvm: Cache gfn to pfn translation"), which also did the bad thing of zeroing the pfn and gfn to mark a cache invalid. Thankfully, that bad behavior was axed by commit 357a18ad230f ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache").

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-15  KVM: Rename ack_flush() to ack_kick()  (Lai Jiangshan, 1 file, -2/+2)

Make it use the same verb as in kvm_kick_many_cpus().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-06-09  Merge branch 'kvm-5.20-early'  (Paolo Bonzini, 1 file, -4/+15)

s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to show tests

x86:
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Rewrite gfn-pfn cache refresh
* Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit

2022-06-09  KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking  (Maxim Levitsky, 1 file, -2/+6)

On SVM/AVIC, if preemption happens right after the call to finish_rcuwait but before the call to kvm_arch_vcpu_unblocking, preemption itself will re-enable AVIC, and then we will try to re-enable it again in kvm_arch_vcpu_unblocking, which will lead to a warning in __avic_vcpu_load.

The same problem can happen if the vCPU is preempted right after the call to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait; in this case, we will end up with AVIC enabled during sleep - Ooops.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
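The shape of the fix, as a sketch of the blocking/unblocking call sites in the common halt path:

    /* The arch blocking hooks now run with preemption disabled, so a
     * preemption notifier can't race with them. */
    preempt_disable();
    kvm_arch_vcpu_blocking(vcpu);
    preempt_enable();

    /* ...rcuwait-based wait loop runs preemptible as before... */

    preempt_disable();
    kvm_arch_vcpu_unblocking(vcpu);
    preempt_enable();
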
2022-06-09  Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEAD  (Paolo Bonzini, 1 file, -1/+1)

KVM/riscv fixes for 5.19, take #1

- Typo fix in arch/riscv/kvm/vmid.c
- Remove broken reference pattern from MAINTAINERS entry

2022-06-08  KVM: Move kvm_arch_vcpu_precreate() under kvm->lock  (Zeng Guang, 1 file, -4/+6)

kvm_arch_vcpu_precreate() is meant to prepare arch-specific VM resources prior to the actual creation of a vCPU. For example, the x86 platform may need to do per-VM allocation based on max_vcpu_ids at the first vCPU creation. This likely requires concurrency control over the allocation, as multiple vCPU creations could happen simultaneously. From the architectural point of view, it is necessary to execute kvm_arch_vcpu_precreate() under the protection of kvm->lock.

Currently only arm64, x86 and s390 have non-nop implementations at the stage of vCPU pre-creation. Remove the lock acquisition in s390's design and make sure all architectures can run kvm_arch_vcpu_precreate() safely under kvm->lock without recursive locking issues.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154409.11842-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
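A sketch of the resulting flow in kvm_vm_ioctl_create_vcpu() (simplified; assumes the created_vcpus bookkeeping shares the same critical section):

    mutex_lock(&kvm->lock);
    if (kvm->created_vcpus >= kvm->max_vcpus) {
        mutex_unlock(&kvm->lock);
        return -EINVAL;
    }

    /* Now serialized against concurrent vCPU creation. */
    r = kvm_arch_vcpu_precreate(kvm, id);
    if (r) {
        mutex_unlock(&kvm->lock);
        return r;
    }

    kvm->created_vcpus++;
    mutex_unlock(&kvm->lock);
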
2022-06-07  Merge branch 'kvm-5.20-early-patches' into HEAD  (Paolo Bonzini, 1 file, -0/+9)

2022-06-07  Merge branch 'kvm-5.19-early-fixes' into HEAD  (Paolo Bonzini, 1 file, -1/+4)

2022-06-07  KVM: Don't null dereference ops->destroy  (Alexey Kardashevskiy, 1 file, -1/+4)

A KVM device cleanup happens in either of two callbacks:
1) destroy(), which is called when the VM is being destroyed;
2) release(), which is called when a device fd is closed.

Most KVM devices use 1), but Book3s's interrupt controller KVM devices (XICS, XIVE, XIVE-native) use 2), as they need to be closed and reopened during machine execution. The error handling in kvm_ioctl_create_device() assumes destroy() is always defined, which leads to a NULL dereference, as discovered by syzkaller.

Add a check for destroy != NULL and the missing release() call. kvm_destroy_devices() is left unchanged, as devices with a defined release() should have been removed from the KVM devices list by then.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
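A sketch of the corrected error path in kvm_ioctl_create_device() (simplified):

    if (ret < 0) {
        kvm_put_kvm_no_destroy(kvm);
        mutex_lock(&kvm->lock);
        list_del(&dev->vm_node);
        if (dev->ops->release)
            dev->ops->release(dev);    /* previously missing */
        mutex_unlock(&kvm->lock);
        if (dev->ops->destroy)         /* previously called unconditionally */
            dev->ops->destroy(dev);
        return ret;
    }
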
2022-05-26  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 1 file, -1/+2)

Pull kvm updates from Paolo Bonzini:

 "S390:
  - ultravisor communication device driver
  - fix TEID on terminating storage key ops

  RISC-V:
  - Added Sv57x4 support for G-stage page table
  - Added range based local HFENCE functions
  - Added remote HFENCE functions based on VCPU requests
  - Added ISA extension registers in ONE_REG interface
  - Updated KVM RISC-V maintainers entry to cover selftests support

  ARM:
  - Add support for the ARMv8.6 WFxT extension
  - Guard pages for the EL2 stacks
  - Trap and emulate AArch32 ID registers to hide unsupported features
  - Ability to select and save/restore the set of hypercalls exposed to the guest
  - Support for PSCI-initiated suspend in collaboration with userspace
  - GICv3 register-based LPI invalidation support
  - Move host PMU event merging into the vcpu data structure
  - GICv3 ITS save/restore fixes
  - The usual set of small-scale cleanups and fixes

  x86:
  - New ioctls to get/set TSC frequency for a whole VM
  - Allow userspace to opt out of hypercall patching
  - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

  AMD SEV improvements:
  - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
  - V_TSC_AUX support

  Nested virtualization improvements for AMD:
  - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF)
  - Allow AVIC to co-exist with a nested guest running
  - Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support
  - PAUSE filtering for nested hypervisors

  Guest support:
  - Decoupling of vcpu_is_preempted from PV spinlocks"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits)
  KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
  KVM: selftests: x86: Sync the new name of the test case to .gitignore
  Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
  x86, kvm: use correct GFP flags for preemption disabled
  KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
  x86/kvm: Alloc dummy async #PF token outside of raw spinlock
  KVM: x86: avoid calling x86 emulator without a decoded instruction
  KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
  x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
  s390/uv_uapi: depend on CONFIG_S390
  KVM: selftests: x86: Fix test failure on arch lbr capable platforms
  KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
  KVM: s390: selftest: Test suppression indication on key prot exception
  KVM: s390: Don't indicate suppression on dirtying, failing memop
  selftests: drivers/s390x: Add uvdevice tests
  drivers/s390/char: Add Ultravisor io device
  MAINTAINERS: Update KVM RISC-V entry to cover selftests support
  RISC-V: KVM: Introduce ISA extension register
  RISC-V: KVM: Cleanup stale TLB entries when host CPU changes
  RISC-V: KVM: Add remote HFENCE functions based on VCPU requests
  ...

2022-05-25  KVM: Fix multiple races in gfn=>pfn cache refresh  (Sean Christopherson, 1 file, -0/+9)

Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races between the cache itself, and between the cache and mmu_notifier events.

The existing refresh code attempts to guard against races with the mmu_notifier by speculatively marking the cache valid, and then marking it invalid if a mmu_notifier invalidation occurs. That handles the case where an invalidation occurs between dropping and re-acquiring gpc->lock, but it doesn't handle the scenario where the cache is refreshed after the cache was invalidated by the notifier, but before the notifier elevates mmu_notifier_count. The gpc refresh can't use the "retry" helper as its invalidation occurs _before_ mmu_notifier_count is elevated and before mmu_notifier_range_start is set/updated.

  CPU0                                    CPU1
  ----                                    ----

  gfn_to_pfn_cache_invalidate_start()
  |
  -> gpc->valid = false;
                                          kvm_gfn_to_pfn_cache_refresh()
                                          |
                                          |-> gpc->valid = true;

                                          hva_to_pfn_retry()
                                          |
                                          -> acquire kvm->mmu_lock
                                             kvm->mmu_notifier_count == 0
                                             mmu_seq == kvm->mmu_notifier_seq
                                             drop kvm->mmu_lock
                                             return pfn 'X'
  acquire kvm->mmu_lock
  kvm_inc_notifier_count()
  drop kvm->mmu_lock()
  kernel frees pfn 'X'
                                          kvm_gfn_to_pfn_cache_check()
                                          |
                                          |-> gpc->valid == true

                                          caller accesses freed pfn 'X'

Key off of mn_active_invalidate_count to detect that a pfncache refresh needs to wait for an in-progress mmu_notifier invalidation. While mn_active_invalidate_count is not guaranteed to be stable, it is guaranteed to be elevated prior to an invalidation acquiring gpc->lock, so either the refresh will see an active invalidation and wait, or the invalidation will run after the refresh completes.

Speculatively marking the cache valid is itself flawed, as a concurrent kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva values. The KVM Xen use case explicitly allows/wants multiple users; even though the caches are allocated per vCPU, __kvm_xen_has_interrupt() can read a different vCPU (or vCPUs). Address this race by invalidating the cache prior to dropping gpc->lock (this is made possible by fixing the above mmu_notifier race).

Complicating all of this is the fact that both the hva=>pfn resolution and mapping of the kernel address can sleep, i.e. must be done outside of gpc->lock.

Fix the above races in one fell swoop, trying to fix each individual race is largely pointless and essentially impossible to test, e.g. closing one hole just shifts the focus to the other hole.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429210025.3293691-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-05-25  Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 1 file, -11/+32)

KVM/arm64 updates for 5.19

- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes

[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated from 4 to 6. - Paolo]

2022-05-20  KVM: Free new dirty bitmap if creating a new memslot fails  (Sean Christopherson, 1 file, -1/+1)

Fix a goof in kvm_prepare_memory_region() where KVM fails to free the new memslot's dirty bitmap during a CREATE action if kvm_arch_prepare_memory_region() fails. The logic is supposed to detect if the bitmap was allocated and thus needs to be freed, versus if the bitmap was inherited from the old memslot and thus needs to be kept. If there is no old memslot, then obviously the bitmap can't have been inherited.

The bug was exposed by commit 86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR"), which made it trivially easy for syzkaller to trigger failure during kvm_arch_prepare_memory_region(), but the bug can be hit other ways too, e.g. due to -ENOMEM when allocating x86's memslot metadata.

The backtrace from kmemleak:

  __vmalloc_node_range+0xb40/0xbd0 mm/vmalloc.c:3195
  __vmalloc_node mm/vmalloc.c:3232 [inline]
  __vmalloc+0x49/0x50 mm/vmalloc.c:3246
  __vmalloc_array mm/util.c:671 [inline]
  __vcalloc+0x49/0x70 mm/util.c:694
  kvm_alloc_dirty_bitmap virt/kvm/kvm_main.c:1319
  kvm_prepare_memory_region virt/kvm/kvm_main.c:1551
  kvm_set_memslot+0x1bd/0x690 virt/kvm/kvm_main.c:1782
  __kvm_set_memory_region+0x689/0x750 virt/kvm/kvm_main.c:1949
  kvm_set_memory_region virt/kvm/kvm_main.c:1962
  kvm_vm_ioctl_set_memory_region virt/kvm/kvm_main.c:1974
  kvm_vm_ioctl+0x377/0x13a0 virt/kvm/kvm_main.c:4528
  vfs_ioctl fs/ioctl.c:51
  __do_sys_ioctl fs/ioctl.c:870
  __se_sys_ioctl fs/ioctl.c:856
  __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856
  do_syscall_x64 arch/x86/entry/common.c:50
  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

And the relevant sequence of KVM events:

  ioctl(3, KVM_CREATE_VM, 0) = 4
  ioctl(4, KVM_SET_USER_MEMORY_REGION, {slot=0,
                                        flags=KVM_MEM_LOG_DIRTY_PAGES,
                                        guest_phys_addr=0x10000000000000,
                                        memory_size=4096,
                                        userspace_addr=0x20fe8000}
       ) = -1 EINVAL (Invalid argument)

Fixes: 244893fa2859 ("KVM: Dynamically allocate "new" memslots from the get-go")
Cc: stable@vger.kernel.org
Reported-by: syzbot+8606b8a9cc97a63f1c87@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220518003842.1341782-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
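The fix is essentially a one-condition change in kvm_prepare_memory_region()'s error path; a sketch:

    /* Free the bitmap if it was allocated for this change, i.e. treat
     * "no old memslot" the same as "old memslot without a bitmap". */
    if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
        kvm_destroy_dirty_bitmap(new);
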
2022-05-02  KVM: Add max_vcpus field in common 'struct kvm'  (Sean Christopherson, 1 file, -1/+2)

For TDX guests, the maximum number of vcpus needs to be specified when the TDX guest VM is initialized (creating the TDX data corresponding to the TDX guest), before creating any vcpu. It needs to record the maximum number of vcpus on VM creation (KVM_CREATE_VM) and return an error if the number of vcpus exceeds it.

Because there is already a max_vcpu member in the arm64 struct kvm_arch, move it to the common struct kvm and initialize it to KVM_MAX_VCPUS before kvm_arch_init_vm(), instead of adding it to the x86 struct kvm_arch.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <e53234cdee6a92357d06c80c03d77c19cdefb804.1646422845.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-04-29  Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD  (Paolo Bonzini, 1 file, -0/+1)

Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees:

* Fix potential races when walking host page table
* Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
* Fix shadow page table leak when KVM runs nested

2022-04-29  KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT  (Paolo Bonzini, 1 file, -0/+1)

When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags member that at the time was unused. Unfortunately this extensibility mechanism has several issues:

- x86 is not writing the member, so it would not be possible to use it on x86 except for new events
- the member is not aligned to 64 bits, so the definition of the uAPI struct is incorrect for 32-bit userspace on a 64-bit kernel. This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately usage of flags was only introduced in 5.18.

Since padding has to be introduced, place a new field in there that tells whether the flags field is valid. To allow further extensibility, in fact, change flags to an array of 16 values, and store how many of the values are valid. The availability of the new ndata field is tied to a system capability; all architectures are changed to fill in the field.

To avoid breaking compilation of userspace that was using the flags field, provide a userspace-only union to overlap flags with data[0]. The new field is placed at the same offset for both 32- and 64-bit userspace.

Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
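A sketch of the resulting uAPI layout in struct kvm_run (the #ifndef __KERNEL__ alias keeps old userspace building):

    /* KVM_EXIT_SYSTEM_EVENT */
    struct {
        __u32 type;
        __u32 ndata;     /* how many entries of data[] are valid */
        union {
    #ifndef __KERNEL__
            __u64 flags; /* userspace-only alias of data[0] */
    #endif
            __u64 data[16];
        };
    } system_event;
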
2022-04-21  KVM: SEV: add cache flush to solve SEV cache incoherency issues  (Mingwei Zhang, 1 file, -3/+24)

Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned.

Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines is released by the VM and reallocated to another user, the cachelines may corrupt the new user at a later time.

KVM takes a shortcut by assuming all confidential memory remains pinned until the end of the VM's lifetime. Therefore, KVM does not flush caches at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicious userspace can crash the host kernel by creating a malicious VM that continuously allocates and releases unpinned confidential memory pages while the VM is running.

Add cache flush operations to the mmu_notifier operations to ensure that any physical memory leaving the guest VM gets flushed. In particular, hook the mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush caches accordingly. The flush is done after releasing the mmu lock to avoid contention with other vCPUs.

Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-04-21  KVM: SPDX style and spelling fixes  (Tom Rix, 1 file, -2/+2)

SPDX comments use /* */ style comments in headers and // style comments in .c files. Also fix two spelling mistakes.

Signed-off-by: Tom Rix <trix@redhat.com>
Message-Id: <20220410153840.55506-1-trix@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2022-04-21  KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref  (Sean Christopherson, 1 file, -6/+6)

Initialize debugfs_dentry to its semi-magical -ENOENT value when the VM is created. KVM's teardown when VM creation fails is kludgy and calls kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never attempted kvm_create_vm_debugfs(). Because debugfs_dentry is zero-initialized, the IS_ERR() checks pass and KVM dereferences a NULL pointer.

  BUG: kernel NULL pointer dereference, address: 0000000000000018
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
  Oops: 0000 [#1] SMP
  CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__dentry_path+0x7b/0x130
  Call Trace:
   <TASK>
   dentry_path_raw+0x42/0x70
   kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
   kvm_put_kvm+0x63/0x2b0 [kvm]
   kvm_dev_ioctl+0x43a/0x920 [kvm]
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x31/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  Modules linked in: kvm_intel kvm irqbypass

Fixes: a44a4cc1c969 ("KVM: Don't create VM debugfs files outside of the VM directory")
Cc: stable@vger.kernel.org
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@google.com>
Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20220415004622.2207751-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
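A sketch of the idea: seed the dentry with an error pointer at creation so the teardown checks fail safely even if debugfs creation was never attempted:

    /* In kvm_create_vm(), before any path that can fail: */
    kvm->debugfs_dentry = ERR_PTR(-ENOENT);

    /* ...so checks like this one do the right thing: */
    static void kvm_destroy_vm_debugfs(struct kvm *kvm)
    {
        if (IS_ERR(kvm->debugfs_dentry))
            return;
        /* ...actual teardown... */
    }
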
2022-04-08  Merge tag 'kvmarm-fixes-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 1 file, -2/+8)

KVM/arm64 fixes for 5.18, take #1

- Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2
- Fix the MMU write-lock not being taken on THP split
- Fix mixed-width VM handling
- Fix potential UAF when debugfs registration fails
- Various selftest updates for all of the above