diff options
Diffstat (limited to 'Documentation/virt/kvm/locking.rst')
-rw-r--r-- | Documentation/virt/kvm/locking.rst | 123 |
1 files changed, 82 insertions, 41 deletions
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index c02291beac3f..845a561629f1 100644 --- a/Documentation/virt/kvm/locking.rst +++ b/Documentation/virt/kvm/locking.rst @@ -16,7 +16,25 @@ The acquisition orders for mutexes are as follows: - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring them together is quite rare. -On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock. +- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before + synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock + can be taken inside a kvm->srcu read-side critical section, + while kvm->slots_lock cannot. + +- kvm->mn_active_invalidate_count ensures that pairs of + invalidate_range_start() and invalidate_range_end() callbacks + use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock + are taken on the waiting side in install_new_memslots, so MMU notifiers + must not take either kvm->slots_lock or kvm->slots_arch_lock. + +On x86: + +- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock + +- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and + kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and + cannot be taken without already holding kvm->arch.mmu_lock (typically with + ``read_lock`` for the TDP MMU, thus the need for additional spinlocks). Everything else is a leaf: no other lock is taken inside the critical sections. @@ -31,25 +49,24 @@ the mmu-lock on x86. Currently, the page fault can be fast in one of the following two cases: 1. Access Tracking: The SPTE is not present, but it is marked for access - tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to - restore the saved R/X bits. This is described in more detail later below. + tracking. That means we need to restore the saved R/X bits. This is + described in more detail later below. -2. Write-Protection: The SPTE is present and the fault is - caused by write-protect. That means we just need to change the W bit of - the spte. +2. Write-Protection: The SPTE is present and the fault is caused by + write-protect. That means we just need to change the W bit of the spte. -What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and -SPTE_MMU_WRITEABLE bit on the spte: +What we use to avoid all the race is the Host-writable bit and MMU-writable bit +on the spte: -- SPTE_HOST_WRITEABLE means the gfn is writable on host. -- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when - the gfn is writable on guest mmu and it is not write-protected by shadow - page write-protection. +- Host-writable means the gfn is writable in the host kernel page tables and in + its KVM memslot. +- MMU-writable means the gfn is writable in the guest's mmu and it is not + write-protected by shadow page write-protection. On fast page fault path, we will use cmpxchg to atomically set the spte W -bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or -restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This -is safe because whenever changing these bits can be detected by cmpxchg. +bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved +R/X bits if for an access-traced spte, or both. This is safe because whenever +changing these bits can be detected by cmpxchg. But we need carefully check these cases: @@ -96,19 +113,18 @@ will happen: We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap. For direct sp, we can easily avoid it since the spte of direct sp is fixed -to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() -to pin gfn to pfn, because after gfn_to_pfn_atomic(): +to gfn. For indirect sp, we disabled fast page fault for simplicity. + +A solution for indirect sp could be to pin the gfn, for example via +kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning: - We have held the refcount of pfn that means the pfn can not be freed and be reused for another gfn. -- The pfn is writable that means it can not be shared between different gfns +- The pfn is writable and therefore it cannot be shared between different gfns by KSM. Then, we can ensure the dirty bitmaps is correctly set for a gfn. -Currently, to simplify the whole things, we disable fast page fault for -indirect shadow page. - 2) Dirty bit tracking In the origin code, the spte can be fast updated (non-atomically) if the @@ -179,47 +195,62 @@ See the comments in spte_has_volatile_bits() and mmu_spte_update(). Lockless Access Tracking: This is used for Intel CPUs that are using EPT but do not support the EPT A/D -bits. In this case, when the KVM MMU notifier is called to track accesses to a -page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present -by clearing the RWX bits in the PTE and storing the original R & X bits in -some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the -PTE (using the ignored bit 62). When the VM tries to access the page later on, -a fault is generated and the fast page fault mechanism described above is used -to atomically restore the PTE to a Present state. The W bit is not saved when -the PTE is marked for access tracking and during restoration to the Present -state, the W bit is set depending on whether or not it was a write access. If -it wasn't, then the W bit will remain clear until a write access happens, at -which time it will be set using the Dirty tracking mechanism described above. +bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and +when the KVM MMU notifier is called to track accesses to a page (via +kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware +by clearing the RWX bits in the PTE and storing the original R & X bits in more +unused/ignored bits. When the VM tries to access the page later on, a fault is +generated and the fast page fault mechanism described above is used to +atomically restore the PTE to a Present state. The W bit is not saved when the +PTE is marked for access tracking and during restoration to the Present state, +the W bit is set depending on whether or not it was a write access. If it +wasn't, then the W bit will remain clear until a write access happens, at which +time it will be set using the Dirty tracking mechanism described above. 3. Reference ------------ -:Name: kvm_lock +``kvm_lock`` +^^^^^^^^^^^^ + :Type: mutex :Arch: any :Protects: - vm_list -:Name: kvm_count_lock +``kvm_count_lock`` +^^^^^^^^^^^^^^^^^^ + :Type: raw_spinlock_t :Arch: any :Protects: - hardware virtualization enable/disable :Comment: 'raw' because hardware enabling/disabling must be atomic /wrt migration. -:Name: kvm_arch::tsc_write_lock -:Type: raw_spinlock +``kvm->mn_invalidate_lock`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:Type: spinlock_t +:Arch: any +:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait + +``kvm_arch::tsc_write_lock`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:Type: raw_spinlock_t :Arch: x86 :Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - tsc offset in vmcb :Comment: 'raw' because updating the tsc offsets must not be preempted. -:Name: kvm->mmu_lock -:Type: spinlock_t +``kvm->mmu_lock`` +^^^^^^^^^^^^^^^^^ +:Type: spinlock_t or rwlock_t :Arch: any :Protects: -shadow page/shadow tlb entry :Comment: it is a spinlock since it is used in mmu notifier. -:Name: kvm->srcu +``kvm->srcu`` +^^^^^^^^^^^^^ :Type: srcu lock :Arch: any :Protects: - kvm->memslots @@ -230,10 +261,20 @@ which time it will be set using the Dirty tracking mechanism described above. The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu if it is needed by multiple functions. -:Name: blocked_vcpu_on_cpu_lock +``kvm->slots_arch_lock`` +^^^^^^^^^^^^^^^^^^^^^^^^ +:Type: mutex +:Arch: any (only needed on x86 though) +:Protects: any arch-specific fields of memslots that have to be modified + in a ``kvm->srcu`` read-side critical section. +:Comment: must be held before reading the pointer to the current memslots, + until after all changes to the memslots are complete + +``wakeup_vcpus_on_cpu_lock`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :Type: spinlock_t :Arch: x86 -:Protects: blocked_vcpu_on_cpu +:Protects: wakeup_vcpus_on_cpu :Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. When VT-d posted-interrupts is supported and the VM has assigned devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu |