aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/virt/kvm/locking.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm/locking.rst')
-rw-r--r--Documentation/virt/kvm/locking.rst123
1 files changed, 82 insertions, 41 deletions
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index c02291beac3f..845a561629f1 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,7 +16,25 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
them together is quite rare.
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
+ synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock
+ can be taken inside a kvm->srcu read-side critical section,
+ while kvm->slots_lock cannot.
+
+- kvm->mn_active_invalidate_count ensures that pairs of
+ invalidate_range_start() and invalidate_range_end() callbacks
+ use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
+ are taken on the waiting side in install_new_memslots, so MMU notifiers
+ must not take either kvm->slots_lock or kvm->slots_arch_lock.
+
+On x86:
+
+- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+
+- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
+ kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
+ cannot be taken without already holding kvm->arch.mmu_lock (typically with
+ ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
Everything else is a leaf: no other lock is taken inside the critical
sections.
@@ -31,25 +49,24 @@ the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:
1. Access Tracking: The SPTE is not present, but it is marked for access
- tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
- restore the saved R/X bits. This is described in more detail later below.
+ tracking. That means we need to restore the saved R/X bits. This is
+ described in more detail later below.
-2. Write-Protection: The SPTE is present and the fault is
- caused by write-protect. That means we just need to change the W bit of
- the spte.
+2. Write-Protection: The SPTE is present and the fault is caused by
+ write-protect. That means we just need to change the W bit of the spte.
-What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
-SPTE_MMU_WRITEABLE bit on the spte:
+What we use to avoid all the race is the Host-writable bit and MMU-writable bit
+on the spte:
-- SPTE_HOST_WRITEABLE means the gfn is writable on host.
-- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
- the gfn is writable on guest mmu and it is not write-protected by shadow
- page write-protection.
+- Host-writable means the gfn is writable in the host kernel page tables and in
+ its KVM memslot.
+- MMU-writable means the gfn is writable in the guest's mmu and it is not
+ write-protected by shadow page write-protection.
On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
-restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
-is safe because whenever changing these bits can be detected by cmpxchg.
+bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
+R/X bits if for an access-traced spte, or both. This is safe because whenever
+changing these bits can be detected by cmpxchg.
But we need carefully check these cases:
@@ -96,19 +113,18 @@ will happen:
We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
For direct sp, we can easily avoid it since the spte of direct sp is fixed
-to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
-to pin gfn to pfn, because after gfn_to_pfn_atomic():
+to gfn. For indirect sp, we disabled fast page fault for simplicity.
+
+A solution for indirect sp could be to pin the gfn, for example via
+kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
- We have held the refcount of pfn that means the pfn can not be freed and
be reused for another gfn.
-- The pfn is writable that means it can not be shared between different gfns
+- The pfn is writable and therefore it cannot be shared between different gfns
by KSM.
Then, we can ensure the dirty bitmaps is correctly set for a gfn.
-Currently, to simplify the whole things, we disable fast page fault for
-indirect shadow page.
-
2) Dirty bit tracking
In the origin code, the spte can be fast updated (non-atomically) if the
@@ -179,47 +195,62 @@ See the comments in spte_has_volatile_bits() and mmu_spte_update().
Lockless Access Tracking:
This is used for Intel CPUs that are using EPT but do not support the EPT A/D
-bits. In this case, when the KVM MMU notifier is called to track accesses to a
-page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
-by clearing the RWX bits in the PTE and storing the original R & X bits in
-some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
-PTE (using the ignored bit 62). When the VM tries to access the page later on,
-a fault is generated and the fast page fault mechanism described above is used
-to atomically restore the PTE to a Present state. The W bit is not saved when
-the PTE is marked for access tracking and during restoration to the Present
-state, the W bit is set depending on whether or not it was a write access. If
-it wasn't, then the W bit will remain clear until a write access happens, at
-which time it will be set using the Dirty tracking mechanism described above.
+bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
+when the KVM MMU notifier is called to track accesses to a page (via
+kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
+by clearing the RWX bits in the PTE and storing the original R & X bits in more
+unused/ignored bits. When the VM tries to access the page later on, a fault is
+generated and the fast page fault mechanism described above is used to
+atomically restore the PTE to a Present state. The W bit is not saved when the
+PTE is marked for access tracking and during restoration to the Present state,
+the W bit is set depending on whether or not it was a write access. If it
+wasn't, then the W bit will remain clear until a write access happens, at which
+time it will be set using the Dirty tracking mechanism described above.
3. Reference
------------
-:Name: kvm_lock
+``kvm_lock``
+^^^^^^^^^^^^
+
:Type: mutex
:Arch: any
:Protects: - vm_list
-:Name: kvm_count_lock
+``kvm_count_lock``
+^^^^^^^^^^^^^^^^^^
+
:Type: raw_spinlock_t
:Arch: any
:Protects: - hardware virtualization enable/disable
:Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
migration.
-:Name: kvm_arch::tsc_write_lock
-:Type: raw_spinlock
+``kvm->mn_invalidate_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: spinlock_t
+:Arch: any
+:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait
+
+``kvm_arch::tsc_write_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: raw_spinlock_t
:Arch: x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
- tsc offset in vmcb
:Comment: 'raw' because updating the tsc offsets must not be preempted.
-:Name: kvm->mmu_lock
-:Type: spinlock_t
+``kvm->mmu_lock``
+^^^^^^^^^^^^^^^^^
+:Type: spinlock_t or rwlock_t
:Arch: any
:Protects: -shadow page/shadow tlb entry
:Comment: it is a spinlock since it is used in mmu notifier.
-:Name: kvm->srcu
+``kvm->srcu``
+^^^^^^^^^^^^^
:Type: srcu lock
:Arch: any
:Protects: - kvm->memslots
@@ -230,10 +261,20 @@ which time it will be set using the Dirty tracking mechanism described above.
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
if it is needed by multiple functions.
-:Name: blocked_vcpu_on_cpu_lock
+``kvm->slots_arch_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^
+:Type: mutex
+:Arch: any (only needed on x86 though)
+:Protects: any arch-specific fields of memslots that have to be modified
+ in a ``kvm->srcu`` read-side critical section.
+:Comment: must be held before reading the pointer to the current memslots,
+ until after all changes to the memslots are complete
+
+``wakeup_vcpus_on_cpu_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type: spinlock_t
:Arch: x86
-:Protects: blocked_vcpu_on_cpu
+:Protects: wakeup_vcpus_on_cpu
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
When VT-d posted-interrupts is supported and the VM has assigned
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu