aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/virt/kvm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm')
-rw-r--r--Documentation/virt/kvm/api.rst2747
-rw-r--r--Documentation/virt/kvm/arm/hyp-abi.rst21
-rw-r--r--Documentation/virt/kvm/arm/hypercalls.rst138
-rw-r--r--Documentation/virt/kvm/arm/index.rst3
-rw-r--r--Documentation/virt/kvm/arm/psci.rst77
-rw-r--r--Documentation/virt/kvm/arm/ptp_kvm.rst25
-rw-r--r--Documentation/virt/kvm/arm/pvtime.rst6
-rw-r--r--Documentation/virt/kvm/cpuid.rst107
-rw-r--r--Documentation/virt/kvm/devices/arm-vgic-its.rst2
-rw-r--r--Documentation/virt/kvm/devices/arm-vgic-v3.rst2
-rw-r--r--Documentation/virt/kvm/devices/s390_flic.rst11
-rw-r--r--Documentation/virt/kvm/devices/vcpu.rst163
-rw-r--r--Documentation/virt/kvm/devices/vm.rst3
-rw-r--r--Documentation/virt/kvm/devices/xics.rst2
-rw-r--r--Documentation/virt/kvm/devices/xive.rst2
-rw-r--r--Documentation/virt/kvm/index.rst22
-rw-r--r--Documentation/virt/kvm/locking.rst123
-rw-r--r--Documentation/virt/kvm/review-checklist.rst2
-rw-r--r--Documentation/virt/kvm/s390/index.rst13
-rw-r--r--Documentation/virt/kvm/s390/s390-diag.rst (renamed from Documentation/virt/kvm/s390-diag.rst)33
-rw-r--r--Documentation/virt/kvm/s390/s390-pv-boot.rst84
-rw-r--r--Documentation/virt/kvm/s390/s390-pv-dump.rst64
-rw-r--r--Documentation/virt/kvm/s390/s390-pv.rst116
-rw-r--r--Documentation/virt/kvm/vcpu-requests.rst53
-rw-r--r--Documentation/virt/kvm/x86/amd-memory-encryption.rst (renamed from Documentation/virt/kvm/amd-memory-encryption.rst)204
-rw-r--r--Documentation/virt/kvm/x86/cpuid.rst124
-rw-r--r--Documentation/virt/kvm/x86/errata.rst39
-rw-r--r--Documentation/virt/kvm/x86/halt-polling.rst (renamed from Documentation/virt/kvm/halt-polling.rst)0
-rw-r--r--Documentation/virt/kvm/x86/hypercalls.rst (renamed from Documentation/virt/kvm/hypercalls.rst)25
-rw-r--r--Documentation/virt/kvm/x86/index.rst19
-rw-r--r--Documentation/virt/kvm/x86/mmu.rst (renamed from Documentation/virt/kvm/mmu.rst)31
-rw-r--r--Documentation/virt/kvm/x86/msr.rst (renamed from Documentation/virt/kvm/msr.rst)132
-rw-r--r--Documentation/virt/kvm/x86/nested-vmx.rst (renamed from Documentation/virt/kvm/nested-vmx.rst)13
-rw-r--r--Documentation/virt/kvm/x86/running-nested-guests.rst278
-rw-r--r--Documentation/virt/kvm/x86/timekeeping.rst (renamed from Documentation/virt/kvm/timekeeping.rst)0
35 files changed, 4068 insertions, 616 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ebd383fba939..eee9f857a986 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -55,7 +55,7 @@ not cause harm to the host, their actual behavior is not guaranteed by
the API. See "General description" for details on the ioctl usage
model that is supported by KVM.
-It is important to note that althought VM ioctls may only be issued from
+It is important to note that although VM ioctls may only be issued from
the process that created the VM, a VM's lifecycle is associated with its
file descriptor, not its creator (process). In other words, the VM and
its resources, *including the associated address space*, are not freed
@@ -65,7 +65,7 @@ not be freed until both the parent (original) process and its child have
put their references to the VM's file descriptor.
Because a VM's resources are not freed until the last reference to its
-file descriptor is released, creating additional references to a VM via
+file descriptor is released, creating additional references to a VM
via fork(), dup(), etc... without careful consideration is strongly
discouraged and may have unwanted side effects, e.g. memory allocated
by and on behalf of the VM's process may not be freed/unaccounted when
@@ -151,12 +151,6 @@ In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
-To use hardware assisted virtualization on MIPS (VZ ASE) rather than
-the default trap & emulate implementation (which changes the virtual
-memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
-flag KVM_VM_MIPS_VZ.
-
-
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -182,6 +176,9 @@ is dependent on the CPU capability and the kernel configuration. The limit can
be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION
ioctl() at run-time.
+Creation of the VM will fail if the requested IPA size (whether it is
+implicit or explicit) is unsupported on the host.
+
Please note that configuring the IPA size does not affect the capability
exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects
size of the address translated by the stage2 level (guest physical to
@@ -201,7 +198,7 @@ Errors:
====== ============================================================
EFAULT the msr index list cannot be read from or written to
- E2BIG the msr index list is to be to fit in the array specified by
+ E2BIG the msr index list is too big to fit in the array specified by
the user.
====== ============================================================
@@ -262,6 +259,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
memory region. This ioctl returns the size of that region. See the
KVM_RUN documentation for details.
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+ KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+ this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+ KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+ KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
+ KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
4.6 KVM_SET_MEMORY_REGION
-------------------------
@@ -348,15 +357,17 @@ since the last call to this ioctl. Bit 0 is the first page in the
memory slot. Ensure the entire structure is cleared to avoid padding
issues.
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of slot field specifies
+the address space for which you want to return the dirty bitmap. See
+KVM_SET_USER_MEMORY_REGION for details on the usage of slot field.
The bits in the dirty bitmap are cleared before the ioctl returns, unless
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information,
see the description of the capability.
+Note that the Xen shared info page, if configured, shall always be assumed
+to be dirty. KVM will not explicitly mark it such.
+
4.9 KVM_SET_MEMORY_ALIAS
------------------------
@@ -380,9 +391,14 @@ This ioctl is obsolete and has been removed.
Errors:
- ===== =============================
+ ======= ==============================================================
EINTR an unmasked signal is pending
- ===== =============================
+ ENOEXEC the vcpu hasn't been initialized or the guest tried to execute
+ instructions from device memory (arm64)
+ ENOSYS data abort outside memslots with no syndrome info and
+ KVM_CAP_ARM_NISV_TO_USER not enabled (arm64)
+ EPERM SVE feature set but not finalized (arm64)
+ ======= ==============================================================
This ioctl is used to run a guest virtual cpu. While there are no
explicit parameters, there is an implicit parameter block that can be
@@ -395,7 +411,7 @@ kvm_run' (see below).
-----------------
:Capability: basic
-:Architectures: all except ARM, arm64
+:Architectures: all except arm64
:Type: vcpu ioctl
:Parameters: struct kvm_regs (out)
:Returns: 0 on success, -1 on error
@@ -428,7 +444,7 @@ Reads the general purpose registers from the vcpu.
-----------------
:Capability: basic
-:Architectures: all except ARM, arm64
+:Architectures: all except arm64
:Type: vcpu ioctl
:Parameters: struct kvm_regs (in)
:Returns: 0 on success, -1 on error
@@ -513,7 +529,7 @@ translation mode.
------------------
:Capability: basic
-:Architectures: x86, ppc, mips
+:Architectures: x86, ppc, mips, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_interrupt (in)
:Returns: 0 on success, negative on failure.
@@ -536,7 +552,7 @@ X86:
========= ===================================
0 on success,
-EEXIST if an interrupt is already enqueued
- -EINVAL the the irq number is invalid
+ -EINVAL the irq number is invalid
-ENXIO if the PIC is in the kernel
-EFAULT if the pointer is invalid
========= ===================================
@@ -582,6 +598,23 @@ interrupt number dequeues the interrupt.
This is an asynchronous vcpu ioctl and can be invoked from any thread.
+RISC-V:
+^^^^^^^
+
+Queues an external interrupt to be injected into the virutal CPU. This ioctl
+is overloaded with 2 different irq values:
+
+a) KVM_INTERRUPT_SET
+
+ This sets external interrupt for a virtual CPU and it will receive
+ once it is ready.
+
+b) KVM_INTERRUPT_UNSET
+
+ This clears pending external interrupt for a virtual CPU.
+
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
4.17 KVM_DEBUG_GUEST
--------------------
@@ -669,6 +702,15 @@ MSRs that have been set successfully.
Defines the vcpu responses to the cpuid instruction. Applications
should use the KVM_SET_CPUID2 ioctl if available.
+Caveat emptor:
+ - If this IOCTL fails, KVM gives no guarantees that previous valid CPUID
+ configuration (if there is) is not corrupted. Userspace can get a copy
+ of the resulting CPUID configuration through KVM_GET_CPUID2 in case.
+ - Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
+ after running the guest, may cause guest instability.
+ - Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
+ may cause guest instability.
+
::
struct kvm_cpuid_entry {
@@ -776,7 +818,7 @@ Writes the floating point state to the vcpu.
-----------------------
:Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390)
-:Architectures: x86, ARM, arm64, s390
+:Architectures: x86, arm64, s390
:Type: vm ioctl
:Parameters: none
:Returns: 0 on success, -1 on error
@@ -785,7 +827,7 @@ Creates an interrupt controller model in the kernel.
On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up
future vcpus to have a local APIC. IRQ routing for GSIs 0-15 is set to both
PIC and IOAPIC; GSI 16-23 only go to the IOAPIC.
-On ARM/arm64, a GICv2 is created. Any other GIC versions require the usage of
+On arm64, a GICv2 is created. Any other GIC versions require the usage of
KVM_CREATE_DEVICE, which also supports creating a GICv2. Using
KVM_CREATE_DEVICE is preferred over KVM_CREATE_IRQCHIP for GICv2.
On s390, a dummy irq routing table is created.
@@ -798,7 +840,7 @@ before KVM_CREATE_IRQCHIP can be used.
-----------------
:Capability: KVM_CAP_IRQCHIP
-:Architectures: x86, arm, arm64
+:Architectures: x86, arm64
:Type: vm ioctl
:Parameters: struct kvm_irq_level
:Returns: 0 on success, -1 on error
@@ -822,12 +864,12 @@ capability is present (or unless it is not using the in-kernel irqchip,
of course).
-ARM/arm64 can signal an interrupt either at the CPU level, or at the
+arm64 can signal an interrupt either at the CPU level, or at the
in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to
use PPIs designated for specific cpus. The irq field is interpreted
like this::
-  bits: | 31 ... 28 | 27 ... 24 | 23 ... 16 | 15 ... 0 |
+ bits: | 31 ... 28 | 27 ... 24 | 23 ... 16 | 15 ... 0 |
field: | vcpu2_index | irq_type | vcpu_index | irq_id |
The irq_type field has the following values:
@@ -848,7 +890,7 @@ When KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 is supported, the target vcpu is
identified as (256 * vcpu2_index + vcpu_index). Otherwise, vcpu2_index
must be zero.
-Note that on arm/arm64, the KVM_CAP_IRQCHIP capability only conditions
+Note that on arm64, the KVM_CAP_IRQCHIP capability only conditions
injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always
be used for a userspace interrupt controller.
@@ -940,6 +982,24 @@ memory.
__u8 pad2[30];
};
+If certain flags are returned from the KVM_CAP_XEN_HVM check, they may
+be set in the flags field of this ioctl:
+
+The KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag requests KVM to generate
+the contents of the hypercall page automatically; hypercalls will be
+intercepted and passed to userspace through KVM_EXIT_XEN. In this
+ase, all of the blob size and address fields must be zero.
+
+The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates to KVM that userspace
+will always use the KVM_XEN_HVM_EVTCHN_SEND ioctl to deliver event
+channel interrupts rather than manipulating the guest's shared_info
+structures directly. This, in turn, may allow KVM to enable features
+such as intercepting the SCHEDOP_poll hypercall to accelerate PV
+spinlock operation for the guest. Userspace may still use the ioctl
+to deliver events if it was advertised, even if userspace does not
+send this indication that it will always do so
+
+No other flags are currently valid in the struct kvm_xen_hvm_config.
4.29 KVM_GET_CLOCK
------------------
@@ -957,20 +1017,37 @@ such as migration.
When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
set of bits that KVM can return in struct kvm_clock_data's flag member.
-The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned
-value is the exact kvmclock value seen by all VCPUs at the instant
-when KVM_GET_CLOCK was called. If clear, the returned value is simply
-CLOCK_MONOTONIC plus a constant offset; the offset can be modified
-with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock,
-but the exact value read by each VCPU could differ, because the host
-TSC is not stable.
+The following flags are defined:
+
+KVM_CLOCK_TSC_STABLE
+ If set, the returned value is the exact kvmclock
+ value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
+ If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
+ offset; the offset can be modified with KVM_SET_CLOCK. KVM will try
+ to make all VCPUs follow this clock, but the exact value read by each
+ VCPU could differ, because the host TSC is not stable.
+
+KVM_CLOCK_REALTIME
+ If set, the `realtime` field in the kvm_clock_data
+ structure is populated with the value of the host's real time
+ clocksource at the instant when KVM_GET_CLOCK was called. If clear,
+ the `realtime` field does not contain a value.
+
+KVM_CLOCK_HOST_TSC
+ If set, the `host_tsc` field in the kvm_clock_data
+ structure is populated with the value of the host's timestamp counter (TSC)
+ at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
+ does not contain a value.
::
struct kvm_clock_data {
__u64 clock; /* kvmclock current value */
__u32 flags;
- __u32 pad[9];
+ __u32 pad0;
+ __u64 realtime;
+ __u64 host_tsc;
+ __u32 pad[4];
};
@@ -987,12 +1064,25 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
such as migration.
+The following flags can be passed:
+
+KVM_CLOCK_REALTIME
+ If set, KVM will compare the value of the `realtime` field
+ with the value of the host's real time clocksource at the instant when
+ KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
+ kvmclock value that will be provided to guests.
+
+Other flags returned by ``KVM_GET_CLOCK`` are accepted but ignored.
+
::
struct kvm_clock_data {
__u64 clock; /* kvmclock current value */
__u32 flags;
- __u32 pad[9];
+ __u32 pad0;
+ __u64 realtime;
+ __u64 host_tsc;
+ __u32 pad[4];
};
@@ -1001,7 +1091,7 @@ such as migration.
:Capability: KVM_CAP_VCPU_EVENTS
:Extended by: KVM_CAP_INTR_SHADOW
-:Architectures: x86, arm, arm64
+:Architectures: x86, arm64
:Type: vcpu ioctl
:Parameters: struct kvm_vcpu_event (out)
:Returns: 0 on success, -1 on error
@@ -1060,8 +1150,12 @@ The following bits are defined in the flags field:
fields contain a valid state. This bit will be set whenever
KVM_CAP_EXCEPTION_PAYLOAD is enabled.
-ARM/ARM64:
-^^^^^^^^^^
+- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
+ triple_fault_pending field contains a valid state. This bit will
+ be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
+
+ARM64:
+^^^^^^
If the guest accesses a device that is being emulated by the host kernel in
such a way that a real device would generate a physical SError, KVM may make
@@ -1120,7 +1214,7 @@ directly to the virtual CPU).
:Capability: KVM_CAP_VCPU_EVENTS
:Extended by: KVM_CAP_INTR_SHADOW
-:Architectures: x86, arm, arm64
+:Architectures: x86, arm64
:Type: vcpu ioctl
:Parameters: struct kvm_vcpu_event (in)
:Returns: 0 on success, -1 on error
@@ -1155,8 +1249,12 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.
-ARM/ARM64:
-^^^^^^^^^^
+If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+can be set in flags field to signal that the triple_fault field contains
+a valid state and shall be written into the VCPU.
+
+ARM64:
+^^^^^^
User space may need to inject several types of events to the guest.
@@ -1260,6 +1358,9 @@ field userspace_addr, which must point at user addressable memory for
the entire memory slot size. Any object may back this memory, including
anonymous memory, ordinary files, and hugetlbfs.
+On architectures that support a form of address tagging, userspace_addr must
+be an untagged address.
+
It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.
@@ -1305,14 +1406,14 @@ documentation when it pops into existence).
-------------------
:Capability: KVM_CAP_ENABLE_CAP
-:Architectures: mips, ppc, s390
+:Architectures: mips, ppc, s390, x86
:Type: vcpu ioctl
:Parameters: struct kvm_enable_cap (in)
:Returns: 0 on success; -1 on error
:Capability: KVM_CAP_ENABLE_CAP_VM
:Architectures: all
-:Type: vcpu ioctl
+:Type: vm ioctl
:Parameters: struct kvm_enable_cap (in)
:Returns: 0 on success; -1 on error
@@ -1360,7 +1461,7 @@ for vm-wide capabilities.
---------------------
:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm, arm64
+:Architectures: x86, s390, arm64, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (out)
:Returns: 0 on success; -1 on error
@@ -1377,7 +1478,8 @@ uniprocessor guests).
Possible values are:
========================== ===============================================
- KVM_MP_STATE_RUNNABLE the vcpu is currently running [x86,arm/arm64]
+ KVM_MP_STATE_RUNNABLE the vcpu is currently running
+ [x86,arm64,riscv]
KVM_MP_STATE_UNINITIALIZED the vcpu is an application processor (AP)
which has not yet received an INIT signal [x86]
KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is
@@ -1386,20 +1488,49 @@ Possible values are:
is waiting for an interrupt [x86]
KVM_MP_STATE_SIPI_RECEIVED the vcpu has just received a SIPI (vector
accessible via KVM_GET_VCPU_EVENTS) [x86]
- KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64]
+ KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm64,riscv]
KVM_MP_STATE_CHECK_STOP the vcpu is in a special error state [s390]
KVM_MP_STATE_OPERATING the vcpu is operating (running or halted)
[s390]
KVM_MP_STATE_LOAD the vcpu is in a special load/startup state
[s390]
+ KVM_MP_STATE_SUSPENDED the vcpu is in a suspend state and is waiting
+ for a wakeup event [arm64]
========================== ===============================================
On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
in-kernel irqchip, the multiprocessing state must be maintained by userspace on
these architectures.
-For arm/arm64:
-^^^^^^^^^^^^^^
+For arm64:
+^^^^^^^^^^
+
+If a vCPU is in the KVM_MP_STATE_SUSPENDED state, KVM will emulate the
+architectural execution of a WFI instruction.
+
+If a wakeup event is recognized, KVM will exit to userspace with a
+KVM_SYSTEM_EVENT exit, where the event type is KVM_SYSTEM_EVENT_WAKEUP. If
+userspace wants to honor the wakeup, it must set the vCPU's MP state to
+KVM_MP_STATE_RUNNABLE. If it does not, KVM will continue to await a wakeup
+event in subsequent calls to KVM_RUN.
+
+.. warning::
+
+ If userspace intends to keep the vCPU in a SUSPENDED state, it is
+ strongly recommended that userspace take action to suppress the
+ wakeup event (such as masking an interrupt). Otherwise, subsequent
+ calls to KVM_RUN will immediately exit with a KVM_SYSTEM_EVENT_WAKEUP
+ event and inadvertently waste CPU cycles.
+
+ Additionally, if userspace takes action to suppress a wakeup event,
+ it is strongly recommended that it also restores the vCPU to its
+ original state when the vCPU is made RUNNABLE again. For example,
+ if userspace masked a pending interrupt to suppress the wakeup,
+ the interrupt should be unmasked before returning control to the
+ guest.
+
+For riscv:
+^^^^^^^^^^
The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
@@ -1408,7 +1539,7 @@ KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
---------------------
:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm, arm64
+:Architectures: x86, s390, arm64, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (in)
:Returns: 0 on success; -1 on error
@@ -1420,8 +1551,8 @@ On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
in-kernel irqchip, the multiprocessing state must be maintained by userspace on
these architectures.
-For arm/arm64:
-^^^^^^^^^^^^^^
+For arm64/riscv:
+^^^^^^^^^^^^^^^^
The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
@@ -1461,7 +1592,8 @@ Fails if any VCPU has already been created.
Define which vcpu is the Bootstrap Processor (BSP). Values are the same
as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
-is vcpu 0.
+is vcpu 0. This ioctl has to be called before vcpu creation,
+otherwise it will return EBUSY error.
4.42 KVM_GET_XSAVE
@@ -1478,6 +1610,7 @@ is vcpu 0.
struct kvm_xsave {
__u32 region[1024];
+ __u32 extra[0];
};
This ioctl would copy current vcpu's xsave struct to the userspace.
@@ -1486,7 +1619,7 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
4.43 KVM_SET_XSAVE
------------------
-:Capability: KVM_CAP_XSAVE
+:Capability: KVM_CAP_XSAVE and KVM_CAP_XSAVE2
:Architectures: x86
:Type: vcpu ioctl
:Parameters: struct kvm_xsave (in)
@@ -1497,9 +1630,18 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
struct kvm_xsave {
__u32 region[1024];
+ __u32 extra[0];
};
-This ioctl would copy userspace's xsave struct to the kernel.
+This ioctl would copy userspace's xsave struct to the kernel. It copies
+as many bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
+when invoked on the vm file descriptor. The size value returned by
+KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will always be at least 4096.
+Currently, it is only greater than 4096 if a dynamic feature has been
+enabled with ``arch_prctl()``, but this may change in the future.
+
+The offsets of the state save areas in struct kvm_xsave follow the
+contents of CPUID leaf 0xD on the host.
4.44 KVM_GET_XCRS
@@ -1574,8 +1716,8 @@ This ioctl would set vcpu's xcr to the value userspace specified.
};
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0)
- #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1)
- #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2)
+ #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) /* deprecated */
+ #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) /* deprecated */
struct kvm_cpuid_entry2 {
__u32 function;
@@ -1596,6 +1738,10 @@ userspace capabilities, and with user requirements (for example, the
user may wish to constrain cpuid to emulate older hardware, or for
feature consistency across a cluster).
+Dynamically-enabled feature bits need to be requested with
+``arch_prctl()`` before calling this ioctl. Feature bits that have not
+been requested are excluded from the result.
+
Note that certain capabilities, such as KVM_CAP_X86_DISABLE_EXITS, may
expose cpuid features (e.g. MONITOR) which are not supported by kvm in
its default configuration. If userspace enables such capabilities, it
@@ -1626,13 +1772,6 @@ emulate them efficiently. The fields in each entry are defined as follows:
KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
if the index field is valid
- KVM_CPUID_FLAG_STATEFUL_FUNC:
- if cpuid for this function returns different values for successive
- invocations; there will be several entries with the same function,
- all with this flag set
- KVM_CPUID_FLAG_STATE_READ_NEXT:
- for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
- the first entry to be read by a cpu
eax, ebx, ecx, edx:
the values returned by the cpuid instruction for
@@ -1682,14 +1821,14 @@ The flags bitmap is defined as::
------------------------
:Capability: KVM_CAP_IRQ_ROUTING
-:Architectures: x86 s390 arm arm64
+:Architectures: x86 s390 arm64
:Type: vm ioctl
:Parameters: struct kvm_irq_routing (in)
:Returns: 0 on success, -1 on error
Sets the GSI routing table entries, overwriting any previously set entries.
-On arm/arm64, GSI routing has the following limitation:
+On arm64, GSI routing has the following limitation:
- GSI routing does not apply to KVM_IRQ_LINE but only to KVM_IRQFD.
@@ -1715,6 +1854,7 @@ No flags are specified so far, the corresponding field must be set to zero.
struct kvm_irq_routing_msi msi;
struct kvm_irq_routing_s390_adapter adapter;
struct kvm_irq_routing_hv_sint hv_sint;
+ struct kvm_irq_routing_xen_evtchn xen_evtchn;
__u32 pad[8];
} u;
};
@@ -1724,6 +1864,7 @@ No flags are specified so far, the corresponding field must be set to zero.
#define KVM_IRQ_ROUTING_MSI 2
#define KVM_IRQ_ROUTING_S390_ADAPTER 3
#define KVM_IRQ_ROUTING_HV_SINT 4
+ #define KVM_IRQ_ROUTING_XEN_EVTCHN 5
flags:
@@ -1775,26 +1916,43 @@ address_hi must be zero.
__u32 sint;
};
+ struct kvm_irq_routing_xen_evtchn {
+ __u32 port;
+ __u32 vcpu;
+ __u32 priority;
+ };
+
+
+When KVM_CAP_XEN_HVM includes the KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL bit
+in its indication of supported features, routing to Xen event channels
+is supported. Although the priority field is present, only the value
+KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL is supported, which means delivery by
+2 level event channels. FIFO event channel support may be added in
+the future.
+
4.55 KVM_SET_TSC_KHZ
--------------------
-:Capability: KVM_CAP_TSC_CONTROL
+:Capability: KVM_CAP_TSC_CONTROL / KVM_CAP_VM_TSC_CONTROL
:Architectures: x86
-:Type: vcpu ioctl
+:Type: vcpu ioctl / vm ioctl
:Parameters: virtual tsc_khz
:Returns: 0 on success, -1 on error
Specifies the tsc frequency for the virtual machine. The unit of the
frequency is KHz.
+If the KVM_CAP_VM_TSC_CONTROL capability is advertised, this can also
+be used as a vm ioctl to set the initial tsc frequency of subsequently
+created vCPUs.
4.56 KVM_GET_TSC_KHZ
--------------------
-:Capability: KVM_CAP_GET_TSC_KHZ
+:Capability: KVM_CAP_GET_TSC_KHZ / KVM_CAP_VM_TSC_CONTROL
:Architectures: x86
-:Type: vcpu ioctl
+:Type: vcpu ioctl / vm ioctl
:Parameters: none
:Returns: virtual tsc-khz on success, negative value on error
@@ -2116,9 +2274,10 @@ prior to calling the KVM_RUN ioctl.
Errors:
====== ============================================================
-  ENOENT   no such register
-  EINVAL   invalid register ID, or no such register
-  EPERM    (arm64) register access not allowed before vcpu finalization
+ ENOENT no such register
+ EINVAL invalid register ID, or no such register or used with VMs in
+ protected virtualization mode on s390
+ EPERM (arm64) register access not allowed before vcpu finalization
====== ============================================================
(These error codes are indicative only: do not rely on a specific error
@@ -2162,9 +2321,12 @@ registers, find a list below:
PPC KVM_REG_PPC_MMCRA 64
PPC KVM_REG_PPC_MMCR2 64
PPC KVM_REG_PPC_MMCRS 64
+ PPC KVM_REG_PPC_MMCR3 64
PPC KVM_REG_PPC_SIAR 64
PPC KVM_REG_PPC_SDAR 64
PPC KVM_REG_PPC_SIER 64
+ PPC KVM_REG_PPC_SIER2 64
+ PPC KVM_REG_PPC_SIER3 64
PPC KVM_REG_PPC_PMC1 32
PPC KVM_REG_PPC_PMC2 32
PPC KVM_REG_PPC_PMC3 32
@@ -2248,6 +2410,8 @@ registers, find a list below:
PPC KVM_REG_PPC_PSSCR 64
PPC KVM_REG_PPC_DEC_EXPIRY 64
PPC KVM_REG_PPC_PTCR 64
+ PPC KVM_REG_PPC_DAWR1 64
+ PPC KVM_REG_PPC_DAWRX1 64
PPC KVM_REG_PPC_TM_GPR0 64
...
PPC KVM_REG_PPC_TM_GPR31 64
@@ -2487,6 +2651,24 @@ EINVAL.
After the vcpu's SVE configuration is finalized, further attempts to
write this register will fail with EPERM.
+arm64 bitmap feature firmware pseudo-registers have the following bit pattern::
+
+ 0x6030 0000 0016 <regno:16>
+
+The bitmap feature firmware registers exposes the hypercall services that
+are available for userspace to configure. The set bits corresponds to the
+services that are available for the guests to access. By default, KVM
+sets all the supported bits during VM initialization. The userspace can
+discover the available services via KVM_GET_ONE_REG, and write back the
+bitmap corresponding to the features that it wishes guests to see via
+KVM_SET_ONE_REG.
+
+Note: These registers are immutable once any of the vCPUs of the VM has
+run at least once. A KVM_SET_ONE_REG in such a scenario will return
+a -EBUSY to userspace.
+
+(See Documentation/virt/kvm/arm/hypercalls.rst for more details.)
+
MIPS registers are mapped using the lower 32 bits. The upper 16 of that is
the register group type:
@@ -2538,6 +2720,144 @@ following id bit patterns::
0x7020 0000 0003 02 <0:3> <reg:5>
+RISC-V registers are mapped using the lower 32 bits. The upper 8 bits of
+that is the register group type.
+
+RISC-V config registers are meant for configuring a Guest VCPU and it has
+the following id bit patterns::
+
+ 0x8020 0000 01 <index into the kvm_riscv_config struct:24> (32bit Host)
+ 0x8030 0000 01 <index into the kvm_riscv_config struct:24> (64bit Host)
+
+Following are the RISC-V config registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0100 0000 isa ISA feature bitmap of Guest VCPU
+======================= ========= =============================================
+
+The isa config register can be read anytime but can only be written before
+a Guest VCPU runs. It will have ISA feature bits matching underlying host
+set by default.
+
+RISC-V core registers represent the general excution state of a Guest VCPU
+and it has the following id bit patterns::
+
+ 0x8020 0000 02 <index into the kvm_riscv_core struct:24> (32bit Host)
+ 0x8030 0000 02 <index into the kvm_riscv_core struct:24> (64bit Host)
+
+Following are the RISC-V core registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0200 0000 regs.pc Program counter
+ 0x80x0 0000 0200 0001 regs.ra Return address
+ 0x80x0 0000 0200 0002 regs.sp Stack pointer
+ 0x80x0 0000 0200 0003 regs.gp Global pointer
+ 0x80x0 0000 0200 0004 regs.tp Task pointer
+ 0x80x0 0000 0200 0005 regs.t0 Caller saved register 0
+ 0x80x0 0000 0200 0006 regs.t1 Caller saved register 1
+ 0x80x0 0000 0200 0007 regs.t2 Caller saved register 2
+ 0x80x0 0000 0200 0008 regs.s0 Callee saved register 0
+ 0x80x0 0000 0200 0009 regs.s1 Callee saved register 1
+ 0x80x0 0000 0200 000a regs.a0 Function argument (or return value) 0
+ 0x80x0 0000 0200 000b regs.a1 Function argument (or return value) 1
+ 0x80x0 0000 0200 000c regs.a2 Function argument 2
+ 0x80x0 0000 0200 000d regs.a3 Function argument 3
+ 0x80x0 0000 0200 000e regs.a4 Function argument 4
+ 0x80x0 0000 0200 000f regs.a5 Function argument 5
+ 0x80x0 0000 0200 0010 regs.a6 Function argument 6
+ 0x80x0 0000 0200 0011 regs.a7 Function argument 7
+ 0x80x0 0000 0200 0012 regs.s2 Callee saved register 2
+ 0x80x0 0000 0200 0013 regs.s3 Callee saved register 3
+ 0x80x0 0000 0200 0014 regs.s4 Callee saved register 4
+ 0x80x0 0000 0200 0015 regs.s5 Callee saved register 5
+ 0x80x0 0000 0200 0016 regs.s6 Callee saved register 6
+ 0x80x0 0000 0200 0017 regs.s7 Callee saved register 7
+ 0x80x0 0000 0200 0018 regs.s8 Callee saved register 8
+ 0x80x0 0000 0200 0019 regs.s9 Callee saved register 9
+ 0x80x0 0000 0200 001a regs.s10 Callee saved register 10
+ 0x80x0 0000 0200 001b regs.s11 Callee saved register 11
+ 0x80x0 0000 0200 001c regs.t3 Caller saved register 3
+ 0x80x0 0000 0200 001d regs.t4 Caller saved register 4
+ 0x80x0 0000 0200 001e regs.t5 Caller saved register 5
+ 0x80x0 0000 0200 001f regs.t6 Caller saved register 6
+ 0x80x0 0000 0200 0020 mode Privilege mode (1 = S-mode or 0 = U-mode)
+======================= ========= =============================================
+
+RISC-V csr registers represent the supervisor mode control/status registers
+of a Guest VCPU and it has the following id bit patterns::
+
+ 0x8020 0000 03 <index into the kvm_riscv_csr struct:24> (32bit Host)
+ 0x8030 0000 03 <index into the kvm_riscv_csr struct:24> (64bit Host)
+
+Following are the RISC-V csr registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x80x0 0000 0300 0000 sstatus Supervisor status
+ 0x80x0 0000 0300 0001 sie Supervisor interrupt enable
+ 0x80x0 0000 0300 0002 stvec Supervisor trap vector base
+ 0x80x0 0000 0300 0003 sscratch Supervisor scratch register
+ 0x80x0 0000 0300 0004 sepc Supervisor exception program counter
+ 0x80x0 0000 0300 0005 scause Supervisor trap cause
+ 0x80x0 0000 0300 0006 stval Supervisor bad address or instruction
+ 0x80x0 0000 0300 0007 sip Supervisor interrupt pending
+ 0x80x0 0000 0300 0008 satp Supervisor address translation and protection
+======================= ========= =============================================
+
+RISC-V timer registers represent the timer state of a Guest VCPU and it has
+the following id bit patterns::
+
+ 0x8030 0000 04 <index into the kvm_riscv_timer struct:24>
+
+Following are the RISC-V timer registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8030 0000 0400 0000 frequency Time base frequency (read-only)
+ 0x8030 0000 0400 0001 time Time value visible to Guest
+ 0x8030 0000 0400 0002 compare Time compare programmed by Guest
+ 0x8030 0000 0400 0003 state Time compare state (1 = ON or 0 = OFF)
+======================= ========= =============================================
+
+RISC-V F-extension registers represent the single precision floating point
+state of a Guest VCPU and it has the following id bit patterns::
+
+ 0x8020 0000 05 <index into the __riscv_f_ext_state struct:24>
+
+Following are the RISC-V F-extension registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8020 0000 0500 0000 f[0] Floating point register 0
+ ...
+ 0x8020 0000 0500 001f f[31] Floating point register 31
+ 0x8020 0000 0500 0020 fcsr Floating point control and status register
+======================= ========= =============================================
+
+RISC-V D-extension registers represent the double precision floating point
+state of a Guest VCPU and it has the following id bit patterns::
+
+ 0x8020 0000 06 <index into the __riscv_d_ext_state struct:24> (fcsr)
+ 0x8030 0000 06 <index into the __riscv_d_ext_state struct:24> (non-fcsr)
+
+Following are the RISC-V D-extension registers:
+
+======================= ========= =============================================
+ Encoding Register Description
+======================= ========= =============================================
+ 0x8030 0000 0600 0000 f[0] Floating point register 0
+ ...
+ 0x8030 0000 0600 001f f[31] Floating point register 31
+ 0x8020 0000 0600 0020 fcsr Floating point control and status register
+======================= ========= =============================================
+
4.69 KVM_GET_ONE_REG
--------------------
@@ -2551,9 +2871,10 @@ following id bit patterns::
Errors include:
======== ============================================================
-  ENOENT   no such register
-  EINVAL   invalid register ID, or no such register
-  EPERM    (arm64) register access not allowed before vcpu finalization
+ ENOENT no such register
+ EINVAL invalid register ID, or no such register or used with VMs in
+ protected virtualization mode on s390
+ EPERM (arm64) register access not allowed before vcpu finalization
======== ============================================================
(These error codes are indicative only: do not rely on a specific error
@@ -2577,13 +2898,15 @@ list in 4.68.
:Parameters: None
:Returns: 0 on success, -1 on error
-This signals to the host kernel that the specified guest is being paused by
-userspace. The host will set a flag in the pvclock structure that is checked
-from the soft lockup watchdog. The flag is part of the pvclock structure that
-is shared between guest and host, specifically the second bit of the flags
+This ioctl sets a flag accessible to the guest indicating that the specified
+vCPU has been paused by the host userspace.
+
+The host will set a flag in the pvclock structure that is checked from the
+soft lockup watchdog. The flag is part of the pvclock structure that is
+shared between guest and host, specifically the second bit of the flags
field of the pvclock_vcpu_time_info structure. It will be set exclusively by
the host and read/cleared exclusively by the guest. The guest operation of
-checking and clearing the flag must an atomic operation so
+checking and clearing the flag must be an atomic operation so
load-link/store-conditional, or equivalent must be used. There are two cases
where the guest will clear the flag: when the soft lockup watchdog timer resets
itself or when a soft lockup is detected. This ioctl can be called any time
@@ -2594,7 +2917,7 @@ after pausing the vcpu, but before it is resumed.
-------------------
:Capability: KVM_CAP_SIGNAL_MSI
-:Architectures: x86 arm arm64
+:Architectures: x86 arm64
:Type: vm ioctl
:Parameters: struct kvm_msi (in)
:Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error
@@ -2683,7 +3006,9 @@ KVM_CREATE_PIT2. The state is returned in the following structure::
Valid flags are::
/* disable PIT in HPET legacy mode */
- #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ /* speaker port data bit enabled */
+ #define KVM_PIT_FLAGS_SPEAKER_DATA_ON 0x00000002
This IOCTL replaces the obsolete KVM_GET_PIT.
@@ -2782,7 +3107,7 @@ into the hash PTE second double word).
--------------
:Capability: KVM_CAP_IRQFD
-:Architectures: x86 s390 arm arm64
+:Architectures: x86 s390 arm64
:Type: vm ioctl
:Parameters: struct kvm_irqfd (in)
:Returns: 0 on success, -1 on error
@@ -2808,7 +3133,7 @@ Note that closing the resamplefd is not sufficient to disable the
irqfd. The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
-On arm/arm64, gsi routing being supported, the following can happen:
+On arm64, gsi routing being supported, the following can happen:
- in case no routing entry is associated to this gsi, injection fails
- in case the gsi is associated to an irqchip routing entry,
@@ -3007,6 +3332,7 @@ number.
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
KVM_CAP_VCPU_ATTRIBUTES for vcpu device
+ KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error
@@ -3041,7 +3367,8 @@ transferred is defined by the particular attribute.
------------------------
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
- KVM_CAP_VCPU_ATTRIBUTES for vcpu device
+ KVM_CAP_VCPU_ATTRIBUTES for vcpu device
+ KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error
@@ -3062,7 +3389,7 @@ current state. "addr" is ignored.
----------------------
:Capability: basic
-:Architectures: arm, arm64
+:Architectures: arm64
:Type: vcpu ioctl
:Parameters: struct kvm_vcpu_init (in)
:Returns: 0 on success; -1 on error
@@ -3070,15 +3397,27 @@ current state. "addr" is ignored.
Errors:
====== =================================================================
-  EINVAL    the target is unknown, or the combination of features is invalid.
-  ENOENT    a features bit specified is unknown.
+ EINVAL the target is unknown, or the combination of features is invalid.
+ ENOENT a features bit specified is unknown.
====== =================================================================
This tells KVM what type of CPU to present to the guest, and what
-optional features it should have.  This will cause a reset of the cpu
-registers to their initial values.  If this is not called, KVM_RUN will
+optional features it should have. This will cause a reset of the cpu
+registers to their initial values. If this is not called, KVM_RUN will
return ENOEXEC for that vcpu.
+The initial values are defined as:
+ - Processor state:
+ * AArch64: EL1h, D, A, I and F bits set. All other bits
+ are cleared.
+ * AArch32: SVC, A, I and F bits set. All other bits are
+ cleared.
+ - General Purpose registers, including PC and SP: set to 0
+ - FPSIMD/NEON registers: set to 0
+ - SVE registers: set to 0
+ - System registers: Reset to their architecturally defined
+ values as for a warm reset to EL1 (resp. SVC)
+
Note that because some registers reflect machine topology, all vcpus
should be created before this ioctl is invoked.
@@ -3148,9 +3487,9 @@ Possible features:
-----------------------------
:Capability: basic
-:Architectures: arm, arm64
+:Architectures: arm64
:Type: vm ioctl
-:Parameters: struct struct kvm_vcpu_init (out)
+:Parameters: struct kvm_vcpu_init (out)
:Returns: 0 on success; -1 on error
Errors:
@@ -3170,14 +3509,14 @@ not mandatory.
The information returned by this ioctl can be used to prepare an instance
of struct kvm_vcpu_init for KVM_ARM_VCPU_INIT ioctl which will result in
-in VCPU matching underlying host.
+VCPU matching underlying host.
4.84 KVM_GET_REG_LIST
---------------------
:Capability: basic
-:Architectures: arm, arm64, mips
+:Architectures: arm64, mips
:Type: vcpu ioctl
:Parameters: struct kvm_reg_list (in/out)
:Returns: 0 on success; -1 on error
@@ -3185,8 +3524,8 @@ in VCPU matching underlying host.
Errors:
===== ==============================================================
-  E2BIG     the reg index list is too big to fit in the array specified by
-             the user (the number required will be written into n).
+ E2BIG the reg index list is too big to fit in the array specified by
+ the user (the number required will be written into n).
===== ==============================================================
::
@@ -3204,7 +3543,7 @@ KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
-----------------------------------------
:Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
-:Architectures: arm, arm64
+:Architectures: arm64
:Type: vm ioctl
:Parameters: struct kvm_arm_device_address (in)
:Returns: 0 on success, -1 on error
@@ -3231,13 +3570,13 @@ can access emulated or directly exposed devices, which the host kernel needs
to know about. The id field is an architecture specific identifier for a
specific device.
-ARM/arm64 divides the id field into two parts, a device id and an
+arm64 divides the id field into two parts, a device id and an
address type id specific to the individual device::
-  bits: | 63 ... 32 | 31 ... 16 | 15 ... 0 |
+ bits: | 63 ... 32 | 31 ... 16 | 15 ... 0 |
field: | 0x00000000 | device id | addr type id |
-ARM/arm64 currently only require this when using the in-kernel GIC
+arm64 currently only require this when using the in-kernel GIC
support for the hardware VGIC features, using KVM_ARM_DEVICE_VGIC_V2
as the device id. When setting the base address for the guest's
mapping of the VGIC virtual CPU and distributor interface, the ioctl
@@ -3298,10 +3637,12 @@ The top 16 bits of the control field are architecture specific control
flags which can include the following:
- KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
- - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64]
+ - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
+ - KVM_GUESTDBG_USE_HW: using hardware debug events [arm64]
- KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
- KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
- KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390]
+ - KVM_GUESTDBG_BLOCKIRQ: avoid injecting interrupts/NMI/SMI [x86]
For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
are enabled in memory so we need to ensure breakpoint exceptions are
@@ -3321,6 +3662,9 @@ indicating the number of supported registers.
For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether
the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported.
+Also when supported, KVM_CAP_SET_GUEST_DEBUG2 capability indicates the
+supported KVM_GUESTDBG_* bits in the control field.
+
When debug events exit the main run loop with the reason
KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
structure containing architecture specific debug information.
@@ -3347,8 +3691,8 @@ The member 'flags' is used for passing flags from userspace.
::
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0)
- #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1)
- #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2)
+ #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) /* deprecated */
+ #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) /* deprecated */
struct kvm_cpuid_entry2 {
__u32 function;
@@ -3394,13 +3738,6 @@ The fields in each entry are defined as follows:
KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
if the index field is valid
- KVM_CPUID_FLAG_STATEFUL_FUNC:
- if cpuid for this function returns different values for successive
- invocations; there will be several entries with the same function,
- all with this flag set
- KVM_CPUID_FLAG_STATE_READ_NEXT:
- for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
- the first entry to be read by a cpu
eax, ebx, ecx, edx:
@@ -3410,15 +3747,17 @@ The fields in each entry are defined as follows:
4.89 KVM_S390_MEM_OP
--------------------
-:Capability: KVM_CAP_S390_MEM_OP
+:Capability: KVM_CAP_S390_MEM_OP, KVM_CAP_S390_PROTECTED, KVM_CAP_S390_MEM_OP_EXTENSION
:Architectures: s390
-:Type: vcpu ioctl
+:Type: vm ioctl, vcpu ioctl
:Parameters: struct kvm_s390_mem_op (in)
:Returns: = 0 on success,
< 0 on generic error (e.g. -EFAULT or -ENOMEM),
> 0 if an exception occurred while walking the page tables
-Read or write data from/to the logical (virtual) memory of a VCPU.
+Read or write data from/to the VM's memory.
+The KVM_CAP_S390_MEM_OP_EXTENSION capability specifies what functionality is
+supported.
Parameters are specified via the following structure::
@@ -3428,33 +3767,105 @@ Parameters are specified via the following structure::
__u32 size; /* amount of bytes */
__u32 op; /* type of operation */
__u64 buf; /* buffer in userspace */
- __u8 ar; /* the access register number */
- __u8 reserved[31]; /* should be set to 0 */
+ union {
+ struct {
+ __u8 ar; /* the access register number */
+ __u8 key; /* access key, ignored if flag unset */
+ };
+ __u32 sida_offset; /* offset into the sida */
+ __u8 reserved[32]; /* ignored */
+ };
};
-The type of operation is specified in the "op" field. It is either
-KVM_S390_MEMOP_LOGICAL_READ for reading from logical memory space or
-KVM_S390_MEMOP_LOGICAL_WRITE for writing to logical memory space. The
-KVM_S390_MEMOP_F_CHECK_ONLY flag can be set in the "flags" field to check
-whether the corresponding memory access would create an access exception
-(without touching the data in the memory at the destination). In case an
-access exception occurred while walking the MMU tables of the guest, the
-ioctl returns a positive error number to indicate the type of exception.
-This exception is also raised directly at the corresponding VCPU if the
-flag KVM_S390_MEMOP_F_INJECT_EXCEPTION is set in the "flags" field.
-
The start address of the memory region has to be specified in the "gaddr"
field, and the length of the region in the "size" field (which must not
be 0). The maximum value for "size" can be obtained by checking the
KVM_CAP_S390_MEM_OP capability. "buf" is the buffer supplied by the
userspace application where the read data should be written to for
-KVM_S390_MEMOP_LOGICAL_READ, or where the data that should be written is
-stored for a KVM_S390_MEMOP_LOGICAL_WRITE. When KVM_S390_MEMOP_F_CHECK_ONLY
-is specified, "buf" is unused and can be NULL. "ar" designates the access
-register number to be used; the valid range is 0..15.
-
-The "reserved" field is meant for future extensions. It is not used by
-KVM with the currently defined set of flags.
+a read access, or where the data that should be written is stored for
+a write access. The "reserved" field is meant for future extensions.
+Reserved and unused values are ignored. Future extension that add members must
+introduce new flags.
+
+The type of operation is specified in the "op" field. Flags modifying
+their behavior can be set in the "flags" field. Undefined flag bits must
+be set to 0.
+
+Possible operations are:
+ * ``KVM_S390_MEMOP_LOGICAL_READ``
+ * ``KVM_S390_MEMOP_LOGICAL_WRITE``
+ * ``KVM_S390_MEMOP_ABSOLUTE_READ``
+ * ``KVM_S390_MEMOP_ABSOLUTE_WRITE``
+ * ``KVM_S390_MEMOP_SIDA_READ``
+ * ``KVM_S390_MEMOP_SIDA_WRITE``
+
+Logical read/write:
+^^^^^^^^^^^^^^^^^^^
+
+Access logical memory, i.e. translate the given guest address to an absolute
+address given the state of the VCPU and use the absolute address as target of
+the access. "ar" designates the access register number to be used; the valid
+range is 0..15.
+Logical accesses are permitted for the VCPU ioctl only.
+Logical accesses are permitted for non-protected guests only.
+
+Supported flags:
+ * ``KVM_S390_MEMOP_F_CHECK_ONLY``
+ * ``KVM_S390_MEMOP_F_INJECT_EXCEPTION``
+ * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
+
+The KVM_S390_MEMOP_F_CHECK_ONLY flag can be set to check whether the
+corresponding memory access would cause an access exception; however,
+no actual access to the data in memory at the destination is performed.
+In this case, "buf" is unused and can be NULL.
+
+In case an access exception occurred during the access (or would occur
+in case of KVM_S390_MEMOP_F_CHECK_ONLY), the ioctl returns a positive
+error number indicating the type of exception. This exception is also
+raised directly at the corresponding VCPU if the flag
+KVM_S390_MEMOP_F_INJECT_EXCEPTION is set.
+On protection exceptions, unless specified otherwise, the injected
+translation-exception identifier (TEID) indicates suppression.
+
+If the KVM_S390_MEMOP_F_SKEY_PROTECTION flag is set, storage key
+protection is also in effect and may cause exceptions if accesses are
+prohibited given the access key designated by "key"; the valid range is 0..15.
+KVM_S390_MEMOP_F_SKEY_PROTECTION is available if KVM_CAP_S390_MEM_OP_EXTENSION
+is > 0.
+Since the accessed memory may span multiple pages and those pages might have
+different storage keys, it is possible that a protection exception occurs
+after memory has been modified. In this case, if the exception is injected,
+the TEID does not indicate suppression.
+
+Absolute read/write:
+^^^^^^^^^^^^^^^^^^^^
+
+Access absolute memory. This operation is intended to be used with the
+KVM_S390_MEMOP_F_SKEY_PROTECTION flag, to allow accessing memory and performing
+the checks required for storage key protection as one operation (as opposed to
+user space getting the storage keys, performing the checks, and accessing
+memory thereafter, which could lead to a delay between check and access).
+Absolute accesses are permitted for the VM ioctl if KVM_CAP_S390_MEM_OP_EXTENSION
+is > 0.
+Currently absolute accesses are not permitted for VCPU ioctls.
+Absolute accesses are permitted for non-protected guests only.
+
+Supported flags:
+ * ``KVM_S390_MEMOP_F_CHECK_ONLY``
+ * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
+
+The semantics of the flags are as for logical accesses.
+
+SIDA read/write:
+^^^^^^^^^^^^^^^^
+
+Access the secure instruction data area which contains memory operands necessary
+for instruction emulation for protected guests.
+SIDA accesses are available if the KVM_CAP_S390_PROTECTED capability is available.
+SIDA accesses are permitted for the VCPU ioctl only.
+SIDA accesses are permitted for protected guests only.
+
+No flags are supported.
4.90 KVM_S390_GET_SKEYS
-----------------------
@@ -3463,7 +3874,7 @@ KVM with the currently defined set of flags.
:Architectures: s390
:Type: vm ioctl
:Parameters: struct kvm_s390_skeys
-:Returns: 0 on success, KVM_S390_GET_KEYS_NONE if guest is not using storage
+:Returns: 0 on success, KVM_S390_GET_SKEYS_NONE if guest is not using storage
keys, negative value on error
This ioctl is used to get guest storage key values on the s390
@@ -3482,7 +3893,7 @@ you want to get.
The count field is the number of consecutive frames (starting from start_gfn)
whose storage keys to get. The count field must be at least 1 and the maximum
-allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
+allowed value is defined as KVM_S390_SKEYS_MAX. Values outside this range
will cause the ioctl to return -EINVAL.
The skeydata_addr field is the address to a buffer large enough to hold count
@@ -3506,7 +3917,7 @@ you want to set.
The count field is the number of consecutive frames (starting from start_gfn)
whose storage keys to get. The count field must be at least 1 and the maximum
-allowed value is defined as KVM_S390_SKEYS_ALLOC_MAX. Values outside this range
+allowed value is defined as KVM_S390_SKEYS_MAX. Values outside this range
will cause the ioctl to return -EINVAL.
The skeydata_addr field is the address to a buffer containing count bytes of
@@ -3660,31 +4071,112 @@ which is the maximum number of possibly pending cpu-local interrupts.
Queues an SMI on the thread's vcpu.
-4.97 KVM_CAP_PPC_MULTITCE
--------------------------
+4.97 KVM_X86_SET_MSR_FILTER
+----------------------------
-:Capability: KVM_CAP_PPC_MULTITCE
-:Architectures: ppc
-:Type: vm
+:Capability: KVM_CAP_X86_MSR_FILTER
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_msr_filter
+:Returns: 0 on success, < 0 on error
-This capability means the kernel is capable of handling hypercalls
-H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
-space. This significantly accelerates DMA operations for PPC KVM guests.
-User space should expect that its handlers for these hypercalls
-are not going to be called if user space previously registered LIOBN
-in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+::
-In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
-user space might have to advertise it for the guest. For example,
-IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
-present in the "ibm,hypertas-functions" device-tree property.
+ struct kvm_msr_filter_range {
+ #define KVM_MSR_FILTER_READ (1 << 0)
+ #define KVM_MSR_FILTER_WRITE (1 << 1)
+ __u32 flags;
+ __u32 nmsrs; /* number of msrs in bitmap */
+ __u32 base; /* MSR index the bitmap starts at */
+ __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
+ };
-The hypercalls mentioned above may or may not be processed successfully
-in the kernel based fast path. If they can not be handled by the kernel,
-they will get passed on to user space. So user space still has to have
-an implementation for these despite the in kernel acceleration.
+ #define KVM_MSR_FILTER_MAX_RANGES 16
+ struct kvm_msr_filter {
+ #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
+ #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
+ __u32 flags;
+ struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
+ };
-This capability is always enabled.
+flags values for ``struct kvm_msr_filter_range``:
+
+``KVM_MSR_FILTER_READ``
+
+ Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
+ indicates that a read should immediately fail, while a 1 indicates that
+ a read for a particular MSR should be handled regardless of the default
+ filter action.
+
+``KVM_MSR_FILTER_WRITE``
+
+ Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
+ indicates that a write should immediately fail, while a 1 indicates that
+ a write for a particular MSR should be handled regardless of the default
+ filter action.
+
+``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
+
+ Filter both read and write accesses to MSRs using the given bitmap. A 0
+ in the bitmap indicates that both reads and writes should immediately fail,
+ while a 1 indicates that reads and writes for a particular MSR are not
+ filtered by this range.
+
+flags values for ``struct kvm_msr_filter``:
+
+``KVM_MSR_FILTER_DEFAULT_ALLOW``
+
+ If no filter range matches an MSR index that is getting accessed, KVM will
+ fall back to allowing access to the MSR.
+
+``KVM_MSR_FILTER_DEFAULT_DENY``
+
+ If no filter range matches an MSR index that is getting accessed, KVM will
+ fall back to rejecting access to the MSR. In this mode, all MSRs that should
+ be processed by KVM need to explicitly be marked as allowed in the bitmaps.
+
+This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
+specify whether a certain MSR access should be explicitly filtered for or not.
+
+If this ioctl has never been invoked, MSR accesses are not guarded and the
+default KVM in-kernel emulation behavior is fully preserved.
+
+Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
+filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
+an error.
+
+As soon as the filtering is in place, every MSR access is processed through
+the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
+x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
+and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
+register.
+
+.. warning::
+ MSR accesses coming from nested vmentry/vmexit are not filtered.
+ This includes both writes to individual VMCS fields and reads/writes
+ through the MSR lists pointed to by the VMCS.
+
+If a bit is within one of the defined ranges, read and write accesses are
+guarded by the bitmap's value for the MSR index if the kind of access
+is included in the ``struct kvm_msr_filter_range`` flags. If no range
+cover this particular access, the behavior is determined by the flags
+field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
+and ``KVM_MSR_FILTER_DEFAULT_DENY``.
+
+Each bitmap range specifies a range of MSRs to potentially allow access on.
+The range goes from MSR index [base .. base+nmsrs]. The flags field
+indicates whether reads, writes or both reads and writes are filtered
+by setting a 1 bit in the bitmap for the corresponding MSR index.
+
+If an MSR access is not permitted through the filtering, it generates a
+#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
+allows user space to deflect and potentially handle various MSR accesses
+into user space.
+
+Note, invoking this ioctl while a vCPU is running is inherently racy. However,
+KVM does guarantee that vCPUs will see either the previous filter or the new
+filter, e.g. MSRs with identical settings in both the old and new filter will
+have deterministic behavior.
4.98 KVM_CREATE_SPAPR_TCE_64
----------------------------
@@ -3830,49 +4322,20 @@ base 2 of the page size in the bottom 6 bits.
-EFAULT if struct kvm_reinject_control cannot be read,
-EINVAL if the supplied shift or flags are invalid,
-ENOMEM if unable to allocate the new HPT,
- -ENOSPC if there was a hash collision
-
-::
-
- struct kvm_ppc_rmmu_info {
- struct kvm_ppc_radix_geom {
- __u8 page_shift;
- __u8 level_bits[4];
- __u8 pad[3];
- } geometries[8];
- __u32 ap_encodings[8];
- };
-
-The geometries[] field gives up to 8 supported geometries for the
-radix page table, in terms of the log base 2 of the smallest page
-size, and the number of bits indexed at each level of the tree, from
-the PTE level up to the PGD level in that order. Any unused entries
-will have 0 in the page_shift field.
-
-The ap_encodings gives the supported page sizes and their AP field
-encodings, encoded with the AP value in the top 3 bits and the log
-base 2 of the page size in the bottom 6 bits.
-
-4.102 KVM_PPC_RESIZE_HPT_PREPARE
---------------------------------
-
-:Capability: KVM_CAP_SPAPR_RESIZE_HPT
-:Architectures: powerpc
-:Type: vm ioctl
-:Parameters: struct kvm_ppc_resize_hpt (in)
-:Returns: 0 on successful completion,
- >0 if a new HPT is being prepared, the value is an estimated
- number of milliseconds until preparation is complete,
- -EFAULT if struct kvm_reinject_control cannot be read,
- -EINVAL if the supplied shift or flags are invalid,when moving existing
- HPT entries to the new HPT,
- -EIO on other error conditions
Used to implement the PAPR extension for runtime resizing of a guest's
Hashed Page Table (HPT). Specifically this starts, stops or monitors
the preparation of a new potential HPT for the guest, essentially
implementing the H_RESIZE_HPT_PREPARE hypercall.
+::
+
+ struct kvm_ppc_resize_hpt {
+ __u64 flags;
+ __u32 shift;
+ __u32 pad;
+ };
+
If called with shift > 0 when there is no pending HPT for the guest,
this begins preparation of a new pending HPT of size 2^(shift) bytes.
It then returns a positive integer with the estimated number of
@@ -3900,14 +4363,6 @@ Normally this will be called repeatedly with the same parameters until
it returns <= 0. The first call will initiate preparation, subsequent
ones will monitor preparation until it completes or fails.
-::
-
- struct kvm_ppc_resize_hpt {
- __u64 flags;
- __u32 shift;
- __u32 pad;
- };
-
4.103 KVM_PPC_RESIZE_HPT_COMMIT
-------------------------------
@@ -3930,6 +4385,14 @@ Hashed Page Table (HPT). Specifically this requests that the guest be
transferred to working with the new HPT, essentially implementing the
H_RESIZE_HPT_COMMIT hypercall.
+::
+
+ struct kvm_ppc_resize_hpt {
+ __u64 flags;
+ __u32 shift;
+ __u32 pad;
+ };
+
This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
returned 0 with the same parameters. In other cases
KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
@@ -3945,14 +4408,6 @@ HPT and the previous HPT will be discarded.
On failure, the guest will still be operating on its previous HPT.
-::
-
- struct kvm_ppc_resize_hpt {
- __u64 flags;
- __u32 shift;
- __u32 pad;
- };
-
4.104 KVM_X86_GET_MCE_CAP_SUPPORTED
-----------------------------------
@@ -4214,7 +4669,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
:Capability: basic
:Architectures: x86
-:Type: system
+:Type: vm
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error
@@ -4224,7 +4679,7 @@ encrypted VMs.
Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
-Documentation/virt/kvm/amd-memory-encryption.rst.
+Documentation/virt/kvm/x86/amd-memory-encryption.rst.
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
@@ -4346,6 +4801,8 @@ Errors:
#define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE 0x00000001
#define KVM_STATE_NESTED_VMX_SMM_VMXON 0x00000002
+ #define KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE 0x00000001
+
struct kvm_vmx_nested_state_hdr {
__u64 vmxon_pa;
__u64 vmcs12_pa;
@@ -4353,6 +4810,9 @@ Errors:
struct {
__u16 flags;
} smm;
+
+ __u32 flags;
+ __u64 preemption_timer_deadline;
};
struct kvm_vmx_nested_state_data {
@@ -4411,9 +4871,9 @@ to I/O ports.
------------------------------------
:Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
-:Architectures: x86, arm, arm64, mips
+:Architectures: x86, arm64, mips
:Type: vm ioctl
-:Parameters: struct kvm_dirty_log (in)
+:Parameters: struct kvm_clear_dirty_log (in)
:Returns: 0 on success, -1 on error
::
@@ -4440,10 +4900,9 @@ in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
(for example via write-protection, or by clearing the dirty bit in
a page table entry).
-If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specifies
-the address space for which you want to return the dirty bitmap.
-They must be less than the value that KVM_CHECK_EXTENSION returns for
-the KVM_CAP_MULTI_ADDRESS_SPACE capability.
+If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of slot field specifies
+the address space for which you want to clear the dirty status. See
+KVM_SET_USER_MEMORY_REGION for details on the usage of slot field.
This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
is enabled; for more information, see the description of the capability.
@@ -4453,9 +4912,9 @@ that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
4.118 KVM_GET_SUPPORTED_HV_CPUID
--------------------------------
-:Capability: KVM_CAP_HYPERV_CPUID
+:Capability: KVM_CAP_HYPERV_CPUID (vcpu), KVM_CAP_SYS_HYPERV_CPUID (system)
:Architectures: x86
-:Type: vcpu ioctl
+:Type: system ioctl, vcpu ioctl
:Parameters: struct kvm_cpuid2 (in/out)
:Returns: 0 on success, -1 on error
@@ -4489,6 +4948,7 @@ KVM_GET_SUPPORTED_CPUID ioctl because some of them intersect with KVM feature
leaves (0x40000000, 0x40000001).
Currently, the following list of CPUID leaves are returned:
+
- HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
- HYPERV_CPUID_INTERFACE
- HYPERV_CPUID_VERSION
@@ -4496,11 +4956,11 @@ Currently, the following list of CPUID leaves are returned:
- HYPERV_CPUID_ENLIGHTMENT_INFO
- HYPERV_CPUID_IMPLEMENT_LIMITS
- HYPERV_CPUID_NESTED_FEATURES
+ - HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS
+ - HYPERV_CPUID_SYNDBG_INTERFACE
+ - HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES
-HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
-enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
-
-Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
+Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure
with the 'nent' field indicating the number of entries in the variable-size
array 'entries'. If the number of entries is too low to describe all Hyper-V
feature leaves, an error (E2BIG) is returned. If the number is more or equal
@@ -4510,10 +4970,20 @@ number of valid entries in the 'entries' array, which is then filled.
'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
userspace should not expect to get any particular value there.
+Note, vcpu version of KVM_GET_SUPPORTED_HV_CPUID is currently deprecated. Unlike
+system ioctl which exposes all supported feature bits unconditionally, vcpu
+version has the following quirks:
+
+- HYPERV_CPUID_NESTED_FEATURES leaf and HV_X64_ENLIGHTENED_VMCS_RECOMMENDED
+ feature bit are only exposed when Enlightened VMCS was previously enabled
+ on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
+- HV_STIMER_DIRECT_MODE_AVAILABLE bit is only exposed with in-kernel LAPIC.
+ (presumes KVM_CREATE_IRQCHIP has already been called).
+
4.119 KVM_ARM_VCPU_FINALIZE
---------------------------
-:Architectures: arm, arm64
+:Architectures: arm64
:Type: vcpu ioctl
:Parameters: int feature (in)
:Returns: 0 on success, -1 on error
@@ -4649,6 +5119,797 @@ the clear cpu reset definition in the POP. However, the cpu is not put
into ESA mode. This reset is a superset of the initial reset.
+4.125 KVM_S390_PV_COMMAND
+-------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_pv_cmd
+:Returns: 0 on success, < 0 on error
+
+::
+
+ struct kvm_pv_cmd {
+ __u32 cmd; /* Command to be executed */
+ __u16 rc; /* Ultravisor return code */
+ __u16 rrc; /* Ultravisor return reason code */
+ __u64 data; /* Data or address */
+ __u32 flags; /* flags for future extensions. Must be 0 for now */
+ __u32 reserved[3];
+ };
+
+**Ultravisor return codes**
+The Ultravisor return (reason) codes are provided by the kernel if a
+Ultravisor call has been executed to achieve the results expected by
+the command. Therefore they are independent of the IOCTL return
+code. If KVM changes `rc`, its value will always be greater than 0
+hence setting it to 0 before issuing a PV command is advised to be
+able to detect a change of `rc`.
+
+**cmd values:**
+
+KVM_PV_ENABLE
+ Allocate memory and register the VM with the Ultravisor, thereby
+ donating memory to the Ultravisor that will become inaccessible to
+ KVM. All existing CPUs are converted to protected ones. After this
+ command has succeeded, any CPU added via hotplug will become
+ protected during its creation as well.
+
+ Errors:
+
+ ===== =============================
+ EINTR an unmasked signal is pending
+ ===== =============================
+
+KVM_PV_DISABLE
+ Deregister the VM from the Ultravisor and reclaim the memory that
+ had been donated to the Ultravisor, making it usable by the kernel
+ again. All registered VCPUs are converted back to non-protected
+ ones.
+
+KVM_PV_VM_SET_SEC_PARMS
+ Pass the image header from VM memory to the Ultravisor in
+ preparation of image unpacking and verification.
+
+KVM_PV_VM_UNPACK
+ Unpack (protect and decrypt) a page of the encrypted boot image.
+
+KVM_PV_VM_VERIFY
+ Verify the integrity of the unpacked image. Only if this succeeds,
+ KVM is allowed to start protected VCPUs.
+
+KVM_PV_INFO
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
+
+ Presents an API that provides Ultravisor related data to userspace
+ via subcommands. len_max is the size of the user space buffer,
+ len_written is KVM's indication of how much bytes of that buffer
+ were actually written to. len_written can be used to determine the
+ valid fields if more response fields are added in the future.
+
+ ::
+
+ enum pv_cmd_info_id {
+ KVM_PV_INFO_VM,
+ KVM_PV_INFO_DUMP,
+ };
+
+ struct kvm_s390_pv_info_header {
+ __u32 id;
+ __u32 len_max;
+ __u32 len_written;
+ __u32 reserved;
+ };
+
+ struct kvm_s390_pv_info {
+ struct kvm_s390_pv_info_header header;
+ struct kvm_s390_pv_info_dump dump;
+ struct kvm_s390_pv_info_vm vm;
+ };
+
+**subcommands:**
+
+ KVM_PV_INFO_VM
+ This subcommand provides basic Ultravisor information for PV
+ hosts. These values are likely also exported as files in the sysfs
+ firmware UV query interface but they are more easily available to
+ programs in this API.
+
+ The installed calls and feature_indication members provide the
+ installed UV calls and the UV's other feature indications.
+
+ The max_* members provide information about the maximum number of PV
+ vcpus, PV guests and PV guest memory size.
+
+ ::
+
+ struct kvm_s390_pv_info_vm {
+ __u64 inst_calls_list[4];
+ __u64 max_cpus;
+ __u64 max_guests;
+ __u64 max_guest_addr;
+ __u64 feature_indication;
+ };
+
+
+ KVM_PV_INFO_DUMP
+ This subcommand provides information related to dumping PV guests.
+
+ ::
+
+ struct kvm_s390_pv_info_dump {
+ __u64 dump_cpu_buffer_len;
+ __u64 dump_config_mem_buffer_per_1m;
+ __u64 dump_config_finalize_len;
+ };
+
+KVM_PV_DUMP
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
+
+ Presents an API that provides calls which facilitate dumping a
+ protected VM.
+
+ ::
+
+ struct kvm_s390_pv_dmp {
+ __u64 subcmd;
+ __u64 buff_addr;
+ __u64 buff_len;
+ __u64 gaddr; /* For dump storage state */
+ };
+
+ **subcommands:**
+
+ KVM_PV_DUMP_INIT
+ Initializes the dump process of a protected VM. If this call does
+ not succeed all other subcommands will fail with -EINVAL. This
+ subcommand will return -EINVAL if a dump process has not yet been
+ completed.
+
+ Not all PV vms can be dumped, the owner needs to set `dump
+ allowed` PCF bit 34 in the SE header to allow dumping.
+
+ KVM_PV_DUMP_CONFIG_STOR_STATE
+ Stores `buff_len` bytes of tweak component values starting with
+ the 1MB block specified by the absolute guest address
+ (`gaddr`). `buff_len` needs to be `conf_dump_storage_state_len`
+ aligned and at least >= the `conf_dump_storage_state_len` value
+ provided by the dump uv_info data. buff_user might be written to
+ even if an error rc is returned. For instance if we encounter a
+ fault after writing the first page of data.
+
+ KVM_PV_DUMP_COMPLETE
+ If the subcommand succeeds it completes the dump process and lets
+ KVM_PV_DUMP_INIT be called again.
+
+ On success `conf_dump_finalize_len` bytes of completion data will be
+ stored to the `buff_addr`. The completion data contains a key
+ derivation seed, IV, tweak nonce and encryption keys as well as an
+ authentication tag all of which are needed to decrypt the dump at a
+ later time.
+
+4.126 KVM_XEN_HVM_SET_ATTR
+--------------------------
+
+:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_xen_hvm_attr
+:Returns: 0 on success, < 0 on error
+
+::
+
+ struct kvm_xen_hvm_attr {
+ __u16 type;
+ __u16 pad[3];
+ union {
+ __u8 long_mode;
+ __u8 vector;
+ struct {
+ __u64 gfn;
+ } shared_info;
+ struct {
+ __u32 send_port;
+ __u32 type; /* EVTCHNSTAT_ipi / EVTCHNSTAT_interdomain */
+ __u32 flags;
+ union {
+ struct {
+ __u32 port;
+ __u32 vcpu;
+ __u32 priority;
+ } port;
+ struct {
+ __u32 port; /* Zero for eventfd */
+ __s32 fd;
+ } eventfd;
+ __u32 padding[4];
+ } deliver;
+ } evtchn;
+ __u32 xen_version;
+ __u64 pad[8];
+ } u;
+ };
+
+type values:
+
+KVM_XEN_ATTR_TYPE_LONG_MODE
+ Sets the ABI mode of the VM to 32-bit or 64-bit (long mode). This
+ determines the layout of the shared info pages exposed to the VM.
+
+KVM_XEN_ATTR_TYPE_SHARED_INFO
+ Sets the guest physical frame number at which the Xen "shared info"
+ page resides. Note that although Xen places vcpu_info for the first
+ 32 vCPUs in the shared_info page, KVM does not automatically do so
+ and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
+ explicitly even when the vcpu_info for a given vCPU resides at the
+ "default" location in the shared_info page. This is because KVM is
+ not aware of the Xen CPU id which is used as the index into the
+ vcpu_info[] array, so cannot know the correct default location.
+
+ Note that the shared info page may be constantly written to by KVM;
+ it contains the event channel bitmap used to deliver interrupts to
+ a Xen guest, amongst other things. It is exempt from dirty tracking
+ mechanisms — KVM will not explicitly mark the page as dirty each
+ time an event channel interrupt is delivered to the guest! Thus,
+ userspace should always assume that the designated GFN is dirty if
+ any vCPU has been running or any event channel interrupts can be
+ routed to the guest.
+
+KVM_XEN_ATTR_TYPE_UPCALL_VECTOR
+ Sets the exception vector used to deliver Xen event channel upcalls.
+ This is the HVM-wide vector injected directly by the hypervisor
+ (not through the local APIC), typically configured by a guest via
+ HVM_PARAM_CALLBACK_IRQ.
+
+KVM_XEN_ATTR_TYPE_EVTCHN
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures
+ an outbound port number for interception of EVTCHNOP_send requests
+ from the guest. A given sending port number may be directed back
+ to a specified vCPU (by APIC ID) / port / priority on the guest,
+ or to trigger events on an eventfd. The vCPU and priority can be
+ changed by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call,
+ but other fields cannot change for a given sending port. A port
+ mapping is removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags
+ field.
+
+KVM_XEN_ATTR_TYPE_XEN_VERSION
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures
+ the 32-bit version code returned to the guest when it invokes the
+ XENVER_version call; typically (XEN_MAJOR << 16 | XEN_MINOR). PV
+ Xen guests will often use this to as a dummy hypercall to trigger
+ event channel delivery, so responding within the kernel without
+ exiting to userspace is beneficial.
+
+4.127 KVM_XEN_HVM_GET_ATTR
+--------------------------
+
+:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_xen_hvm_attr
+:Returns: 0 on success, < 0 on error
+
+Allows Xen VM attributes to be read. For the structure and types,
+see KVM_XEN_HVM_SET_ATTR above. The KVM_XEN_ATTR_TYPE_EVTCHN
+attribute cannot be read.
+
+4.128 KVM_XEN_VCPU_SET_ATTR
+---------------------------
+
+:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xen_vcpu_attr
+:Returns: 0 on success, < 0 on error
+
+::
+
+ struct kvm_xen_vcpu_attr {
+ __u16 type;
+ __u16 pad[3];
+ union {
+ __u64 gpa;
+ __u64 pad[4];
+ struct {
+ __u64 state;
+ __u64 state_entry_time;
+ __u64 time_running;
+ __u64 time_runnable;
+ __u64 time_blocked;
+ __u64 time_offline;
+ } runstate;
+ __u32 vcpu_id;
+ struct {
+ __u32 port;
+ __u32 priority;
+ __u64 expires_ns;
+ } timer;
+ __u8 vector;
+ } u;
+ };
+
+type values:
+
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
+ Sets the guest physical address of the vcpu_info for a given vCPU.
+ As with the shared_info page for the VM, the corresponding page may be
+ dirtied at any time if event channel interrupt delivery is enabled, so
+ userspace should always assume that the page is dirty without relying
+ on dirty logging.
+
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
+ Sets the guest physical address of an additional pvclock structure
+ for a given vCPU. This is typically used for guest vsyscall support.
+
+KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR
+ Sets the guest physical address of the vcpu_runstate_info for a given
+ vCPU. This is how a Xen guest tracks CPU state such as steal time.
+
+KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT
+ Sets the runstate (RUNSTATE_running/_runnable/_blocked/_offline) of
+ the given vCPU from the .u.runstate.state member of the structure.
+ KVM automatically accounts running and runnable time but blocked
+ and offline states are only entered explicitly.
+
+KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA
+ Sets all fields of the vCPU runstate data from the .u.runstate member
+ of the structure, including the current runstate. The state_entry_time
+ must equal the sum of the other four times.
+
+KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST
+ This *adds* the contents of the .u.runstate members of the structure
+ to the corresponding members of the given vCPU's runstate data, thus
+ permitting atomic adjustments to the runstate times. The adjustment
+ to the state_entry_time must equal the sum of the adjustments to the
+ other four times. The state field must be set to -1, or to a valid
+ runstate value (RUNSTATE_running, RUNSTATE_runnable, RUNSTATE_blocked
+ or RUNSTATE_offline) to set the current accounted state as of the
+ adjusted state_entry_time.
+
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the Xen
+ vCPU ID of the given vCPU, to allow timer-related VCPU operations to
+ be intercepted by KVM.
+
+KVM_XEN_VCPU_ATTR_TYPE_TIMER
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the
+ event channel port/priority for the VIRQ_TIMER of the vCPU, as well
+ as allowing a pending timer to be saved/restored.
+
+KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the
+ per-vCPU local APIC upcall vector, configured by a Xen guest with
+ the HVMOP_set_evtchn_upcall_vector hypercall. This is typically
+ used by Windows guests, and is distinct from the HVM-wide upcall
+ vector configured with HVM_PARAM_CALLBACK_IRQ.
+
+
+4.129 KVM_XEN_VCPU_GET_ATTR
+---------------------------
+
+:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xen_vcpu_attr
+:Returns: 0 on success, < 0 on error
+
+Allows Xen vCPU attributes to be read. For the structure and types,
+see KVM_XEN_VCPU_SET_ATTR above.
+
+The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
+with the KVM_XEN_VCPU_GET_ATTR ioctl.
+
+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+ arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+ struct kvm_arm_copy_mte_tags {
+ __u64 guest_ipa;
+ __u64 length;
+ void __user *addr;
+ __u64 flags;
+ __u64 reserved[2];
+ };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
+4.131 KVM_GET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+This ioctl (when supported) replaces the KVM_GET_SREGS.
+
+::
+
+ struct kvm_sregs2 {
+ /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+ __u64 flags;
+ __u64 pdptrs[4];
+ };
+
+flags values for ``kvm_sregs2``:
+
+``KVM_SREGS2_FLAGS_PDPTRS_VALID``
+
+ Indicates thats the struct contain valid PDPTR values.
+
+
+4.132 KVM_SET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.
+See KVM_GET_SREGS2 for the data structures.
+This ioctl (when supported) replaces the KVM_SET_SREGS.
+
+4.133 KVM_GET_STATS_FD
+----------------------
+
+:Capability: KVM_CAP_STATS_BINARY_FD
+:Architectures: all
+:Type: vm ioctl, vcpu ioctl
+:Parameters: none
+:Returns: statistics file descriptor on success, < 0 on error
+
+Errors:
+
+ ====== ======================================================
+ ENOMEM if the fd could not be created due to lack of memory
+ EMFILE if the number of opened files exceeds the limit
+ ====== ======================================================
+
+The returned file descriptor can be used to read VM/vCPU statistics data in
+binary format. The data in the file descriptor consists of four blocks
+organized as follows:
+
++-------------+
+| Header |
++-------------+
+| id string |
++-------------+
+| Descriptors |
++-------------+
+| Stats Data |
++-------------+
+
+Apart from the header starting at offset 0, please be aware that it is
+not guaranteed that the four blocks are adjacent or in the above order;
+the offsets of the id, descriptors and data blocks are found in the
+header. However, all four blocks are aligned to 64 bit offsets in the
+file and they do not overlap.
+
+All blocks except the data block are immutable. Userspace can read them
+only one time after retrieving the file descriptor, and then use ``pread`` or
+``lseek`` to read the statistics repeatedly.
+
+All data is in system endianness.
+
+The format of the header is as follows::
+
+ struct kvm_stats_header {
+ __u32 flags;
+ __u32 name_size;
+ __u32 num_desc;
+ __u32 id_offset;
+ __u32 desc_offset;
+ __u32 data_offset;
+ };
+
+The ``flags`` field is not used at the moment. It is always read as 0.
+
+The ``name_size`` field is the size (in byte) of the statistics name string
+(including trailing '\0') which is contained in the "id string" block and
+appended at the end of every descriptor.
+
+The ``num_desc`` field is the number of descriptors that are included in the
+descriptor block. (The actual number of values in the data block may be
+larger, since each descriptor may comprise more than one value).
+
+The ``id_offset`` field is the offset of the id string from the start of the
+file indicated by the file descriptor. It is a multiple of 8.
+
+The ``desc_offset`` field is the offset of the Descriptors block from the start
+of the file indicated by the file descriptor. It is a multiple of 8.
+
+The ``data_offset`` field is the offset of the Stats Data block from the start
+of the file indicated by the file descriptor. It is a multiple of 8.
+
+The id string block contains a string which identifies the file descriptor on
+which KVM_GET_STATS_FD was invoked. The size of the block, including the
+trailing ``'\0'``, is indicated by the ``name_size`` field in the header.
+
+The descriptors block is only needed to be read once for the lifetime of the
+file descriptor contains a sequence of ``struct kvm_stats_desc``, each followed
+by a string of size ``name_size``.
+::
+
+ #define KVM_STATS_TYPE_SHIFT 0
+ #define KVM_STATS_TYPE_MASK (0xF << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_LINEAR_HIST (0x3 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_LOG_HIST (0x4 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_LOG_HIST
+
+ #define KVM_STATS_UNIT_SHIFT 4
+ #define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_NONE (0x0 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_BYTES (0x1 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_SECONDS (0x2 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_CYCLES (0x3 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_BOOLEAN (0x4 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_MAX KVM_STATS_UNIT_BOOLEAN
+
+ #define KVM_STATS_BASE_SHIFT 8
+ #define KVM_STATS_BASE_MASK (0xF << KVM_STATS_BASE_SHIFT)
+ #define KVM_STATS_BASE_POW10 (0x0 << KVM_STATS_BASE_SHIFT)
+ #define KVM_STATS_BASE_POW2 (0x1 << KVM_STATS_BASE_SHIFT)
+ #define KVM_STATS_BASE_MAX KVM_STATS_BASE_POW2
+
+ struct kvm_stats_desc {
+ __u32 flags;
+ __s16 exponent;
+ __u16 size;
+ __u32 offset;
+ __u32 bucket_size;
+ char name[];
+ };
+
+The ``flags`` field contains the type and unit of the statistics data described
+by this descriptor. Its endianness is CPU native.
+The following flags are supported:
+
+Bits 0-3 of ``flags`` encode the type:
+
+ * ``KVM_STATS_TYPE_CUMULATIVE``
+ The statistics reports a cumulative count. The value of data can only be increased.
+ Most of the counters used in KVM are of this type.
+ The corresponding ``size`` field for this type is always 1.
+ All cumulative statistics data are read/write.
+ * ``KVM_STATS_TYPE_INSTANT``
+ The statistics reports an instantaneous value. Its value can be increased or
+ decreased. This type is usually used as a measurement of some resources,
+ like the number of dirty pages, the number of large pages, etc.
+ All instant statistics are read only.
+ The corresponding ``size`` field for this type is always 1.
+ * ``KVM_STATS_TYPE_PEAK``
+ The statistics data reports a peak value, for example the maximum number
+ of items in a hash table bucket, the longest time waited and so on.
+ The value of data can only be increased.
+ The corresponding ``size`` field for this type is always 1.
+ * ``KVM_STATS_TYPE_LINEAR_HIST``
+ The statistic is reported as a linear histogram. The number of
+ buckets is specified by the ``size`` field. The size of buckets is specified
+ by the ``hist_param`` field. The range of the Nth bucket (1 <= N < ``size``)
+ is [``hist_param``*(N-1), ``hist_param``*N), while the range of the last
+ bucket is [``hist_param``*(``size``-1), +INF). (+INF means positive infinity
+ value.)
+ * ``KVM_STATS_TYPE_LOG_HIST``
+ The statistic is reported as a logarithmic histogram. The number of
+ buckets is specified by the ``size`` field. The range of the first bucket is
+ [0, 1), while the range of the last bucket is [pow(2, ``size``-2), +INF).
+ Otherwise, The Nth bucket (1 < N < ``size``) covers
+ [pow(2, N-2), pow(2, N-1)).
+
+Bits 4-7 of ``flags`` encode the unit:
+
+ * ``KVM_STATS_UNIT_NONE``
+ There is no unit for the value of statistics data. This usually means that
+ the value is a simple counter of an event.
+ * ``KVM_STATS_UNIT_BYTES``
+ It indicates that the statistics data is used to measure memory size, in the
+ unit of Byte, KiByte, MiByte, GiByte, etc. The unit of the data is
+ determined by the ``exponent`` field in the descriptor.
+ * ``KVM_STATS_UNIT_SECONDS``
+ It indicates that the statistics data is used to measure time or latency.
+ * ``KVM_STATS_UNIT_CYCLES``
+ It indicates that the statistics data is used to measure CPU clock cycles.
+ * ``KVM_STATS_UNIT_BOOLEAN``
+ It indicates that the statistic will always be either 0 or 1. Boolean
+ statistics of "peak" type will never go back from 1 to 0. Boolean
+ statistics can be linear histograms (with two buckets) but not logarithmic
+ histograms.
+
+Note that, in the case of histograms, the unit applies to the bucket
+ranges, while the bucket value indicates how many samples fell in the
+bucket's range.
+
+Bits 8-11 of ``flags``, together with ``exponent``, encode the scale of the
+unit:
+
+ * ``KVM_STATS_BASE_POW10``
+ The scale is based on power of 10. It is used for measurement of time and
+ CPU clock cycles. For example, an exponent of -9 can be used with
+ ``KVM_STATS_UNIT_SECONDS`` to express that the unit is nanoseconds.
+ * ``KVM_STATS_BASE_POW2``
+ The scale is based on power of 2. It is used for measurement of memory size.
+ For example, an exponent of 20 can be used with ``KVM_STATS_UNIT_BYTES`` to
+ express that the unit is MiB.
+
+The ``size`` field is the number of values of this statistics data. Its
+value is usually 1 for most of simple statistics. 1 means it contains an
+unsigned 64bit data.
+
+The ``offset`` field is the offset from the start of Data Block to the start of
+the corresponding statistics data.
+
+The ``bucket_size`` field is used as a parameter for histogram statistics data.
+It is only used by linear histogram statistics data, specifying the size of a
+bucket in the unit expressed by bits 4-11 of ``flags`` together with ``exponent``.
+
+The ``name`` field is the name string of the statistics data. The name string
+starts at the end of ``struct kvm_stats_desc``. The maximum length including
+the trailing ``'\0'``, is indicated by ``name_size`` in the header.
+
+The Stats Data block contains an array of 64-bit values in the same order
+as the descriptors in Descriptors block.
+
+4.134 KVM_GET_XSAVE2
+--------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave (out)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+ struct kvm_xsave {
+ __u32 region[1024];
+ __u32 extra[0];
+ };
+
+This ioctl would copy current vcpu's xsave struct to the userspace. It
+copies as many bytes as are returned by KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
+when invoked on the vm file descriptor. The size value returned by
+KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) will always be at least 4096.
+Currently, it is only greater than 4096 if a dynamic feature has been
+enabled with ``arch_prctl()``, but this may change in the future.
+
+The offsets of the state save areas in struct kvm_xsave follow the contents
+of CPUID leaf 0xD on the host.
+
+4.135 KVM_XEN_HVM_EVTCHN_SEND
+-----------------------------
+
+:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_irq_routing_xen_evtchn
+:Returns: 0 on success, < 0 on error
+
+
+::
+
+ struct kvm_irq_routing_xen_evtchn {
+ __u32 port;
+ __u32 vcpu;
+ __u32 priority;
+ };
+
+This ioctl injects an event channel interrupt directly to the guest vCPU.
+
+4.136 KVM_S390_PV_CPU_COMMAND
+-----------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, < 0 on error
+
+This ioctl closely mirrors `KVM_S390_PV_COMMAND` but handles requests
+for vcpus. It re-uses the kvm_s390_pv_dmp struct and hence also shares
+the command ids.
+
+**command:**
+
+KVM_PV_DUMP
+ Presents an API that provides calls which facilitate dumping a vcpu
+ of a protected VM.
+
+**subcommand:**
+
+KVM_PV_DUMP_CPU
+ Provides encrypted dump data like register values.
+ The length of the returned data is provided by uv_info.guest_cpu_stor_len.
+
+4.137 KVM_S390_ZPCI_OP
+----------------------
+
+:Capability: KVM_CAP_S390_ZPCI_OP
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_zpci_op (in)
+:Returns: 0 on success, <0 on error
+
+Used to manage hardware-assisted virtualization features for zPCI devices.
+
+Parameters are specified via the following structure::
+
+ struct kvm_s390_zpci_op {
+ /* in */
+ __u32 fh; /* target device */
+ __u8 op; /* operation to perform */
+ __u8 pad[3];
+ union {
+ /* for KVM_S390_ZPCIOP_REG_AEN */
+ struct {
+ __u64 ibv; /* Guest addr of interrupt bit vector */
+ __u64 sb; /* Guest addr of summary bit */
+ __u32 flags;
+ __u32 noi; /* Number of interrupts */
+ __u8 isc; /* Guest interrupt subclass */
+ __u8 sbo; /* Offset of guest summary bit vector */
+ __u16 pad;
+ } reg_aen;
+ __u64 reserved[8];
+ } u;
+ };
+
+The type of operation is specified in the "op" field.
+KVM_S390_ZPCIOP_REG_AEN is used to register the VM for adapter event
+notification interpretation, which will allow firmware delivery of adapter
+events directly to the vm, with KVM providing a backup delivery mechanism;
+KVM_S390_ZPCIOP_DEREG_AEN is used to subsequently disable interpretation of
+adapter event notifications.
+
+The target zPCI function must also be specified via the "fh" field. For the
+KVM_S390_ZPCIOP_REG_AEN operation, additional information to establish firmware
+delivery must be provided via the "reg_aen" struct.
+
+The "pad" and "reserved" fields may be used for future extensions and should be
+set to 0s by userspace.
+
5. The kvm_run structure
========================
@@ -4710,9 +5971,14 @@ local APIC is not used.
__u16 flags;
More architecture-specific flags detailing state of the VCPU that may
-affect the device's behavior. The only currently defined flag is
-KVM_RUN_X86_SMM, which is valid on x86 machines and is set if the
-VCPU is in system management mode.
+affect the device's behavior. Current defined flags::
+
+ /* x86, set if the VCPU is in system management mode */
+ #define KVM_RUN_X86_SMM (1 << 0)
+ /* x86, set if bus lock detected in VM */
+ #define KVM_RUN_BUS_LOCK (1 << 1)
+ /* arm64, set for KVM_EXIT_DEBUG */
+ #define KVM_DEBUG_ARCH_HSR_HIGH_VALID (1 << 0)
::
@@ -4746,6 +6012,7 @@ hardware_exit_reason.
/* KVM_EXIT_FAIL_ENTRY */
struct {
__u64 hardware_entry_failure_reason;
+ __u32 cpu; /* if KVM_LAST_CPU */
} fail_entry;
If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
@@ -4812,14 +6079,18 @@ to the byte array.
.. note::
- For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
- KVM_EXIT_EPR the corresponding
+ For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
+ KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
+ operations are complete (and guest state is consistent) only after userspace
+ has re-entered the kernel with KVM_RUN. The kernel side will first finish
+ incomplete operations and then check for pending signals.
-operations are complete (and guest state is consistent) only after userspace
-has re-entered the kernel with KVM_RUN. The kernel side will first finish
-incomplete operations and then check for pending signals. Userspace
-can re-enter the guest with an unmasked signal pending to complete
-pending operations.
+ The pending state of the operation is not preserved in state which is
+ visible to userspace, thus userspace should ensure that the operation is
+ completed before performing a live migration. Userspace can re-enter the
+ guest with an unmasked signal pending or with the immediate_exit field set
+ to complete pending operations without allowing any further instructions
+ to be executed.
::
@@ -4980,17 +6251,20 @@ should put the acknowledged interrupt vector into the 'epr' field.
#define KVM_SYSTEM_EVENT_SHUTDOWN 1
#define KVM_SYSTEM_EVENT_RESET 2
#define KVM_SYSTEM_EVENT_CRASH 3
+ #define KVM_SYSTEM_EVENT_WAKEUP 4
+ #define KVM_SYSTEM_EVENT_SUSPEND 5
+ #define KVM_SYSTEM_EVENT_SEV_TERM 6
__u32 type;
- __u64 flags;
+ __u32 ndata;
+ __u64 data[16];
} system_event;
If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
a system-level event using some architecture specific mechanism (hypercall
-or some special instruction). In case of ARM/ARM64, this is triggered using
-HVC instruction based PSCI call from the vcpu. The 'type' field describes
-the system-level event type. The 'flags' field describes architecture
-specific flags for the system-level event.
+or some special instruction). In case of ARM64, this is triggered using
+HVC instruction based PSCI call from the vcpu.
+The 'type' field describes the system-level event type.
Valid values for 'type' are:
- KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the
@@ -5004,6 +6278,54 @@ Valid values for 'type' are:
has requested a crash condition maintenance. Userspace can choose
to ignore the request, or to gather VM memory core dump and/or
reset/shutdown of the VM.
+ - KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination.
+ The guest physical address of the guest's GHCB is stored in `data[0]`.
+ - KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and
+ KVM has recognized a wakeup event. Userspace may honor this event by
+ marking the exiting vCPU as runnable, or deny it and call KVM_RUN again.
+ - KVM_SYSTEM_EVENT_SUSPEND -- the guest has requested a suspension of
+ the VM.
+
+If KVM_CAP_SYSTEM_EVENT_DATA is present, the 'data' field can contain
+architecture specific information for the system-level event. Only
+the first `ndata` items (possibly zero) of the data array are valid.
+
+ - for arm64, data[0] is set to KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2 if
+ the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI
+ specification.
+
+ - for RISC-V, data[0] is set to the value of the second argument of the
+ ``sbi_system_reset`` call.
+
+Previous versions of Linux defined a `flags` member in this struct. The
+field is now aliased to `data[0]`. Userspace can assume that it is only
+written if ndata is greater than 0.
+
+For arm/arm64:
+--------------
+
+KVM_SYSTEM_EVENT_SUSPEND exits are enabled with the
+KVM_CAP_ARM_SYSTEM_SUSPEND VM capability. If a guest invokes the PSCI
+SYSTEM_SUSPEND function, KVM will exit to userspace with this event
+type.
+
+It is the sole responsibility of userspace to implement the PSCI
+SYSTEM_SUSPEND call according to ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND".
+KVM does not change the vCPU's state before exiting to userspace, so
+the call parameters are left in-place in the vCPU registers.
+
+Userspace is _required_ to take action for such an exit. It must
+either:
+
+ - Honor the guest request to suspend the VM. Userspace can request
+ in-kernel emulation of suspension by setting the calling vCPU's
+ state to KVM_MP_STATE_SUSPENDED. Userspace must configure the vCPU's
+ state according to the parameters passed to the PSCI function when
+ the calling vCPU is resumed. See ARM DEN0022D.b 5.19.1 "Intended use"
+ for details on the function parameters.
+
+ - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2
+ "Caller responsibilities" for possible return values.
::
@@ -5024,10 +6346,13 @@ EOI was received.
struct kvm_hyperv_exit {
#define KVM_EXIT_HYPERV_SYNIC 1
#define KVM_EXIT_HYPERV_HCALL 2
+ #define KVM_EXIT_HYPERV_SYNDBG 3
__u32 type;
+ __u32 pad1;
union {
struct {
__u32 msr;
+ __u32 pad2;
__u64 control;
__u64 evt_page;
__u64 msg_page;
@@ -5037,6 +6362,15 @@ EOI was received.
__u64 result;
__u64 params[2];
} hcall;
+ struct {
+ __u32 msr;
+ __u32 pad2;
+ __u64 control;
+ __u64 status;
+ __u64 send_page;
+ __u64 recv_page;
+ __u64 pending_page;
+ } syndbg;
} u;
};
/* KVM_EXIT_HYPERV */
@@ -5053,6 +6387,12 @@ Hyper-V SynIC state change. Notification is used to remap SynIC
event/message pages and to enable/disable SynIC messages/events processing
in userspace.
+ - KVM_EXIT_HYPERV_SYNDBG -- synchronously notify user-space about
+
+Hyper-V Synthetic debugger state change. Notification is used to either update
+the pending_page location or to send a control command (send the buffer located
+in send_page or recv a buffer to recv_page).
+
::
/* KVM_EXIT_ARM_NISV */
@@ -5061,7 +6401,7 @@ in userspace.
__u64 fault_ipa;
} arm_nisv;
-Used on arm and arm64 systems. If a guest accesses memory not in a memslot,
+Used on arm64 systems. If a guest accesses memory not in a memslot,
KVM will typically return to userspace and ask it to do MMIO emulation on its
behalf. However, for certain classes of instructions, no instruction decode
(direction, length of memory access) is provided, and fetching and decoding
@@ -5078,11 +6418,10 @@ did not fall within an I/O window.
Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable
this capability at VM creation. Once this is done, these types of errors will
instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from
-the HSR (arm) and ESR_EL2 (arm64) in the esr_iss field, and the faulting IPA
-in the fault_ipa field. Userspace can either fix up the access if it's
-actually an I/O access by decoding the instruction from guest memory (if it's
-very brave) and continue executing the guest, or it can decide to suspend,
-dump, or restart the guest.
+the ESR_EL2 in the esr_iss field, and the faulting IPA in the fault_ipa field.
+Userspace can either fix up the access if it's actually an I/O access by
+decoding the instruction from guest memory (if it's very brave) and continue
+executing the guest, or it can decide to suspend, dump, or restart the guest.
Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
@@ -5090,6 +6429,112 @@ if it decides to decode and emulate the instruction.
::
+ /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
+ struct {
+ __u8 error; /* user -> kernel */
+ __u8 pad[7];
+ __u32 reason; /* kernel -> user */
+ __u32 index; /* kernel -> user */
+ __u64 data; /* kernel <-> user */
+ } msr;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
+enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
+will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
+exit for writes.
+
+The "reason" field specifies why the MSR trap occurred. User space will only
+receive MSR exit traps when a particular reason was requested during through
+ENABLE_CAP. Currently valid exit reasons are:
+
+ KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
+ KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
+ KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER
+
+For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
+wants to read. To respond to this request with a successful read, user space
+writes the respective data into the "data" field and must continue guest
+execution to ensure the read data is transferred into guest register state.
+
+If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+the "error" field. This will inject a #GP into the guest when the VCPU is
+executed again.
+
+For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
+wants to write. Once finished processing the event, user space must continue
+vCPU execution. If the MSR write was unsuccessful, user space also sets the
+"error" field to "1".
+
+::
+
+
+ struct kvm_xen_exit {
+ #define KVM_EXIT_XEN_HCALL 1
+ __u32 type;
+ union {
+ struct {
+ __u32 longmode;
+ __u32 cpl;
+ __u64 input;
+ __u64 result;
+ __u64 params[6];
+ } hcall;
+ } u;
+ };
+ /* KVM_EXIT_XEN */
+ struct kvm_hyperv_exit xen;
+
+Indicates that the VCPU exits into userspace to process some tasks
+related to Xen emulation.
+
+Valid values for 'type' are:
+
+ - KVM_EXIT_XEN_HCALL -- synchronously notify user-space about Xen hypercall.
+ Userspace is expected to place the hypercall result into the appropriate
+ field before invoking KVM_RUN again.
+
+::
+
+ /* KVM_EXIT_RISCV_SBI */
+ struct {
+ unsigned long extension_id;
+ unsigned long function_id;
+ unsigned long args[6];
+ unsigned long ret[2];
+ } riscv_sbi;
+
+If exit reason is KVM_EXIT_RISCV_SBI then it indicates that the VCPU has
+done a SBI call which is not handled by KVM RISC-V kernel module. The details
+of the SBI call are available in 'riscv_sbi' member of kvm_run structure. The
+'extension_id' field of 'riscv_sbi' represents SBI extension ID whereas the
+'function_id' field represents function ID of given SBI extension. The 'args'
+array field of 'riscv_sbi' represents parameters for the SBI call and 'ret'
+array field represents return values. The userspace should update the return
+values of SBI call before resuming the VCPU. For more details on RISC-V SBI
+spec refer, https://github.com/riscv/riscv-sbi-doc.
+
+::
+
+ /* KVM_EXIT_NOTIFY */
+ struct {
+ #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
+ __u32 flags;
+ } notify;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_NOTIFY_VMEXIT is
+enabled, a VM exit generated if no event window occurs in VM non-root mode
+for a specified amount of time. Once KVM_X86_NOTIFY_VMEXIT_USER is set when
+enabling the cap, it would exit to userspace with the exit reason
+KVM_EXIT_NOTIFY for further handling. The "flags" field contains more
+detailed info.
+
+The valid value for 'flags' is:
+
+ - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
+ in VMCS. It would run into unknown result if resume the target VM.
+
+::
+
/* Fix the size of the union. */
char padding[256];
};
@@ -5704,11 +7149,16 @@ and injected exceptions.
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
-:Architectures: x86, arm, arm64, mips
+:Architectures: x86, arm64, mips
:Parameters: args[0] whether feature should be enabled or not
-With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
-clear and write-protect all pages that are returned as dirty.
+Valid flags are::
+
+ #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0)
+ #define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
+
+With KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE is set, KVM_GET_DIRTY_LOG will not
+automatically clear and write-protect all pages that are returned as dirty.
Rather, userspace will have to do this operation separately using
KVM_CLEAR_DIRTY_LOG.
@@ -5719,18 +7169,378 @@ than requiring to sync a full memslot; this ensures that KVM does not
take spinlocks for an extended period of time. Second, in some cases a
large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
userspace actually using the data in the page. Pages can be modified
-during this time, which is inefficint for both the guest and userspace:
+during this time, which is inefficient for both the guest and userspace:
the guest will incur a higher penalty due to write protection faults,
while userspace can see false reports of dirty pages. Manual reprotection
helps reducing this time, improving guest performance and reducing the
number of dirty log false positives.
+With KVM_DIRTY_LOG_INITIALLY_SET set, all the bits of the dirty bitmap
+will be initialized to 1 when created. This also improves performance because
+dirty logging can be enabled gradually in small chunks on the first call
+to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
+KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
+x86 and arm64 for now).
+
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
it hard or impossible to use it correctly. The availability of
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
+7.19 KVM_CAP_PPC_SECURE_GUEST
+------------------------------
+
+:Architectures: ppc
+
+This capability indicates that KVM is running on a host that has
+ultravisor firmware and thus can support a secure guest. On such a
+system, a guest can ask the ultravisor to make it a secure guest,
+one whose memory is inaccessible to the host except for pages which
+are explicitly requested to be shared with the host. The ultravisor
+notifies KVM when a guest requests to become a secure guest, and KVM
+has the opportunity to veto the transition.
+
+If present, this capability can be enabled for a VM, meaning that KVM
+will allow the transition to secure guest mode. Otherwise KVM will
+veto the transition.
+
+7.20 KVM_CAP_HALT_POLL
+----------------------
+
+:Architectures: all
+:Target: VM
+:Parameters: args[0] is the maximum poll time in nanoseconds
+:Returns: 0 on success; -1 on error
+
+This capability overrides the kvm module parameter halt_poll_ns for the
+target VM.
+
+VCPU polling allows a VCPU to poll for wakeup events instead of immediately
+scheduling during guest halts. The maximum time a VCPU can spend polling is
+controlled by the kvm module parameter halt_poll_ns. This capability allows
+the maximum halt time to specified on a per-VM basis, effectively overriding
+the module parameter for the target VM.
+
+7.21 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
+:Returns: 0 on success; -1 on error
+
+This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
+into user space.
+
+When a guest requests to read or write an MSR, KVM may not implement all MSRs
+that are relevant to a respective system. It also does not differentiate by
+CPU type.
+
+To allow more fine grained control over MSR handling, user space may enable
+this capability. With it enabled, MSR accesses that match the mask specified in
+args[0] and trigger a #GP event inside the guest by KVM will instead trigger
+KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
+can then handle to implement model specific MSR handling and/or user notifications
+to inform a user that an MSR was not handled.
+
+7.22 KVM_CAP_X86_BUS_LOCK_EXIT
+-------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] defines the policy used when bus locks detected in guest
+:Returns: 0 on success, -EINVAL when args[0] contains invalid bits
+
+Valid bits in args[0] are::
+
+ #define KVM_BUS_LOCK_DETECTION_OFF (1 << 0)
+ #define KVM_BUS_LOCK_DETECTION_EXIT (1 << 1)
+
+Enabling this capability on a VM provides userspace with a way to select
+a policy to handle the bus locks detected in guest. Userspace can obtain
+the supported modes from the result of KVM_CHECK_EXTENSION and define it
+through the KVM_ENABLE_CAP.
+
+KVM_BUS_LOCK_DETECTION_OFF and KVM_BUS_LOCK_DETECTION_EXIT are supported
+currently and mutually exclusive with each other. More bits can be added in
+the future.
+
+With KVM_BUS_LOCK_DETECTION_OFF set, bus locks in guest will not cause vm exits
+so that no additional actions are needed. This is the default mode.
+
+With KVM_BUS_LOCK_DETECTION_EXIT set, vm exits happen when bus lock detected
+in VM. KVM just exits to userspace when handling them. Userspace can enforce
+its own throttling or other policy based mitigations.
+
+This capability is aimed to address the thread that VM can exploit bus locks to
+degree the performance of the whole system. Once the userspace enable this
+capability and select the KVM_BUS_LOCK_DETECTION_EXIT mode, KVM will set the
+KVM_RUN_BUS_LOCK flag in vcpu-run->flags field and exit to userspace. Concerning
+the bus lock vm exit can be preempted by a higher priority VM exit, the exit
+notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons.
+KVM_RUN_BUS_LOCK flag is used to distinguish between them.
+
+7.23 KVM_CAP_PPC_DAWR1
+----------------------
+
+:Architectures: ppc
+:Parameters: none
+:Returns: 0 on success, -EINVAL when CPU doesn't support 2nd DAWR
+
+This capability can be used to check / enable 2nd DAWR feature provided
+by POWER10 processor.
+
+
+7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
+-------------------------------------
+
+Architectures: x86 SEV enabled
+Type: vm
+Parameters: args[0] is the fd of the source vm
+Returns: 0 on success; ENOTTY on error
+
+This capability enables userspace to copy encryption context from the vm
+indicated by the fd to the vm this is called on.
+
+This is intended to support in-guest workloads scheduled by the host. This
+allows the in-guest workload to maintain its own NPTs and keeps the two vms
+from accidentally clobbering each other with interrupts and the like (separate
+APIC/MSRs/etc).
+
+7.25 KVM_CAP_SGX_ATTRIBUTE
+--------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is a file handle of a SGX attribute file in securityfs
+:Returns: 0 on success, -EINVAL if the file handle is invalid or if a requested
+ attribute is not supported by KVM.
+
+KVM_CAP_SGX_ATTRIBUTE enables a userspace VMM to grant a VM access to one or
+more priveleged enclave attributes. args[0] must hold a file handle to a valid
+SGX attribute file corresponding to an attribute that is supported/restricted
+by KVM (currently only PROVISIONKEY).
+
+The SGX subsystem restricts access to a subset of enclave attributes to provide
+additional security for an uncompromised kernel, e.g. use of the PROVISIONKEY
+is restricted to deter malware from using the PROVISIONKEY to obtain a stable
+system fingerprint. To prevent userspace from circumventing such restrictions
+by running an enclave in a VM, KVM prevents access to privileged attributes by
+default.
+
+See Documentation/x86/sgx.rst for more details.
+
+7.26 KVM_CAP_PPC_RPT_INVALIDATE
+-------------------------------
+
+:Capability: KVM_CAP_PPC_RPT_INVALIDATE
+:Architectures: ppc
+:Type: vm
+
+This capability indicates that the kernel is capable of handling
+H_RPT_INVALIDATE hcall.
+
+In order to enable the use of H_RPT_INVALIDATE in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+This capability is enabled for hypervisors on platforms like POWER9
+that support radix MMU.
+
+7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE
+--------------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether the feature should be enabled or not
+
+When this capability is enabled, an emulation failure will result in an exit
+to userspace with KVM_INTERNAL_ERROR (except when the emulator was invoked
+to handle a VMware backdoor instruction). Furthermore, KVM will now provide up
+to 15 instruction bytes for any exit to userspace resulting from an emulation
+failure. When these exits to userspace occur use the emulation_failure struct
+instead of the internal struct. They both have the same layout, but the
+emulation_failure struct matches the content better. It also explicitly
+defines the 'flags' field which is used to describe the fields in the struct
+that are valid (ie: if KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES is
+set in the 'flags' field then both 'insn_size' and 'insn_bytes' have valid data
+in them.)
+
+7.28 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When this capability is enabled all memory in memslots must be mapped as
+not-shareable (no MAP_SHARED), attempts to create a memslot with a
+MAP_SHARED mmap will result in an -EINVAL return.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
+7.29 KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
+-------------------------------------
+
+Architectures: x86 SEV enabled
+Type: vm
+Parameters: args[0] is the fd of the source vm
+Returns: 0 on success
+
+This capability enables userspace to migrate the encryption context from the VM
+indicated by the fd to the VM this is called on.
+
+This is intended to support intra-host migration of VMs between userspace VMMs,
+upgrading the VMM process without interrupting the guest.
+
+7.30 KVM_CAP_PPC_AIL_MODE_3
+-------------------------------
+
+:Capability: KVM_CAP_PPC_AIL_MODE_3
+:Architectures: ppc
+:Type: vm
+
+This capability indicates that the kernel supports the mode 3 setting for the
+"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location"
+resource that is controlled with the H_SET_MODE hypercall.
+
+This capability allows a guest kernel to use a better-performance mode for
+handling interrupts and system calls.
+
+7.31 KVM_CAP_DISABLE_QUIRKS2
+----------------------------
+
+:Capability: KVM_CAP_DISABLE_QUIRKS2
+:Parameters: args[0] - set of KVM quirks to disable
+:Architectures: x86
+:Type: vm
+
+This capability, if enabled, will cause KVM to disable some behavior
+quirks.
+
+Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of
+quirks that can be disabled in KVM.
+
+The argument to KVM_ENABLE_CAP for this capability is a bitmask of
+quirks to disable, and must be a subset of the bitmask returned by
+KVM_CHECK_EXTENSION.
+
+The valid bits in cap.args[0] are:
+
+=================================== ============================================
+ KVM_X86_QUIRK_LINT0_REENABLED By default, the reset value for the LVT
+ LINT0 register is 0x700 (APIC_MODE_EXTINT).
+ When this quirk is disabled, the reset value
+ is 0x10000 (APIC_LVT_MASKED).
+
+ KVM_X86_QUIRK_CD_NW_CLEARED By default, KVM clears CR0.CD and CR0.NW.
+ When this quirk is disabled, KVM does not
+ change the value of CR0.CD and CR0.NW.
+
+ KVM_X86_QUIRK_LAPIC_MMIO_HOLE By default, the MMIO LAPIC interface is
+ available even when configured for x2APIC
+ mode. When this quirk is disabled, KVM
+ disables the MMIO LAPIC interface if the
+ LAPIC is in x2APIC mode.
+
+ KVM_X86_QUIRK_OUT_7E_INC_RIP By default, KVM pre-increments %rip before
+ exiting to userspace for an OUT instruction
+ to port 0x7e. When this quirk is disabled,
+ KVM does not pre-increment %rip before
+ exiting to userspace.
+
+ KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT When this quirk is disabled, KVM sets
+ CPUID.01H:ECX[bit 3] (MONITOR/MWAIT) if
+ IA32_MISC_ENABLE[bit 18] (MWAIT) is set.
+ Additionally, when this quirk is disabled,
+ KVM clears CPUID.01H:ECX[bit 3] if
+ IA32_MISC_ENABLE[bit 18] is cleared.
+
+ KVM_X86_QUIRK_FIX_HYPERCALL_INSN By default, KVM rewrites guest
+ VMMCALL/VMCALL instructions to match the
+ vendor's hypercall instruction for the
+ system. When this quirk is disabled, KVM
+ will no longer rewrite invalid guest
+ hypercall instructions. Executing the
+ incorrect hypercall instruction will
+ generate a #UD within the guest.
+
+KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
+ they are intercepted) as NOPs regardless of
+ whether or not MONITOR/MWAIT are supported
+ according to guest CPUID. When this quirk
+ is disabled and KVM_X86_DISABLE_EXITS_MWAIT
+ is not set (MONITOR/MWAIT are intercepted),
+ KVM will inject a #UD on MONITOR/MWAIT if
+ they're unsupported per guest CPUID. Note,
+ KVM will modify MONITOR/MWAIT support in
+ guest CPUID on writes to MISC_ENABLE if
+ KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
+ disabled.
+=================================== ============================================
+
+7.32 KVM_CAP_MAX_VCPU_ID
+------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] - maximum APIC ID value set for current VM
+:Returns: 0 on success, -EINVAL if args[0] is beyond KVM_MAX_VCPU_IDS
+ supported in KVM or if it has been set.
+
+This capability allows userspace to specify maximum possible APIC ID
+assigned for current VM session prior to the creation of vCPUs, saving
+memory for data structures indexed by the APIC ID. Userspace is able
+to calculate the limit to APIC ID values from designated
+CPU topology.
+
+The value can be changed only until KVM_ENABLE_CAP is set to a nonzero
+value or until a vCPU is created. Upon creation of the first vCPU,
+if the value was set to zero or KVM_ENABLE_CAP was not invoked, KVM
+uses the return value of KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPU_ID) as
+the maximum APIC ID.
+
+7.33 KVM_CAP_X86_NOTIFY_VMEXIT
+------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the value of notify window as well as some flags
+:Returns: 0 on success, -EINVAL if args[0] contains invalid flags or notify
+ VM exit is unsupported.
+
+Bits 63:32 of args[0] are used for notify window.
+Bits 31:0 of args[0] are for some flags. Valid bits are::
+
+ #define KVM_X86_NOTIFY_VMEXIT_ENABLED (1 << 0)
+ #define KVM_X86_NOTIFY_VMEXIT_USER (1 << 1)
+
+This capability allows userspace to configure the notify VM exit on/off
+in per-VM scope during VM creation. Notify VM exit is disabled by default.
+When userspace sets KVM_X86_NOTIFY_VMEXIT_ENABLED bit in args[0], VMM will
+enable this feature with the notify window provided, which will generate
+a VM exit if no event window occurs in VM non-root mode for a specified of
+time (notify window).
+
+If KVM_X86_NOTIFY_VMEXIT_USER is set in args[0], upon notify VM exits happen,
+KVM would exit to userspace for handling.
+
+This capability is aimed to mitigate the threat that malicious VMs can
+cause CPU stuck (due to event windows don't open up) and make the CPU
+unavailable to host or other VMs.
+
8. Other capabilities.
======================
@@ -5743,7 +7553,7 @@ features of the KVM implementation.
:Architectures: ppc
This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel has an implementation of the
+available, means that the kernel has an implementation of the
H_RANDOM hypercall backed by a hardware random-number generator.
If present, the kernel H_RANDOM handler can be enabled for guest use
with the KVM_CAP_PPC_ENABLE_HCALL capability.
@@ -5754,7 +7564,7 @@ with the KVM_CAP_PPC_ENABLE_HCALL capability.
:Architectures: x86
This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel has an implementation of the
+available, means that the kernel has an implementation of the
Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is
used to support Windows Hyper-V based guest paravirt drivers(VMBus).
@@ -5769,7 +7579,7 @@ by the CPU, as it's incompatible with SynIC auto-EOI behavior.
:Architectures: ppc
This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel can support guests using the
+available, means that the kernel can support guests using the
radix MMU defined in Power ISA V3.00 (as implemented in the POWER9
processor).
@@ -5779,7 +7589,7 @@ processor).
:Architectures: ppc
This capability, if KVM_CHECK_EXTENSION indicates that it is
-available, means that that the kernel can support guests using the
+available, means that the kernel can support guests using the
hashed page table MMU defined in Power ISA V3.00 (as implemented in
the POWER9 processor), including in-memory segment tables.
@@ -5858,7 +7668,7 @@ reserved.
8.9 KVM_CAP_ARM_USER_IRQ
------------------------
-:Architectures: arm, arm64
+:Architectures: arm64
This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
that if userspace creates a VM without an in-kernel interrupt controller, it
@@ -5884,7 +7694,7 @@ run->kvm_valid_regs or run->kvm_dirty_regs bits.
If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a
number larger than 0 indicating the version of this capability is implemented
-and thereby which bits in in run->s.regs.device_irq_level can signal values.
+and thereby which bits in run->s.regs.device_irq_level can signal values.
Currently the following bits are defined for the device_irq_level bitmap::
@@ -5985,7 +7795,7 @@ HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
8.19 KVM_CAP_ARM_INJECT_SERROR_ESR
----------------------------------
-:Architectures: arm, arm64
+:Architectures: arm64
This capability indicates that userspace can specify (via the
KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it
@@ -6009,7 +7819,7 @@ HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
-----------------------------------
-:Architecture: x86
+:Architectures: x86
This capability indicates that KVM running on top of Hyper-V hypervisor
enables Direct TLB flush for its guests meaning that TLB flush
@@ -6022,8 +7832,431 @@ in CPUID and only exposes Hyper-V identification. In this case, guest
thinks it's running on Hyper-V and only use Hyper-V hypercalls.
8.22 KVM_CAP_S390_VCPU_RESETS
+-----------------------------
-Architectures: s390
+:Architectures: s390
This capability indicates that the KVM_S390_NORMAL_RESET and
KVM_S390_CLEAR_RESET ioctls are available.
+
+8.23 KVM_CAP_S390_PROTECTED
+---------------------------
+
+:Architectures: s390
+
+This capability indicates that the Ultravisor has been initialized and
+KVM can therefore start protected VMs.
+This capability governs the KVM_S390_PV_COMMAND ioctl and the
+KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
+guests when the state change is invalid.
+
+8.24 KVM_CAP_STEAL_TIME
+-----------------------
+
+:Architectures: arm64, x86
+
+This capability indicates that KVM supports steal time accounting.
+When steal time accounting is supported it may be enabled with
+architecture-specific interfaces. This capability and the architecture-
+specific interfaces must be consistent, i.e. if one says the feature
+is supported, than the other should as well and vice versa. For arm64
+see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL".
+For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME".
+
+8.25 KVM_CAP_S390_DIAG318
+-------------------------
+
+:Architectures: s390
+
+This capability enables a guest to set information about its control program
+(i.e. guest kernel type and version). The information is helpful during
+system/firmware service events, providing additional data about the guest
+environments running on the machine.
+
+The information is associated with the DIAGNOSE 0x318 instruction, which sets
+an 8-byte value consisting of a one-byte Control Program Name Code (CPNC) and
+a 7-byte Control Program Version Code (CPVC). The CPNC determines what
+environment the control program is running in (e.g. Linux, z/VM...), and the
+CPVC is used for information specific to OS (e.g. Linux version, Linux
+distribution...)
+
+If this capability is available, then the CPNC and CPVC can be synchronized
+between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318).
+
+8.26 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space. It can be enabled on a VM level. If enabled, MSR
+accesses that would usually trigger a #GP by KVM into the guest will
+instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
+KVM_EXIT_X86_WRMSR exit notifications.
+
+8.27 KVM_CAP_X86_MSR_FILTER
+---------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports that accesses to user defined MSRs
+may be rejected. With this capability exposed, KVM exports new VM ioctl
+KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
+ranges that KVM should reject access to.
+
+In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
+trap and emulate MSRs that are outside of the scope of KVM as well as
+limit the attack surface on KVM's MSR emulation code.
+
+8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID
+-------------------------------------
+
+Architectures: x86
+
+When enabled, KVM will disable paravirtual features provided to the
+guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
+(0x40000001). Otherwise, a guest may use the paravirtual features
+regardless of what has actually been exposed through the CPUID leaf.
+
+8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+----------------------------------------------------------
+
+:Architectures: x86
+:Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+The dirty ring is available to userspace as an array of
+``struct kvm_dirty_gfn``. Each dirty entry it's defined as::
+
+ struct kvm_dirty_gfn {
+ __u32 flags;
+ __u32 slot; /* as_id | slot_id */
+ __u64 offset;
+ };
+
+The following values are defined for the flags field to define the
+current state of the entry::
+
+ #define KVM_DIRTY_GFN_F_DIRTY BIT(0)
+ #define KVM_DIRTY_GFN_F_RESET BIT(1)
+ #define KVM_DIRTY_GFN_F_MASK 0x3
+
+Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
+ioctl to enable this capability for the new guest and set the size of
+the rings. Enabling the capability is only allowed before creating any
+vCPU, and the size of the ring must be a power of two. The larger the
+ring buffer, the less likely the ring is full and the VM is forced to
+exit to userspace. The optimal size depends on the workload, but it is
+recommended that it be at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+An entry in the ring buffer can be unused (flag bits ``00``),
+dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
+state machine for the entry is as follows::
+
+ dirtied harvested reset
+ 00 -----------> 01 -------------> 1X -------+
+ ^ |
+ | |
+ +------------------------------------------+
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage
+the RESET bit must be cleared), then it means this GFN is a dirty GFN.
+The userspace should harvest this GFN and mark the flags from state
+``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
+to show that this GFN is harvested and waiting for a reset), and move
+on to the next GFN. The userspace should continue to do this until the
+flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
+all the dirty GFNs that were available.
+
+Note that on weakly ordered architectures, userspace accesses to the
+ring buffer (and more specifically the 'flags' field) must be ordered,
+using load-acquire/store-release accessors when available, or any
+other memory barrier that will ensure this ordering.
+
+It's not necessary for userspace to harvest the all dirty GFNs at once.
+However it must collect the dirty GFNs in sequence, i.e., the userspace
+program cannot skip one dirty GFN to collect the one next to it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring can get full. When it happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
+
+The dirty ring interface has a major difference comparing to the
+KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
+userspace, it's still possible that the kernel has not yet flushed the
+processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
+flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
+needs to kick the vcpu out of KVM_RUN using a signal. The resulting
+vmexit ensures that all dirty GFNs are flushed to the dirty rings.
+
+NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
+ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
+KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling
+KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
+machine will switch to ring-buffer dirty page tracking and further
+KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
+
+NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
+should be exposed by weakly ordered architecture, in order to indicate
+the additional memory ordering requirements imposed on userspace when
+reading the state of an entry and mutating it from DIRTY to HARVESTED.
+Architecture with TSO-like ordering (such as x86) are allowed to
+expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+to userspace.
+
+8.30 KVM_CAP_XEN_HVM
+--------------------
+
+:Architectures: x86
+
+This capability indicates the features that Xen supports for hosting Xen
+PVHVM guests. Valid flags are::
+
+ #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
+ #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
+ #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
+ #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
+ #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
+ #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
+
+The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
+ioctl is available, for the guest to set its hypercall page.
+
+If KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL is also set, the same flag may also be
+provided in the flags to KVM_XEN_HVM_CONFIG, without providing hypercall page
+contents, to request that KVM generate hypercall page content automatically
+and also enable interception of guest hypercalls with KVM_EXIT_XEN.
+
+The KVM_XEN_HVM_CONFIG_SHARED_INFO flag indicates the availability of the
+KVM_XEN_HVM_SET_ATTR, KVM_XEN_HVM_GET_ATTR, KVM_XEN_VCPU_SET_ATTR and
+KVM_XEN_VCPU_GET_ATTR ioctls, as well as the delivery of exception vectors
+for event channel upcalls when the evtchn_upcall_pending field of a vcpu's
+vcpu_info is set.
+
+The KVM_XEN_HVM_CONFIG_RUNSTATE flag indicates that the runstate-related
+features KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR/_CURRENT/_DATA/_ADJUST are
+supported by the KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR ioctls.
+
+The KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL flag indicates that IRQ routing entries
+of the type KVM_IRQ_ROUTING_XEN_EVTCHN are supported, with the priority
+field set to indicate 2 level event channel delivery.
+
+The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates that KVM supports
+injecting event channel events directly into the guest with the
+KVM_XEN_HVM_EVTCHN_SEND ioctl. It also indicates support for the
+KVM_XEN_ATTR_TYPE_EVTCHN/XEN_VERSION HVM attributes and the
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes.
+related to event channel delivery, timers, and the XENVER_version
+interception.
+
+8.31 KVM_CAP_PPC_MULTITCE
+-------------------------
+
+:Capability: KVM_CAP_PPC_MULTITCE
+:Architectures: ppc
+:Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+8.32 KVM_CAP_PTP_KVM
+--------------------
+
+:Architectures: arm64
+
+This capability indicates that the KVM virtual PTP service is
+supported in the host. A VMM can check whether the service is
+available to the guest on migration.
+
+8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
+---------------------------------
+
+Architectures: x86
+
+When enabled, KVM will disable emulated Hyper-V features provided to the
+guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all
+currently implmented Hyper-V features are provided unconditionally when
+Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
+leaf.
+
+8.34 KVM_CAP_EXIT_HYPERCALL
+---------------------------
+
+:Capability: KVM_CAP_EXIT_HYPERCALL
+:Architectures: x86
+:Type: vm
+
+This capability, if enabled, will cause KVM to exit to userspace
+with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
+
+Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
+of hypercalls that can be configured to exit to userspace.
+Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.
+
+The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
+of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
+the hypercalls whose corresponding bit is in the argument, and return
+ENOSYS for the others.
+
+8.35 KVM_CAP_PMU_CAPABILITY
+---------------------------
+
+:Capability KVM_CAP_PMU_CAPABILITY
+:Architectures: x86
+:Type: vm
+:Parameters: arg[0] is bitmask of PMU virtualization capabilities.
+:Returns 0 on success, -EINVAL when arg[0] contains invalid bits
+
+This capability alters PMU virtualization in KVM.
+
+Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of
+PMU virtualization capabilities that can be adjusted on a VM.
+
+The argument to KVM_ENABLE_CAP is also a bitmask and selects specific
+PMU virtualization capabilities to be applied to the VM. This can
+only be invoked on a VM prior to the creation of VCPUs.
+
+At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
+this capability will disable PMU virtualization for that VM. Usermode
+should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
+
+8.36 KVM_CAP_ARM_SYSTEM_SUSPEND
+-------------------------------
+
+:Capability: KVM_CAP_ARM_SYSTEM_SUSPEND
+:Architectures: arm64
+:Type: vm
+
+When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
+type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
+
+8.37 KVM_CAP_S390_PROTECTED_DUMP
+--------------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM and the Ultravisor support dumping
+PV guests. The `KVM_PV_DUMP` command is available for the
+`KVM_S390_PV_COMMAND` ioctl and the `KVM_PV_INFO` command provides
+dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
+available and supports the `KVM_PV_DUMP_CPU` subcommand.
+
+8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+-------------------------------------
+
+:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+:Architectures: x86
+:Type: vm
+:Parameters: arg[0] must be 0.
+:Returns: 0 on success, -EPERM if the userspace process does not
+ have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
+ created.
+
+This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
+
+The capability has no effect if the nx_huge_pages module parameter is not set.
+
+This capability may only be set before any vCPUs are created.
+
+8.39 KVM_CAP_S390_CPU_TOPOLOGY
+------------------------------
+
+:Capability: KVM_CAP_S390_CPU_TOPOLOGY
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM will provide the S390 CPU Topology
+facility which consist of the interpretation of the PTF instruction for
+the function code 2 along with interception and forwarding of both the
+PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
+instruction to the userland hypervisor.
+
+The stfle facility 11, CPU Topology facility, should not be indicated
+to the guest without this capability.
+
+When this capability is present, KVM provides a new attribute group
+on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
+This new attribute allows to get, set or clear the Modified Change
+Topology Report (MTCR) bit of the SCA through the kvm_device_attr
+structure.
+
+When getting the Modified Change Topology Report value, the attr->addr
+must point to a byte where the value will be stored or retrieved from.
+
+9. Known KVM API problems
+=========================
+
+In some cases, KVM's API has some inconsistencies or common pitfalls
+that userspace need to be aware of. This section details some of
+these issues.
+
+Most of them are architecture specific, so the section is split by
+architecture.
+
+9.1. x86
+--------
+
+``KVM_GET_SUPPORTED_CPUID`` issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In general, ``KVM_GET_SUPPORTED_CPUID`` is designed so that it is possible
+to take its result and pass it directly to ``KVM_SET_CPUID2``. This section
+documents some cases in which that requires some care.
+
+Local APIC features
+~~~~~~~~~~~~~~~~~~~
+
+CPU[EAX=1]:ECX[21] (X2APIC) is reported by ``KVM_GET_SUPPORTED_CPUID``,
+but it can only be enabled if ``KVM_CREATE_IRQCHIP`` or
+``KVM_ENABLE_CAP(KVM_CAP_IRQCHIP_SPLIT)`` are used to enable in-kernel emulation of
+the local APIC.
+
+The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature.
+
+CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``.
+It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
+has enabled in-kernel emulation of the local APIC.
+
+Obsolete ioctls and capabilities
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+KVM_CAP_DISABLE_QUIRKS does not let userspace know which quirks are actually
+available. Use ``KVM_CHECK_EXTENSION(KVM_CAP_DISABLE_QUIRKS2)`` instead if
+available.
+
+Ordering of KVM_GET_*/KVM_SET_* ioctls
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TBD
diff --git a/Documentation/virt/kvm/arm/hyp-abi.rst b/Documentation/virt/kvm/arm/hyp-abi.rst
index d1fc27d848e9..412b276449d3 100644
--- a/Documentation/virt/kvm/arm/hyp-abi.rst
+++ b/Documentation/virt/kvm/arm/hyp-abi.rst
@@ -11,6 +11,11 @@ hypervisor when running as a guest (under Xen, KVM or any other
hypervisor), or any hypervisor-specific interaction when the kernel is
used as a host.
+Note: KVM/arm has been removed from the kernel. The API described
+here is still valid though, as it allows the kernel to kexec when
+booted at HYP. It can also be used by a hypervisor other than KVM
+if necessary.
+
On arm and arm64 (without VHE), the kernel doesn't run in hypervisor
mode, but still needs to interact with it, allowing a built-in
hypervisor to be either installed or torn down.
@@ -49,9 +54,19 @@ these functions (see arch/arm{,64}/include/asm/virt.h):
x3 = x1's value when entering the next payload (arm64)
x4 = x2's value when entering the next payload (arm64)
- Mask all exceptions, disable the MMU, move the arguments into place
- (arm64 only), and jump to the restart address while at HYP/EL2. This
- hypercall is not expected to return to its caller.
+ Mask all exceptions, disable the MMU, clear I+D bits, move the arguments
+ into place (arm64 only), and jump to the restart address while at HYP/EL2.
+ This hypercall is not expected to return to its caller.
+
+* ::
+
+ x0 = HVC_FINALISE_EL2 (arm64 only)
+
+ Finish configuring EL2 depending on the command-line options,
+ including an attempt to upgrade the kernel's exception level from
+ EL1 to EL2 by enabling the VHE mode. This is conditioned by the CPU
+ supporting VHE, the EL2 MMU being off, and VHE not being disabled by
+ any other means (command line option, for example).
Any other value of r0/x0 triggers a hypervisor-specific handling,
which is not documented here.
diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst
new file mode 100644
index 000000000000..3e23084644ba
--- /dev/null
+++ b/Documentation/virt/kvm/arm/hypercalls.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+ARM Hypercall Interface
+=======================
+
+KVM handles the hypercall services as requested by the guests. New hypercall
+services are regularly made available by the ARM specification or by KVM (as
+vendor services) if they make sense from a virtualization point of view.
+
+This means that a guest booted on two different versions of KVM can observe
+two different "firmware" revisions. This could cause issues if a given guest
+is tied to a particular version of a hypercall service, or if a migration
+causes a different version to be exposed out of the blue to an unsuspecting
+guest.
+
+In order to remedy this situation, KVM exposes a set of "firmware
+pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
+interface. These registers can be saved/restored by userspace, and set
+to a convenient value as required.
+
+The following registers are defined:
+
+* KVM_REG_ARM_PSCI_VERSION:
+
+ KVM implements the PSCI (Power State Coordination Interface)
+ specification in order to provide services such as CPU on/off, reset
+ and power-off to the guest.
+
+ - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
+ (and thus has already been initialized)
+ - Returns the current PSCI version on GET_ONE_REG (defaulting to the
+ highest PSCI version implemented by KVM and compatible with v0.2)
+ - Allows any PSCI version implemented by KVM and compatible with
+ v0.2 to be set with SET_ONE_REG
+ - Affects the whole VM (even if the register view is per-vcpu)
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
+ Holds the state of the firmware support to mitigate CVE-2017-5715, as
+ offered by KVM to the guest via a HVC call. The workaround is described
+ under SMCCC_ARCH_WORKAROUND_1 in [1].
+
+ Accepted values are:
+
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
+ KVM does not offer
+ firmware support for the workaround. The mitigation status for the
+ guest is unknown.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
+ The workaround HVC call is
+ available to the guest and required for the mitigation.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
+ The workaround HVC call
+ is available to the guest, but it is not needed on this VCPU.
+
+* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
+ Holds the state of the firmware support to mitigate CVE-2018-3639, as
+ offered by KVM to the guest via a HVC call. The workaround is described
+ under SMCCC_ARCH_WORKAROUND_2 in [1]_.
+
+ Accepted values are:
+
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
+ A workaround is not
+ available. KVM does not offer firmware support for the workaround.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
+ The workaround state is
+ unknown. KVM does not offer firmware support for the workaround.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
+ The workaround is available,
+ and can be disabled by a vCPU. If
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
+ this vCPU.
+ KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
+ The workaround is always active on this vCPU or it is not needed.
+
+
+Bitmap Feature Firmware Registers
+---------------------------------
+
+Contrary to the above registers, the following registers exposes the
+hypercall services in the form of a feature-bitmap to the userspace. This
+bitmap is translated to the services that are available to the guest.
+There is a register defined per service call owner and can be accessed via
+GET/SET_ONE_REG interface.
+
+By default, these registers are set with the upper limit of the features
+that are supported. This way userspace can discover all the usable
+hypercall services via GET_ONE_REG. The user-space can write-back the
+desired bitmap back via SET_ONE_REG. The features for the registers that
+are untouched, probably because userspace isn't aware of them, will be
+exposed as is to the guest.
+
+Note that KVM will not allow the userspace to configure the registers
+anymore once any of the vCPUs has run at least once. Instead, it will
+return a -EBUSY.
+
+The pseudo-firmware bitmap register are as follows:
+
+* KVM_REG_ARM_STD_BMAP:
+ Controls the bitmap of the ARM Standard Secure Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_STD_BIT_TRNG_V1_0:
+ The bit represents the services offered under v1.0 of ARM True Random
+ Number Generator (TRNG) specification, ARM DEN0098.
+
+* KVM_REG_ARM_STD_HYP_BMAP:
+ Controls the bitmap of the ARM Standard Hypervisor Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_STD_HYP_BIT_PV_TIME:
+ The bit represents the Paravirtualized Time service as represented by
+ ARM DEN0057A.
+
+* KVM_REG_ARM_VENDOR_HYP_BMAP:
+ Controls the bitmap of the Vendor specific Hypervisor Service Calls.
+
+ The following bits are accepted:
+
+ Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT
+ The bit represents the ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID
+ and ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID function-ids.
+
+ Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
+ The bit represents the Precision Time Protocol KVM service.
+
+Errors:
+
+ ======= =============================================================
+ -ENOENT Unknown register accessed.
+ -EBUSY Attempt a 'write' to the register after the VM has started.
+ -EINVAL Invalid bitmap written to the register.
+ ======= =============================================================
+
+.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index 3e2b2aba90fc..e84848432158 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -8,5 +8,6 @@ ARM
:maxdepth: 2
hyp-abi
- psci
+ hypercalls
pvtime
+ ptp_kvm
diff --git a/Documentation/virt/kvm/arm/psci.rst b/Documentation/virt/kvm/arm/psci.rst
deleted file mode 100644
index d52c2e83b5b8..000000000000
--- a/Documentation/virt/kvm/arm/psci.rst
+++ /dev/null
@@ -1,77 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================
-Power State Coordination Interface (PSCI)
-=========================================
-
-KVM implements the PSCI (Power State Coordination Interface)
-specification in order to provide services such as CPU on/off, reset
-and power-off to the guest.
-
-The PSCI specification is regularly updated to provide new features,
-and KVM implements these updates if they make sense from a virtualization
-point of view.
-
-This means that a guest booted on two different versions of KVM can
-observe two different "firmware" revisions. This could cause issues if
-a given guest is tied to a particular PSCI revision (unlikely), or if
-a migration causes a different PSCI version to be exposed out of the
-blue to an unsuspecting guest.
-
-In order to remedy this situation, KVM exposes a set of "firmware
-pseudo-registers" that can be manipulated using the GET/SET_ONE_REG
-interface. These registers can be saved/restored by userspace, and set
-to a convenient value if required.
-
-The following register is defined:
-
-* KVM_REG_ARM_PSCI_VERSION:
-
- - Only valid if the vcpu has the KVM_ARM_VCPU_PSCI_0_2 feature set
- (and thus has already been initialized)
- - Returns the current PSCI version on GET_ONE_REG (defaulting to the
- highest PSCI version implemented by KVM and compatible with v0.2)
- - Allows any PSCI version implemented by KVM and compatible with
- v0.2 to be set with SET_ONE_REG
- - Affects the whole VM (even if the register view is per-vcpu)
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1:
- Holds the state of the firmware support to mitigate CVE-2017-5715, as
- offered by KVM to the guest via a HVC call. The workaround is described
- under SMCCC_ARCH_WORKAROUND_1 in [1].
-
- Accepted values are:
-
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL:
- KVM does not offer
- firmware support for the workaround. The mitigation status for the
- guest is unknown.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL:
- The workaround HVC call is
- available to the guest and required for the mitigation.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED:
- The workaround HVC call
- is available to the guest, but it is not needed on this VCPU.
-
-* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2:
- Holds the state of the firmware support to mitigate CVE-2018-3639, as
- offered by KVM to the guest via a HVC call. The workaround is described
- under SMCCC_ARCH_WORKAROUND_2 in [1]_.
-
- Accepted values are:
-
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL:
- A workaround is not
- available. KVM does not offer firmware support for the workaround.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN:
- The workaround state is
- unknown. KVM does not offer firmware support for the workaround.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL:
- The workaround is available,
- and can be disabled by a vCPU. If
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for
- this vCPU.
- KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED:
- The workaround is always active on this vCPU or it is not needed.
-
-.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf
diff --git a/Documentation/virt/kvm/arm/ptp_kvm.rst b/Documentation/virt/kvm/arm/ptp_kvm.rst
new file mode 100644
index 000000000000..aecdc80ddcd8
--- /dev/null
+++ b/Documentation/virt/kvm/arm/ptp_kvm.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+PTP_KVM support for arm/arm64
+=============================
+
+PTP_KVM is used for high precision time sync between host and guests.
+It relies on transferring the wall clock and counter value from the
+host to the guest using a KVM-specific hypercall.
+
+* ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: 0x86000001
+
+This hypercall uses the SMC32/HVC32 calling convention:
+
+ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID
+ ============== ======== =====================================
+ Function ID: (uint32) 0x86000001
+ Arguments: (uint32) KVM_PTP_VIRT_COUNTER(0)
+ KVM_PTP_PHYS_COUNTER(1)
+ Return Values: (int32) NOT_SUPPORTED(-1) on error, or
+ (uint32) Upper 32 bits of wall clock time (r0)
+ (uint32) Lower 32 bits of wall clock time (r1)
+ (uint32) Upper 32 bits of counter (r2)
+ (uint32) Lower 32 bits of counter (r3)
+ Endianness: No Restrictions.
+ ============== ======== =====================================
diff --git a/Documentation/virt/kvm/arm/pvtime.rst b/Documentation/virt/kvm/arm/pvtime.rst
index 2357dd2d8655..392521af7c90 100644
--- a/Documentation/virt/kvm/arm/pvtime.rst
+++ b/Documentation/virt/kvm/arm/pvtime.rst
@@ -19,8 +19,8 @@ Two new SMCCC compatible hypercalls are defined:
These are only available in the SMC64/HVC64 calling convention as
paravirtualized time is not available to 32 bit Arm guests. The existence of
-the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES
-mechanism before calling it.
+the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
+ARCH_FEATURES mechanism before calling it.
PV_TIME_FEATURES
============= ======== ==========
@@ -76,5 +76,5 @@ It is advisable that one or more 64k pages are set aside for the purpose of
these structures and not used for other purposes, this enables the guest to map
the region using 64k pages and avoids conflicting attributes with other memory.
-For the user space interface see Documentation/virt/kvm/devices/vcpu.txt
+For the user space interface see Documentation/virt/kvm/devices/vcpu.rst
section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL".
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
deleted file mode 100644
index 01b081f6e7ea..000000000000
--- a/Documentation/virt/kvm/cpuid.rst
+++ /dev/null
@@ -1,107 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-KVM CPUID bits
-==============
-
-:Author: Glauber Costa <glommer@gmail.com>
-
-A guest running on a kvm host, can check some of its features using
-cpuid. This is not always guaranteed to work, since userspace can
-mask-out some, or even all KVM-related cpuid features before launching
-a guest.
-
-KVM cpuid functions are:
-
-function: KVM_CPUID_SIGNATURE (0x40000000)
-
-returns::
-
- eax = 0x40000001
- ebx = 0x4b4d564b
- ecx = 0x564b4d56
- edx = 0x4d
-
-Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
-The value in eax corresponds to the maximum cpuid function present in this leaf,
-and will be updated if more functions are added in the future.
-Note also that old hosts set eax value to 0x0. This should
-be interpreted as if the value was 0x40000001.
-This function queries the presence of KVM cpuid leafs.
-
-function: define KVM_CPUID_FEATURES (0x40000001)
-
-returns::
-
- ebx, ecx
- eax = an OR'ed group of (1 << flag)
-
-where ``flag`` is defined as below:
-
-================================= =========== ================================
-flag value meaning
-================================= =========== ================================
-KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
- 0x11 and 0x12
-
-KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
- on PIO operations
-
-KVM_FEATURE_MMU_OP 2 deprecated
-
-KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
-
- 0x4b564d00 and 0x4b564d01
-KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
- writing to msr 0x4b564d02
-
-KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by
- writing to msr 0x4b564d03
-
-KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
- handler can be enabled by
- writing to msr 0x4b564d04
-
-KVM_FEATURE_PV_UNHAULT 7 guest checks this feature bit
- before enabling paravirtualized
- spinlock support
-
-KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit
- before enabling paravirtualized
- tlb flush
-
-KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT
- can be enabled by setting bit 2
- when writing to msr 0x4b564d02
-
-KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
- before enabling paravirtualized
- sebd IPIs
-
-KVM_FEATURE_PV_POLL_CONTROL 12 host-side polling on HLT can
- be disabled by writing
- to msr 0x4b564d05.
-
-KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
- before using paravirtualized
- sched yield.
-
-KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side
- per-cpu warps are expeced in
- kvmclock
-================================= =========== ================================
-
-::
-
- edx = an OR'ed group of (1 << flag)
-
-Where ``flag`` here is defined as below:
-
-================== ============ =================================
-flag value meaning
-================== ============ =================================
-KVM_HINTS_REALTIME 0 guest checks this feature bit to
- determine that vCPUs are never
- preempted for an unlimited time
- allowing optimizations
-================== ============ =================================
diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.rst b/Documentation/virt/kvm/devices/arm-vgic-its.rst
index 6c304fd2b1b4..d257eddbae29 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-its.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst
@@ -80,7 +80,7 @@ KVM_DEV_ARM_VGIC_GRP_CTRL
-EFAULT Invalid guest ram access
-EBUSY One or more VCPUS are running
-EACCES The virtual ITS is backed by a physical GICv4 ITS, and the
- state is not available
+ state is not available without GICv4.1
======= ==========================================================
KVM_DEV_ARM_VGIC_GRP_ITS_REGS
diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.rst b/Documentation/virt/kvm/devices/arm-vgic-v3.rst
index 5dd3bff51978..51e5e5762571 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-v3.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-v3.rst
@@ -228,7 +228,7 @@ Groups:
KVM_DEV_ARM_VGIC_CTRL_INIT
request the initialization of the VGIC, no additional parameter in
- kvm_device_attr.addr.
+ kvm_device_attr.addr. Must be called after all VCPUs have been created.
KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES
save all LPI pending bits into guest RAM pending tables.
diff --git a/Documentation/virt/kvm/devices/s390_flic.rst b/Documentation/virt/kvm/devices/s390_flic.rst
index 954190da7d04..ea96559ba501 100644
--- a/Documentation/virt/kvm/devices/s390_flic.rst
+++ b/Documentation/virt/kvm/devices/s390_flic.rst
@@ -108,16 +108,9 @@ Groups:
mask or unmask the adapter, as specified in mask
KVM_S390_IO_ADAPTER_MAP
- perform a gmap translation for the guest address provided in addr,
- pin a userspace page for the translated address and add it to the
- list of mappings
-
- .. note:: A new mapping will be created unconditionally; therefore,
- the calling code should avoid making duplicate mappings.
-
+ This is now a no-op. The mapping is purely done by the irq route.
KVM_S390_IO_ADAPTER_UNMAP
- release a userspace page for the translated address specified in addr
- from the list of mappings
+ This is now a no-op. The mapping is purely done by the irq route.
KVM_DEV_FLIC_AISM
modify the adapter-interruption-suppression mode for a given isc if the
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 9963e680770a..716aa3edae14 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -25,8 +25,10 @@ Returns:
======= ========================================================
-EBUSY The PMU overflow interrupt is already set
- -ENXIO The overflow interrupt not set when attempting to get it
- -ENODEV PMUv3 not supported
+ -EFAULT Error reading interrupt number
+ -ENXIO PMUv3 not supported or the overflow interrupt not set
+ when attempting to get it
+ -ENODEV KVM_ARM_VCPU_PMU_V3 feature missing from VCPU
-EINVAL Invalid PMU overflow interrupt number supplied or
trying to set the IRQ number without using an in-kernel
irqchip.
@@ -45,9 +47,10 @@ all vcpus, while as an SPI it must be a separate number per vcpu.
Returns:
======= ======================================================
+ -EEXIST Interrupt number already used
-ENODEV PMUv3 not supported or GIC not initialized
- -ENXIO PMUv3 not properly configured or in-kernel irqchip not
- configured as required prior to calling this attribute
+ -ENXIO PMUv3 not supported, missing VCPU feature or interrupt
+ number not set
-EBUSY PMUv3 already initialized
======= ======================================================
@@ -55,11 +58,89 @@ Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
virtual GIC implementation, this must be done after initializing the in-kernel
irqchip.
+1.3 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FILTER
+-----------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for a PMU event filter is a
+ pointer to a struct kvm_pmu_event_filter
+
+:Returns:
+
+ ======= ======================================================
+ -ENODEV PMUv3 not supported or GIC not initialized
+ -ENXIO PMUv3 not properly configured or in-kernel irqchip not
+ configured as required prior to calling this attribute
+ -EBUSY PMUv3 already initialized or a VCPU has already run
+ -EINVAL Invalid filter range
+ ======= ======================================================
+
+Request the installation of a PMU event filter described as follows::
+
+ struct kvm_pmu_event_filter {
+ __u16 base_event;
+ __u16 nevents;
+
+ #define KVM_PMU_EVENT_ALLOW 0
+ #define KVM_PMU_EVENT_DENY 1
+
+ __u8 action;
+ __u8 pad[3];
+ };
+
+A filter range is defined as the range [@base_event, @base_event + @nevents),
+together with an @action (KVM_PMU_EVENT_ALLOW or KVM_PMU_EVENT_DENY). The
+first registered range defines the global policy (global ALLOW if the first
+@action is DENY, global DENY if the first @action is ALLOW). Multiple ranges
+can be programmed, and must fit within the event space defined by the PMU
+architecture (10 bits on ARMv8.0, 16 bits from ARMv8.1 onwards).
+
+Note: "Cancelling" a filter by registering the opposite action for the same
+range doesn't change the default action. For example, installing an ALLOW
+filter for event range [0:10) as the first filter and then applying a DENY
+action for the same range will leave the whole range as disabled.
+
+Restrictions: Event 0 (SW_INCR) is never filtered, as it doesn't count a
+hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
+isn't strictly speaking an event. Filtering the cycle counter is possible
+using event 0x11 (CPU_CYCLES).
+
+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
+------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address to an int representing the PMU
+ identifier.
+
+:Returns:
+
+ ======= ====================================================
+ -EBUSY PMUv3 already initialized, a VCPU has already run or
+ an event filter has already been set
+ -EFAULT Error accessing the PMU identifier
+ -ENXIO PMU not found
+ -ENODEV PMUv3 not supported or GIC not initialized
+ -ENOMEM Could not allocate memory
+ ======= ====================================================
+
+Request that the VCPU uses the specified hardware PMU when creating guest events
+for the purpose of PMU emulation. The PMU identifier can be read from the "type"
+file for the desired PMU instance under /sys/devices (or, equivalent,
+/sys/bus/even_source). This attribute is particularly useful on heterogeneous
+systems where there are at least two CPU PMUs on the system. The PMU that is set
+for one VCPU will be used by all the other VCPUs. It isn't possible to set a PMU
+if a PMU event filter is already present.
+
+Note that KVM will not make any attempts to run the VCPU on the physical CPUs
+associated with the PMU specified by this attribute. This is entirely left to
+userspace. However, attempting to run the VCPU on a physical CPU not supported
+by the PMU will fail and KVM_RUN will return with
+exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct by setting
+hardare_entry_failure_reason field to KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED and
+the cpu field to the processor id.
2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
=================================
-:Architectures: ARM, ARM64
+:Architectures: ARM64
2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER
-----------------------------------------------------------------------------
@@ -110,5 +191,75 @@ Returns:
Specifies the base address of the stolen time structure for this VCPU. The
base address must be 64 byte aligned and exist within a valid guest memory
-region. See Documentation/virt/kvm/arm/pvtime.txt for more information
+region. See Documentation/virt/kvm/arm/pvtime.rst for more information
including the layout of the stolen time structure.
+
+4. GROUP: KVM_VCPU_TSC_CTRL
+===========================
+
+:Architectures: x86
+
+4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
+
+:Parameters: 64-bit unsigned TSC offset
+
+Returns:
+
+ ======= ======================================
+ -EFAULT Error reading/writing the provided
+ parameter address.
+ -ENXIO Attribute not supported
+ ======= ======================================
+
+Specifies the guest's TSC offset relative to the host's TSC. The guest's
+TSC is then derived by the following equation:
+
+ guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+
+This attribute is useful to adjust the guest's TSC on live migration,
+so that the TSC counts the time during which the VM was paused. The
+following describes a possible algorithm to use for this purpose.
+
+From the source VMM process:
+
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
+ kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
+ (host_src).
+
+2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
+ guest TSC offset (ofs_src[i]).
+
+3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
+ guest's TSC (freq).
+
+From the destination VMM process:
+
+4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+ kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+ fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+ structure.
+
+ KVM will advance the VM's kvmclock to account for elapsed time since
+ recording the clock values. Note that this will cause problems in
+ the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
+ between the source and destination, and a reasonably short time passes
+ between the source pausing the VMs and the destination executing
+ steps 4-7.
+
+5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+ kvmclock nanoseconds (guest_dest).
+
+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+ elapsed since recording state and (2) difference in TSCs between the
+ source and destination machine:
+
+ ofs_dst[i] = ofs_src[i] -
+ (guest_src - guest_dest) * freq +
+ (tsc_src - tsc_dest)
+
+ ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
+ a time of 0 in kvmclock. The above formula ensures that it is the
+ same on the destination as it was on the source).
+
+7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+ respective value derived in the previous step.
diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
index 0aa5b1cfd700..60acc39e0e93 100644
--- a/Documentation/virt/kvm/devices/vm.rst
+++ b/Documentation/virt/kvm/devices/vm.rst
@@ -215,6 +215,7 @@ KVM_S390_VM_TOD_EXT).
:Parameters: address of a buffer in user space to store the data (u8) to
:Returns: -EFAULT if the given address is not accessible from kernel space;
-EINVAL if setting the TOD clock extension to != 0 is not supported
+ -EOPNOTSUPP for a PV guest (TOD managed by the ultravisor)
3.2. ATTRIBUTE: KVM_S390_VM_TOD_LOW
-----------------------------------
@@ -224,6 +225,7 @@ the POP (u64).
:Parameters: address of a buffer in user space to store the data (u64) to
:Returns: -EFAULT if the given address is not accessible from kernel space
+ -EOPNOTSUPP for a PV guest (TOD managed by the ultravisor)
3.3. ATTRIBUTE: KVM_S390_VM_TOD_EXT
-----------------------------------
@@ -237,6 +239,7 @@ it, it is stored as 0 and not allowed to be set to a value != 0.
(kvm_s390_vm_tod_clock) to
:Returns: -EFAULT if the given address is not accessible from kernel space;
-EINVAL if setting the TOD clock extension to != 0 is not supported
+ -EOPNOTSUPP for a PV guest (TOD managed by the ultravisor)
4. GROUP: KVM_S390_VM_CRYPTO
============================
diff --git a/Documentation/virt/kvm/devices/xics.rst b/Documentation/virt/kvm/devices/xics.rst
index 2d6927e0b776..bf32c77174ab 100644
--- a/Documentation/virt/kvm/devices/xics.rst
+++ b/Documentation/virt/kvm/devices/xics.rst
@@ -22,7 +22,7 @@ Groups:
Errors:
======= ==========================================
- -EINVAL Value greater than KVM_MAX_VCPU_ID.
+ -EINVAL Value greater than KVM_MAX_VCPU_IDS.
-EFAULT Invalid user pointer for attr->addr.
-EBUSY A vcpu is already connected to the device.
======= ==========================================
diff --git a/Documentation/virt/kvm/devices/xive.rst b/Documentation/virt/kvm/devices/xive.rst
index 8bdf3dc38f01..8b5e7b40bdf8 100644
--- a/Documentation/virt/kvm/devices/xive.rst
+++ b/Documentation/virt/kvm/devices/xive.rst
@@ -91,7 +91,7 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
Errors:
======= ==========================================
- -EINVAL Value greater than KVM_MAX_VCPU_ID.
+ -EINVAL Value greater than KVM_MAX_VCPU_IDS.
-EFAULT Invalid user pointer for attr->addr.
-EBUSY A vCPU is already connected to the device.
======= ==========================================
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index 774deaebf7fa..e0a2c74e1043 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -8,21 +8,13 @@ KVM
:maxdepth: 2
api
- amd-memory-encryption
- cpuid
- halt-polling
- hypercalls
- locking
- mmu
- msr
- nested-vmx
- ppc-pv
- s390-diag
- timekeeping
- vcpu-requests
-
- review-checklist
+ devices/index
arm/index
+ s390/index
+ ppc-pv
+ x86/index
- devices/index
+ locking
+ vcpu-requests
+ review-checklist
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index c02291beac3f..845a561629f1 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,7 +16,25 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
them together is quite rare.
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
+ synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock
+ can be taken inside a kvm->srcu read-side critical section,
+ while kvm->slots_lock cannot.
+
+- kvm->mn_active_invalidate_count ensures that pairs of
+ invalidate_range_start() and invalidate_range_end() callbacks
+ use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
+ are taken on the waiting side in install_new_memslots, so MMU notifiers
+ must not take either kvm->slots_lock or kvm->slots_arch_lock.
+
+On x86:
+
+- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+
+- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
+ kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
+ cannot be taken without already holding kvm->arch.mmu_lock (typically with
+ ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
Everything else is a leaf: no other lock is taken inside the critical
sections.
@@ -31,25 +49,24 @@ the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:
1. Access Tracking: The SPTE is not present, but it is marked for access
- tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
- restore the saved R/X bits. This is described in more detail later below.
+ tracking. That means we need to restore the saved R/X bits. This is
+ described in more detail later below.
-2. Write-Protection: The SPTE is present and the fault is
- caused by write-protect. That means we just need to change the W bit of
- the spte.
+2. Write-Protection: The SPTE is present and the fault is caused by
+ write-protect. That means we just need to change the W bit of the spte.
-What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
-SPTE_MMU_WRITEABLE bit on the spte:
+What we use to avoid all the race is the Host-writable bit and MMU-writable bit
+on the spte:
-- SPTE_HOST_WRITEABLE means the gfn is writable on host.
-- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
- the gfn is writable on guest mmu and it is not write-protected by shadow
- page write-protection.
+- Host-writable means the gfn is writable in the host kernel page tables and in
+ its KVM memslot.
+- MMU-writable means the gfn is writable in the guest's mmu and it is not
+ write-protected by shadow page write-protection.
On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
-restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
-is safe because whenever changing these bits can be detected by cmpxchg.
+bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
+R/X bits if for an access-traced spte, or both. This is safe because whenever
+changing these bits can be detected by cmpxchg.
But we need carefully check these cases:
@@ -96,19 +113,18 @@ will happen:
We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
For direct sp, we can easily avoid it since the spte of direct sp is fixed
-to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
-to pin gfn to pfn, because after gfn_to_pfn_atomic():
+to gfn. For indirect sp, we disabled fast page fault for simplicity.
+
+A solution for indirect sp could be to pin the gfn, for example via
+kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
- We have held the refcount of pfn that means the pfn can not be freed and
be reused for another gfn.
-- The pfn is writable that means it can not be shared between different gfns
+- The pfn is writable and therefore it cannot be shared between different gfns
by KSM.
Then, we can ensure the dirty bitmaps is correctly set for a gfn.
-Currently, to simplify the whole things, we disable fast page fault for
-indirect shadow page.
-
2) Dirty bit tracking
In the origin code, the spte can be fast updated (non-atomically) if the
@@ -179,47 +195,62 @@ See the comments in spte_has_volatile_bits() and mmu_spte_update().
Lockless Access Tracking:
This is used for Intel CPUs that are using EPT but do not support the EPT A/D
-bits. In this case, when the KVM MMU notifier is called to track accesses to a
-page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
-by clearing the RWX bits in the PTE and storing the original R & X bits in
-some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
-PTE (using the ignored bit 62). When the VM tries to access the page later on,
-a fault is generated and the fast page fault mechanism described above is used
-to atomically restore the PTE to a Present state. The W bit is not saved when
-the PTE is marked for access tracking and during restoration to the Present
-state, the W bit is set depending on whether or not it was a write access. If
-it wasn't, then the W bit will remain clear until a write access happens, at
-which time it will be set using the Dirty tracking mechanism described above.
+bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
+when the KVM MMU notifier is called to track accesses to a page (via
+kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
+by clearing the RWX bits in the PTE and storing the original R & X bits in more
+unused/ignored bits. When the VM tries to access the page later on, a fault is
+generated and the fast page fault mechanism described above is used to
+atomically restore the PTE to a Present state. The W bit is not saved when the
+PTE is marked for access tracking and during restoration to the Present state,
+the W bit is set depending on whether or not it was a write access. If it
+wasn't, then the W bit will remain clear until a write access happens, at which
+time it will be set using the Dirty tracking mechanism described above.
3. Reference
------------
-:Name: kvm_lock
+``kvm_lock``
+^^^^^^^^^^^^
+
:Type: mutex
:Arch: any
:Protects: - vm_list
-:Name: kvm_count_lock
+``kvm_count_lock``
+^^^^^^^^^^^^^^^^^^
+
:Type: raw_spinlock_t
:Arch: any
:Protects: - hardware virtualization enable/disable
:Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
migration.
-:Name: kvm_arch::tsc_write_lock
-:Type: raw_spinlock
+``kvm->mn_invalidate_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: spinlock_t
+:Arch: any
+:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait
+
+``kvm_arch::tsc_write_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: raw_spinlock_t
:Arch: x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
- tsc offset in vmcb
:Comment: 'raw' because updating the tsc offsets must not be preempted.
-:Name: kvm->mmu_lock
-:Type: spinlock_t
+``kvm->mmu_lock``
+^^^^^^^^^^^^^^^^^
+:Type: spinlock_t or rwlock_t
:Arch: any
:Protects: -shadow page/shadow tlb entry
:Comment: it is a spinlock since it is used in mmu notifier.
-:Name: kvm->srcu
+``kvm->srcu``
+^^^^^^^^^^^^^
:Type: srcu lock
:Arch: any
:Protects: - kvm->memslots
@@ -230,10 +261,20 @@ which time it will be set using the Dirty tracking mechanism described above.
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
if it is needed by multiple functions.
-:Name: blocked_vcpu_on_cpu_lock
+``kvm->slots_arch_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^
+:Type: mutex
+:Arch: any (only needed on x86 though)
+:Protects: any arch-specific fields of memslots that have to be modified
+ in a ``kvm->srcu`` read-side critical section.
+:Comment: must be held before reading the pointer to the current memslots,
+ until after all changes to the memslots are complete
+
+``wakeup_vcpus_on_cpu_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type: spinlock_t
:Arch: x86
-:Protects: blocked_vcpu_on_cpu
+:Protects: wakeup_vcpus_on_cpu
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
When VT-d posted-interrupts is supported and the VM has assigned
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
diff --git a/Documentation/virt/kvm/review-checklist.rst b/Documentation/virt/kvm/review-checklist.rst
index 1f86a9d3f705..dc01aea4057b 100644
--- a/Documentation/virt/kvm/review-checklist.rst
+++ b/Documentation/virt/kvm/review-checklist.rst
@@ -10,7 +10,7 @@ Review checklist for kvm patches
2. Patches should be against kvm.git master branch.
3. If the patch introduces or modifies a new userspace API:
- - the API must be documented in Documentation/virt/kvm/api.txt
+ - the API must be documented in Documentation/virt/kvm/api.rst
- the API must be discoverable using KVM_CHECK_EXTENSION
4. New state must include support for save/restore.
diff --git a/Documentation/virt/kvm/s390/index.rst b/Documentation/virt/kvm/s390/index.rst
new file mode 100644
index 000000000000..44ec9ab14b59
--- /dev/null
+++ b/Documentation/virt/kvm/s390/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+KVM for s390 systems
+====================
+
+.. toctree::
+ :maxdepth: 2
+
+ s390-diag
+ s390-pv
+ s390-pv-boot
+ s390-pv-dump
diff --git a/Documentation/virt/kvm/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst
index eaac4864d3d6..ca85f030eb0b 100644
--- a/Documentation/virt/kvm/s390-diag.rst
+++ b/Documentation/virt/kvm/s390/s390-diag.rst
@@ -84,3 +84,36 @@ If the function code specifies 0x501, breakpoint functions may be performed.
This function code is handled by userspace.
This diagnose function code has no subfunctions and uses no parameters.
+
+
+DIAGNOSE function code 'X'9C - Voluntary Time Slice Yield
+---------------------------------------------------------
+
+General register 1 contains the target CPU address.
+
+In a guest of a hypervisor like LPAR, KVM or z/VM using shared host CPUs,
+DIAGNOSE with function code 0x9c may improve system performance by
+yielding the host CPU on which the guest CPU is running to be assigned
+to another guest CPU, preferably the logical CPU containing the specified
+target CPU.
+
+
+DIAG 'X'9C forwarding
++++++++++++++++++++++
+
+The guest may send a DIAGNOSE 0x9c in order to yield to a certain
+other vcpu. An example is a Linux guest that tries to yield to the vcpu
+that is currently holding a spinlock, but not running.
+
+However, on the host the real cpu backing the vcpu may itself not be
+running.
+Forwarding the DIAGNOSE 0x9c initially sent by the guest to yield to
+the backing cpu will hopefully cause that cpu, and thus subsequently
+the guest's vcpu, to be scheduled.
+
+
+diag9c_forwarding_hz
+ KVM kernel parameter allowing to specify the maximum number of DIAGNOSE
+ 0x9c forwarding per second in the purpose of avoiding a DIAGNOSE 0x9c
+ forwarding storm.
+ A value of 0 turns the forwarding off.
diff --git a/Documentation/virt/kvm/s390/s390-pv-boot.rst b/Documentation/virt/kvm/s390/s390-pv-boot.rst
new file mode 100644
index 000000000000..96c48480a360
--- /dev/null
+++ b/Documentation/virt/kvm/s390/s390-pv-boot.rst
@@ -0,0 +1,84 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+s390 (IBM Z) Boot/IPL of Protected VMs
+======================================
+
+Summary
+-------
+The memory of Protected Virtual Machines (PVMs) is not accessible to
+I/O or the hypervisor. In those cases where the hypervisor needs to
+access the memory of a PVM, that memory must be made accessible.
+Memory made accessible to the hypervisor will be encrypted. See
+Documentation/virt/kvm/s390/s390-pv.rst for details."
+
+On IPL (boot) a small plaintext bootloader is started, which provides
+information about the encrypted components and necessary metadata to
+KVM to decrypt the protected virtual machine.
+
+Based on this data, KVM will make the protected virtual machine known
+to the Ultravisor (UV) and instruct it to secure the memory of the
+PVM, decrypt the components and verify the data and address list
+hashes, to ensure integrity. Afterwards KVM can run the PVM via the
+SIE instruction which the UV will intercept and execute on KVM's
+behalf.
+
+As the guest image is just like an opaque kernel image that does the
+switch into PV mode itself, the user can load encrypted guest
+executables and data via every available method (network, dasd, scsi,
+direct kernel, ...) without the need to change the boot process.
+
+
+Diag308
+-------
+This diagnose instruction is the basic mechanism to handle IPL and
+related operations for virtual machines. The VM can set and retrieve
+IPL information blocks, that specify the IPL method/devices and
+request VM memory and subsystem resets, as well as IPLs.
+
+For PVMs this concept has been extended with new subcodes:
+
+Subcode 8: Set an IPL Information Block of type 5 (information block
+for PVMs)
+Subcode 9: Store the saved block in guest memory
+Subcode 10: Move into Protected Virtualization mode
+
+The new PV load-device-specific-parameters field specifies all data
+that is necessary to move into PV mode.
+
+* PV Header origin
+* PV Header length
+* List of Components composed of
+ * AES-XTS Tweak prefix
+ * Origin
+ * Size
+
+The PV header contains the keys and hashes, which the UV will use to
+decrypt and verify the PV, as well as control flags and a start PSW.
+
+The components are for instance an encrypted kernel, kernel parameters
+and initrd. The components are decrypted by the UV.
+
+After the initial import of the encrypted data, all defined pages will
+contain the guest content. All non-specified pages will start out as
+zero pages on first access.
+
+
+When running in protected virtualization mode, some subcodes will result in
+exceptions or return error codes.
+
+Subcodes 4 and 7, which specify operations that do not clear the guest
+memory, will result in specification exceptions. This is because the
+UV will clear all memory when a secure VM is removed, and therefore
+non-clearing IPL subcodes are not allowed.
+
+Subcodes 8, 9, 10 will result in specification exceptions.
+Re-IPL into a protected mode is only possible via a detour into non
+protected mode.
+
+Keys
+----
+Every CEC will have a unique public key to enable tooling to build
+encrypted images.
+See `s390-tools <https://github.com/ibm-s390-linux/s390-tools/>`_
+for the tooling.
diff --git a/Documentation/virt/kvm/s390/s390-pv-dump.rst b/Documentation/virt/kvm/s390/s390-pv-dump.rst
new file mode 100644
index 000000000000..e542f06048f3
--- /dev/null
+++ b/Documentation/virt/kvm/s390/s390-pv-dump.rst
@@ -0,0 +1,64 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+s390 (IBM Z) Protected Virtualization dumps
+===========================================
+
+Summary
+-------
+
+Dumping a VM is an essential tool for debugging problems inside
+it. This is especially true when a protected VM runs into trouble as
+there's no way to access its memory and registers from the outside
+while it's running.
+
+However when dumping a protected VM we need to maintain its
+confidentiality until the dump is in the hands of the VM owner who
+should be the only one capable of analysing it.
+
+The confidentiality of the VM dump is ensured by the Ultravisor who
+provides an interface to KVM over which encrypted CPU and memory data
+can be requested. The encryption is based on the Customer
+Communication Key which is the key that's used to encrypt VM data in a
+way that the customer is able to decrypt.
+
+
+Dump process
+------------
+
+A dump is done in 3 steps:
+
+**Initiation**
+
+This step initializes the dump process, generates cryptographic seeds
+and extracts dump keys with which the VM dump data will be encrypted.
+
+**Data gathering**
+
+Currently there are two types of data that can be gathered from a VM:
+the memory and the vcpu state.
+
+The vcpu state contains all the important registers, general, floating
+point, vector, control and tod/timers of a vcpu. The vcpu dump can
+contain incomplete data if a vcpu is dumped while an instruction is
+emulated with help of the hypervisor. This is indicated by a flag bit
+in the dump data. For the same reason it is very important to not only
+write out the encrypted vcpu state, but also the unencrypted state
+from the hypervisor.
+
+The memory state is further divided into the encrypted memory and its
+metadata comprised of the encryption tweaks and status flags. The
+encrypted memory can simply be read once it has been exported. The
+time of the export does not matter as no re-encryption is
+needed. Memory that has been swapped out and hence was exported can be
+read from the swap and written to the dump target without need for any
+special actions.
+
+The tweaks / status flags for the exported pages need to be requested
+from the Ultravisor.
+
+**Finalization**
+
+The finalization step will provide the data needed to be able to
+decrypt the vcpu and memory data and end the dump process. When this
+step completes successfully a new dump initiation can be started.
diff --git a/Documentation/virt/kvm/s390/s390-pv.rst b/Documentation/virt/kvm/s390/s390-pv.rst
new file mode 100644
index 000000000000..8e41a3b63fa5
--- /dev/null
+++ b/Documentation/virt/kvm/s390/s390-pv.rst
@@ -0,0 +1,116 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+s390 (IBM Z) Ultravisor and Protected VMs
+=========================================
+
+Summary
+-------
+Protected virtual machines (PVM) are KVM VMs that do not allow KVM to
+access VM state like guest memory or guest registers. Instead, the
+PVMs are mostly managed by a new entity called Ultravisor (UV). The UV
+provides an API that can be used by PVMs and KVM to request management
+actions.
+
+Each guest starts in non-protected mode and then may make a request to
+transition into protected mode. On transition, KVM registers the guest
+and its VCPUs with the Ultravisor and prepares everything for running
+it.
+
+The Ultravisor will secure and decrypt the guest's boot memory
+(i.e. kernel/initrd). It will safeguard state changes like VCPU
+starts/stops and injected interrupts while the guest is running.
+
+As access to the guest's state, such as the SIE state description, is
+normally needed to be able to run a VM, some changes have been made in
+the behavior of the SIE instruction. A new format 4 state description
+has been introduced, where some fields have different meanings for a
+PVM. SIE exits are minimized as much as possible to improve speed and
+reduce exposed guest state.
+
+
+Interrupt injection
+-------------------
+Interrupt injection is safeguarded by the Ultravisor. As KVM doesn't
+have access to the VCPUs' lowcores, injection is handled via the
+format 4 state description.
+
+Machine check, external, IO and restart interruptions each can be
+injected on SIE entry via a bit in the interrupt injection control
+field (offset 0x54). If the guest cpu is not enabled for the interrupt
+at the time of injection, a validity interception is recognized. The
+format 4 state description contains fields in the interception data
+block where data associated with the interrupt can be transported.
+
+Program and Service Call exceptions have another layer of
+safeguarding; they can only be injected for instructions that have
+been intercepted into KVM. The exceptions need to be a valid outcome
+of an instruction emulation by KVM, e.g. we can never inject a
+addressing exception as they are reported by SIE since KVM has no
+access to the guest memory.
+
+
+Mask notification interceptions
+-------------------------------
+KVM cannot intercept lctl(g) and lpsw(e) anymore in order to be
+notified when a PVM enables a certain class of interrupt. As a
+replacement, two new interception codes have been introduced: One
+indicating that the contents of CRs 0, 6, or 14 have been changed,
+indicating different interruption subclasses; and one indicating that
+PSW bit 13 has been changed, indicating that a machine check
+intervention was requested and those are now enabled.
+
+Instruction emulation
+---------------------
+With the format 4 state description for PVMs, the SIE instruction already
+interprets more instructions than it does with format 2. It is not able
+to interpret every instruction, but needs to hand some tasks to KVM;
+therefore, the SIE and the ultravisor safeguard emulation inputs and outputs.
+
+The control structures associated with SIE provide the Secure
+Instruction Data Area (SIDA), the Interception Parameters (IP) and the
+Secure Interception General Register Save Area. Guest GRs and most of
+the instruction data, such as I/O data structures, are filtered.
+Instruction data is copied to and from the SIDA when needed. Guest
+GRs are put into / retrieved from the Secure Interception General
+Register Save Area.
+
+Only GR values needed to emulate an instruction will be copied into this
+save area and the real register numbers will be hidden.
+
+The Interception Parameters state description field still contains
+the bytes of the instruction text, but with pre-set register values
+instead of the actual ones. I.e. each instruction always uses the same
+instruction text, in order not to leak guest instruction text.
+This also implies that the register content that a guest had in r<n>
+may be in r<m> from the hypervisor's point of view.
+
+The Secure Instruction Data Area contains instruction storage
+data. Instruction data, i.e. data being referenced by an instruction
+like the SCCB for sclp, is moved via the SIDA. When an instruction is
+intercepted, the SIE will only allow data and program interrupts for
+this instruction to be moved to the guest via the two data areas
+discussed before. Other data is either ignored or results in validity
+interceptions.
+
+
+Instruction emulation interceptions
+-----------------------------------
+There are two types of SIE secure instruction intercepts: the normal
+and the notification type. Normal secure instruction intercepts will
+make the guest pending for instruction completion of the intercepted
+instruction type, i.e. on SIE entry it is attempted to complete
+emulation of the instruction with the data provided by KVM. That might
+be a program exception or instruction completion.
+
+The notification type intercepts inform KVM about guest environment
+changes due to guest instruction interpretation. Such an interception
+is recognized, for example, for the store prefix instruction to provide
+the new lowcore location. On SIE reentry, any KVM data in the data areas
+is ignored and execution continues as if the guest instruction had
+completed. For that reason KVM is not allowed to inject a program
+interrupt.
+
+Links
+-----
+`KVM Forum 2019 presentation <https://static.sched.com/hosted_files/kvmforum2019/3b/ibm_protected_vms_s390x.pdf>`_
diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index 5feb3706a7ae..87f04c1fa53d 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
=================
KVM VCPU Requests
=================
@@ -95,7 +97,7 @@ VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_ could
also be used, e.g. ::
- clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
+ clear_bit(KVM_REQ_UNBLOCK & KVM_REQUEST_MASK, &vcpu->requests);
However, VCPU request users should refrain from doing so, as it would
break the abstraction. The first 8 bits are reserved for architecture
@@ -112,27 +114,27 @@ KVM_REQ_TLB_FLUSH
choose to use the common kvm_flush_remote_tlbs() implementation will
need to handle this VCPU request.
-KVM_REQ_MMU_RELOAD
+KVM_REQ_VM_DEAD
- When shadow page tables are used and memory slots are removed it's
- necessary to inform each VCPU to completely refresh the tables. This
- request is used for that.
+ This request informs all VCPUs that the VM is dead and unusable, e.g. due to
+ fatal error or because the VM's state has been intentionally destroyed.
-KVM_REQ_PENDING_TIMER
+KVM_REQ_UNBLOCK
- This request may be made from a timer handler run on the host on behalf
- of a VCPU. It informs the VCPU thread to inject a timer interrupt.
+ This request informs the vCPU to exit kvm_vcpu_block. It is used for
+ example from timer handlers that run on the host on behalf of a vCPU,
+ or in order to update the interrupt routing and ensure that assigned
+ devices will wake up the vCPU.
-KVM_REQ_UNHALT
+KVM_REQ_OUTSIDE_GUEST_MODE
- This request may be made from the KVM common function kvm_vcpu_block(),
- which is used to emulate an instruction that causes a CPU to halt until
- one of an architectural specific set of events and/or interrupts is
- received (determined by checking kvm_arch_vcpu_runnable()). When that
- event or interrupt arrives kvm_vcpu_block() makes the request. This is
- in contrast to when kvm_vcpu_block() returns due to any other reason,
- such as a pending signal, which does not indicate the VCPU's halt
- emulation should stop, and therefore does not make the request.
+ This "request" ensures the target vCPU has exited guest mode prior to the
+ sender of the request continuing on. No action needs be taken by the target,
+ and so no request is actually logged for the target. This request is similar
+ to a "kick", but unlike a kick it guarantees the vCPU has actually exited
+ guest mode. A kick only guarantees the vCPU will exit at some point in the
+ future, e.g. a previous kick may have started the process, but there's no
+ guarantee the to-be-kicked vCPU has fully exited guest mode.
KVM_REQUEST_MASK
----------------
@@ -284,24 +286,9 @@ architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken. One reason to do so is to provide
architectures a function where requests may be checked if necessary.
-Clearing Requests
------------------
-
-Generally it only makes sense for the receiving VCPU thread to clear a
-request. However, in some circumstances, such as when the requesting
-thread and the receiving VCPU thread are executed serially, such as when
-they are the same thread, or when they are using some form of concurrency
-control to temporarily execute synchronously, then it's possible to know
-that the request may be cleared immediately, rather than waiting for the
-receiving VCPU thread to handle the request in VCPU RUN. The only current
-examples of this are kvm_vcpu_block() calls made by VCPUs to block
-themselves. A possible side-effect of that call is to make the
-KVM_REQ_UNHALT request, which may then be cleared immediately when the
-VCPU returns from the call.
-
References
==========
-.. [atomic-ops] Documentation/core-api/atomic_ops.rst
+.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/
diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index d18c97b4e140..935aaeb97fe6 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
======================================
Secure Encrypted Virtualization (SEV)
======================================
@@ -22,7 +24,7 @@ to SEV::
[ecx]:
Bits[31:0] Number of encrypted guests supported simultaneously
-If support for SEV is present, MSR 0xc001_0010 (MSR_K8_SYSCFG) and MSR 0xc001_0015
+If support for SEV is present, MSR 0xc001_0010 (MSR_AMD64_SYSCFG) and MSR 0xc001_0015
(MSR_K7_HWCR) can be used to determine if it can be enabled::
0xc001_0010:
@@ -53,6 +55,29 @@ key management interface to perform common hypervisor activities such as
encrypting bootstrap code, snapshot, migrating and debugging the guest. For more
information, see the SEV Key Management spec [api-spec]_
+The main ioctl to access SEV is KVM_MEMORY_ENCRYPT_OP. If the argument
+to KVM_MEMORY_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled
+and ``ENOTTY` if it is disabled (on some older versions of Linux,
+the ioctl runs normally even with a NULL argument, and therefore will
+likely return ``EFAULT``). If non-NULL, the argument to KVM_MEMORY_ENCRYPT_OP
+must be a struct kvm_sev_cmd::
+
+ struct kvm_sev_cmd {
+ __u32 id;
+ __u64 data;
+ __u32 error;
+ __u32 sev_fd;
+ };
+
+
+The ``id`` field contains the subcommand, and the ``data`` field points to
+another struct containing arguments specific to command. The ``sev_fd``
+should point to a file descriptor that is opened on the ``/dev/sev``
+device, if needed (see individual commands).
+
+On output, ``error`` is zero on success, or an error code. Error codes
+are defined in ``<linux/psp-dev.h>``.
+
KVM implements the following commands to support common lifecycle events of SEV
guests, such as launching, running, snapshotting, migrating and decommissioning.
@@ -62,6 +87,11 @@ guests, such as launching, running, snapshotting, migrating and decommissioning.
The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform
context. In a typical workflow, this command should be the first command issued.
+The firmware can be initialized either by using its own non-volatile storage or
+the OS can manage the NV storage for the firmware using the module parameter
+``init_ex_path``. If the file specified by ``init_ex_path`` does not exist or
+is invalid, the OS will create or override the file with output from PSP.
+
Returns: 0 on success, -negative on error
2. KVM_SEV_LAUNCH_START
@@ -90,6 +120,8 @@ Returns: 0 on success, -negative on error
On success, the 'handle' field contains a new handle and on error, a negative value.
+KVM_SEV_LAUNCH_START requires the ``sev_fd`` field to be valid.
+
For more details, see SEV spec Section 6.2.
3. KVM_SEV_LAUNCH_UPDATE_DATA
@@ -123,6 +155,9 @@ measurement. Since the guest owner knows the initial contents of the guest at
boot, the measurement can be verified by comparing it to what the guest owner
expects.
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
Parameters (in): struct kvm_sev_launch_measure
Returns: 0 on success, -negative on error
@@ -238,6 +273,167 @@ Returns: 0 on success, -negative on error
__u32 trans_len;
};
+10. KVM_SEV_GET_ATTESTATION_REPORT
+----------------------------------
+
+The KVM_SEV_GET_ATTESTATION_REPORT command can be used by the hypervisor to query the attestation
+report containing the SHA-256 digest of the guest memory and VMSA passed through the KVM_SEV_LAUNCH
+commands and signed with the PEK. The digest returned by the command should match the digest
+used by the guest owner with the KVM_SEV_LAUNCH_MEASURE.
+
+If len is zero on entry, the measurement blob length is written to len and
+uaddr is unused.
+
+Parameters (in): struct kvm_sev_attestation
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_attestation_report {
+ __u8 mnonce[16]; /* A random mnonce that will be placed in the report */
+
+ __u64 uaddr; /* userspace address where the report should be copied */
+ __u32 len;
+ };
+
+11. KVM_SEV_SEND_START
+----------------------
+
+The KVM_SEV_SEND_START command can be used by the hypervisor to create an
+outgoing guest encryption context.
+
+If session_len is zero on entry, the length of the guest session information is
+written to session_len and all other fields are not used.
+
+Parameters (in): struct kvm_sev_send_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_send_start {
+ __u32 policy; /* guest policy */
+
+ __u64 pdh_cert_uaddr; /* platform Diffie-Hellman certificate */
+ __u32 pdh_cert_len;
+
+ __u64 plat_certs_uaddr; /* platform certificate chain */
+ __u32 plat_certs_len;
+
+ __u64 amd_certs_uaddr; /* AMD certificate */
+ __u32 amd_certs_len;
+
+ __u64 session_uaddr; /* Guest session information */
+ __u32 session_len;
+ };
+
+12. KVM_SEV_SEND_UPDATE_DATA
+----------------------------
+
+The KVM_SEV_SEND_UPDATE_DATA command can be used by the hypervisor to encrypt the
+outgoing guest memory region with the encryption context creating using
+KVM_SEV_SEND_START.
+
+If hdr_len or trans_len are zero on entry, the length of the packet header and
+transport region are written to hdr_len and trans_len respectively, and all
+other fields are not used.
+
+Parameters (in): struct kvm_sev_send_update_data
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_send_update_data {
+ __u64 hdr_uaddr; /* userspace address containing the packet header */
+ __u32 hdr_len;
+
+ __u64 guest_uaddr; /* the source memory region to be encrypted */
+ __u32 guest_len;
+
+ __u64 trans_uaddr; /* the destination memory region */
+ __u32 trans_len;
+ };
+
+13. KVM_SEV_SEND_FINISH
+------------------------
+
+After completion of the migration flow, the KVM_SEV_SEND_FINISH command can be
+issued by the hypervisor to delete the encryption context.
+
+Returns: 0 on success, -negative on error
+
+14. KVM_SEV_SEND_CANCEL
+------------------------
+
+After completion of SEND_START, but before SEND_FINISH, the source VMM can issue the
+SEND_CANCEL command to stop a migration. This is necessary so that a cancelled
+migration can restart with a new target later.
+
+Returns: 0 on success, -negative on error
+
+15. KVM_SEV_RECEIVE_START
+-------------------------
+
+The KVM_SEV_RECEIVE_START command is used for creating the memory encryption
+context for an incoming SEV guest. To create the encryption context, the user must
+provide a guest policy, the platform public Diffie-Hellman (PDH) key and session
+information.
+
+Parameters: struct kvm_sev_receive_start (in/out)
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_receive_start {
+ __u32 handle; /* if zero then firmware creates a new handle */
+ __u32 policy; /* guest's policy */
+
+ __u64 pdh_uaddr; /* userspace address pointing to the PDH key */
+ __u32 pdh_len;
+
+ __u64 session_uaddr; /* userspace address which points to the guest session information */
+ __u32 session_len;
+ };
+
+On success, the 'handle' field contains a new handle and on error, a negative value.
+
+For more details, see SEV spec Section 6.12.
+
+16. KVM_SEV_RECEIVE_UPDATE_DATA
+-------------------------------
+
+The KVM_SEV_RECEIVE_UPDATE_DATA command can be used by the hypervisor to copy
+the incoming buffers into the guest memory region with encryption context
+created during the KVM_SEV_RECEIVE_START.
+
+Parameters (in): struct kvm_sev_receive_update_data
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_launch_receive_update_data {
+ __u64 hdr_uaddr; /* userspace address containing the packet header */
+ __u32 hdr_len;
+
+ __u64 guest_uaddr; /* the destination guest memory region */
+ __u32 guest_len;
+
+ __u64 trans_uaddr; /* the incoming buffer memory region */
+ __u32 trans_len;
+ };
+
+17. KVM_SEV_RECEIVE_FINISH
+--------------------------
+
+After completion of the migration flow, the KVM_SEV_RECEIVE_FINISH command can be
+issued by the hypervisor to make the guest ready for execution.
+
+Returns: 0 on success, -negative on error
+
References
==========
@@ -245,6 +441,6 @@ References
See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
.. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
-.. [api-spec] http://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
-.. [amd-apm] http://support.amd.com/TechDocs/24593.pdf (section 15.34)
-.. [kvm-forum] http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
+.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
+.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
new file mode 100644
index 000000000000..bda3e3e737d7
--- /dev/null
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -0,0 +1,124 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+KVM CPUID bits
+==============
+
+:Author: Glauber Costa <glommer@gmail.com>
+
+A guest running on a kvm host, can check some of its features using
+cpuid. This is not always guaranteed to work, since userspace can
+mask-out some, or even all KVM-related cpuid features before launching
+a guest.
+
+KVM cpuid functions are:
+
+function: KVM_CPUID_SIGNATURE (0x40000000)
+
+returns::
+
+ eax = 0x40000001
+ ebx = 0x4b4d564b
+ ecx = 0x564b4d56
+ edx = 0x4d
+
+Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
+The value in eax corresponds to the maximum cpuid function present in this leaf,
+and will be updated if more functions are added in the future.
+Note also that old hosts set eax value to 0x0. This should
+be interpreted as if the value was 0x40000001.
+This function queries the presence of KVM cpuid leafs.
+
+function: define KVM_CPUID_FEATURES (0x40000001)
+
+returns::
+
+ ebx, ecx
+ eax = an OR'ed group of (1 << flag)
+
+where ``flag`` is defined as below:
+
+================================== =========== ================================
+flag value meaning
+================================== =========== ================================
+KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
+ 0x11 and 0x12
+
+KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
+ on PIO operations
+
+KVM_FEATURE_MMU_OP 2 deprecated
+
+KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
+ 0x4b564d00 and 0x4b564d01
+
+KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
+ writing to msr 0x4b564d02
+
+KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by
+ writing to msr 0x4b564d03
+
+KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
+ handler can be enabled by
+ writing to msr 0x4b564d04
+
+KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit
+ before enabling paravirtualized
+ spinlock support
+
+KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit
+ before enabling paravirtualized
+ tlb flush
+
+KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT
+ can be enabled by setting bit 2
+ when writing to msr 0x4b564d02
+
+KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
+ before enabling paravirtualized
+ send IPIs
+
+KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can
+ be disabled by writing
+ to msr 0x4b564d05.
+
+KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
+ before using paravirtualized
+ sched yield.
+
+KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
+ before using the second async
+ pf control msr 0x4b564d06 and
+ async pf acknowledgment msr
+ 0x4b564d07.
+
+KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
+ before using extended destination
+ ID bits in MSI address bits 11-5.
+
+KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before
+ using the map gpa range hypercall
+ to notify the page state change
+
+KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before
+ using MSR_KVM_MIGRATION_CONTROL
+
+KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
+ per-cpu warps are expected in
+ kvmclock
+================================== =========== ================================
+
+::
+
+ edx = an OR'ed group of (1 << flag)
+
+Where ``flag`` here is defined as below:
+
+================== ============ =================================
+flag value meaning
+================== ============ =================================
+KVM_HINTS_REALTIME 0 guest checks this feature bit to
+ determine that vCPUs are never
+ preempted for an unlimited time
+ allowing optimizations
+================== ============ =================================
diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
new file mode 100644
index 000000000000..410e0aa63493
--- /dev/null
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -0,0 +1,39 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+Known limitations of CPU virtualization
+=======================================
+
+Whenever perfect emulation of a CPU feature is impossible or too hard, KVM
+has to choose between not implementing the feature at all or introducing
+behavioral differences between virtual machines and bare metal systems.
+
+This file documents some of the known limitations that KVM has in
+virtualizing CPU features.
+
+x86
+===
+
+``KVM_GET_SUPPORTED_CPUID`` issues
+----------------------------------
+
+x87 features
+~~~~~~~~~~~~
+
+Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]
+(FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX]13] (ZERO_FCS_FDS) are
+clear if the features are present and set if the features are not present.
+
+Clearing these bits in CPUID has no effect on the operation of the guest;
+if these bits are set on hardware, the features will not be present on
+any virtual machine that runs on that hardware.
+
+**Workaround:** It is recommended to always set these bits in guest CPUID.
+Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
+to be present likely predates these CPUID feature bits, and therefore
+doesn't know to check for them anyway.
+
+Nested virtualization features
+------------------------------
+
+TBD
diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/x86/halt-polling.rst
index 4922e4a15f18..4922e4a15f18 100644
--- a/Documentation/virt/kvm/halt-polling.rst
+++ b/Documentation/virt/kvm/x86/halt-polling.rst
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index dbaf207e560d..10db7924720f 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -22,7 +22,7 @@ S390:
number in R1.
For further information on the S390 diagnose call as supported by KVM,
- refer to Documentation/virt/kvm/s390-diag.txt.
+ refer to Documentation/virt/kvm/s390/s390-diag.rst.
PowerPC:
It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
@@ -30,7 +30,7 @@ PowerPC:
KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions'
property inside the device tree's /hypervisor node.
- For more information refer to Documentation/virt/kvm/ppc-pv.txt
+ For more information refer to Documentation/virt/kvm/ppc-pv.rst
MIPS:
KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall
@@ -169,3 +169,24 @@ a0: destination APIC ID
:Usage example: When sending a call-function IPI-many to vCPUs, yield if
any of the IPI target vCPUs was preempted.
+
+8. KVM_HC_MAP_GPA_RANGE
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Request KVM to map a GPA range with the specified attributes.
+
+a0: the guest physical address of the start page
+a1: the number of (4kb) pages (must be contiguous in GPA space)
+a2: attributes
+
+ Where 'attributes' :
+ * bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
+ * bit 4 - plaintext = 0, encrypted = 1
+ * bits 63:5 - reserved (must be zero)
+
+**Implementation note**: this hypercall is implemented in userspace via
+the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
+before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In
+addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
+must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
new file mode 100644
index 000000000000..7ff588826b9f
--- /dev/null
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+KVM for x86 systems
+===================
+
+.. toctree::
+ :maxdepth: 2
+
+ amd-memory-encryption
+ cpuid
+ errata
+ halt-polling
+ hypercalls
+ mmu
+ msr
+ nested-vmx
+ running-nested-guests
+ timekeeping
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/x86/mmu.rst
index 60981887d20b..8364afa228ec 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/x86/mmu.rst
@@ -161,7 +161,7 @@ Shadow pages contain the following information:
If clear, this page corresponds to a guest page table denoted by the gfn
field.
role.quadrant:
- When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
+ When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit
sptes. That means a guest page table contains more ptes than the host,
so multiple shadow pages are needed to shadow one guest page.
For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
@@ -171,17 +171,17 @@ Shadow pages contain the following information:
shadow pages) so role.quadrant takes values in the range 0..3. Each
quadrant maps 1GB virtual address space.
role.access:
- Inherited guest access permissions in the form uwx. Note execute
- permission is positive, not negative.
+ Inherited guest access permissions from the parent ptes in the form uwx.
+ Note execute permission is positive, not negative.
role.invalid:
The page is invalid and should not be used. It is a root page that is
currently pinned (by a cpu hardware register pointing to it); once it is
unpinned it will be destroyed.
- role.gpte_is_8_bytes:
- Reflects the size of the guest PTE for which the page is valid, i.e. '1'
- if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
- role.nxe:
- Contains the value of efer.nxe for which the page is valid.
+ role.has_4_byte_gpte:
+ Reflects the size of the guest PTE for which the page is valid, i.e. '0'
+ if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use.
+ role.efer_nx:
+ Contains the value of efer.nx for which the page is valid.
role.cr0_wp:
Contains the value of cr0.wp for which the page is valid.
role.smep_andnot_wp:
@@ -192,9 +192,6 @@ Shadow pages contain the following information:
Contains the value of cr4.smap && !cr0.wp for which the page is valid
(pages for which this is true are different from other pages; see the
treatment of cr0.wp=0 below).
- role.ept_sp:
- This is a virtual flag to denote a shadowed nested EPT page. ept_sp
- is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
role.smm:
Is 1 if the page is valid in system management mode. This field
determines which of the kvm_memslots array was used to build this
@@ -205,6 +202,10 @@ Shadow pages contain the following information:
Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D
bits before Haswell; shadow EPT page tables also cannot use A/D bits
if the L1 hypervisor does not enable them.
+ role.passthrough:
+ The page is not backed by a guest page table, but its first entry
+ points to one. This is set if NPT uses 5-level page tables (host
+ CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1).
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
@@ -319,7 +320,7 @@ Handling a page fault is performed as follows:
- If both P bit and R/W bit of error code are set, this could possibly
be handled as a "fast page fault" (fixed without taking the MMU lock). See
- the description in Documentation/virt/kvm/locking.txt.
+ the description in Documentation/virt/kvm/locking.rst.
- if needed, walk the guest page tables to determine the guest translation
(gva->gpa or ngpa->gpa)
@@ -376,7 +377,7 @@ Emulating cr0.wp
================
If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
-works for the guest kernel, not guest guest userspace. When the guest
+works for the guest kernel, not guest userspace. When the guest
cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
semantics require allowing any guest kernel access plus user read access).
@@ -455,7 +456,7 @@ If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.
-Since only 19 bits are used to store generation-number on mmio spte, all
+Since only 18 bits are used to store generation-number on mmio spte, all
pages are zapped when there is an overflow.
Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
@@ -480,4 +481,4 @@ Further reading
===============
- NPT presentation from KVM Forum 2008
- http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
+ https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 33892036672d..9315fc385fb0 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -190,41 +190,72 @@ MSR_KVM_ASYNC_PF_EN:
0x4b564d02
data:
- Bits 63-6 hold 64-byte aligned physical address of a
- 64 byte memory area which must be in guest RAM and must be
- zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
- when asynchronous page faults are enabled on the vcpu 0 when
- disabled. Bit 1 is 1 if asynchronous page faults can be injected
- when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
- are delivered to L1 as #PF vmexits. Bit 2 can be set only if
- KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
-
- First 4 byte of 64 byte memory location will be written to by
- the hypervisor at the time of asynchronous page fault (APF)
- injection to indicate type of asynchronous page fault. Value
- of 1 means that the page referred to by the page fault is not
- present. Value 2 means that the page is now available. Disabling
- interrupt inhibits APFs. Guest must not enable interrupt
- before the reason is read, or it may be overwritten by another
- APF. Since APF uses the same exception vector as regular page
- fault guest must reset the reason to 0 before it does
- something that can generate normal page fault. If during page
- fault APF reason is 0 it means that this is regular page
- fault.
-
- During delivery of type 1 APF cr2 contains a token that will
- be used to notify a guest when missing page becomes
- available. When page becomes available type 2 APF is sent with
- cr2 set to the token associated with the page. There is special
- kind of token 0xffffffff which tells vcpu that it should wake
- up all processes waiting for APFs and no individual type 2 APFs
- will be sent.
+ Asynchronous page fault (APF) control MSR.
+
+ Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+ which must be in guest RAM and must be zeroed. This memory is expected
+ to hold a copy of the following structure::
+
+ struct kvm_vcpu_pv_apf_data {
+ /* Used for 'page not present' events delivered via #PF */
+ __u32 flags;
+
+ /* Used for 'page ready' events delivered via interrupt notification */
+ __u32 token;
+
+ __u8 pad[56];
+ __u32 enabled;
+ };
+
+ Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+ when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+ Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+ cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+ #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+ present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
+ events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
+ CPUID.
+
+ 'Page not present' events are currently always delivered as synthetic
+ #PF exception. During delivery of these events APF CR2 register contains
+ a token that will be used to notify the guest when missing page becomes
+ available. Also, to make it possible to distinguish between real #PF and
+ APF, first 4 bytes of 64 byte memory location ('flags') will be written
+ to by the hypervisor at the time of injection. Only first bit of 'flags'
+ is currently supported, when set, it indicates that the guest is dealing
+ with asynchronous 'page not present' event. If during a page fault APF
+ 'flags' is '0' it means that this is regular page fault. Guest is
+ supposed to clear 'flags' when it is done handling #PF exception so the
+ next event can be delivered.
+
+ Note, since APF 'page not present' events use the same exception vector
+ as regular page fault, guest must reset 'flags' to '0' before it does
+ something that can generate normal page fault.
+
+ Bytes 5-7 of 64 byte memory location ('token') will be written to by the
+ hypervisor at the time of APF 'page ready' event injection. The content
+ of these bytes is a token which was previously delivered as 'page not
+ present' event. The event indicates the page in now available. Guest is
+ supposed to write '0' to 'token' when it is done handling 'page ready'
+ event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
+ writing to the MSR forces KVM to re-scan its queue and deliver the next
+ pending notification.
+
+ Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
+ ready' APF delivery needs to be written to before enabling APF mechanism
+ in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
+ available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+ Note, previously, 'page ready' events were delivered via the same #PF
+ exception as 'page not present' events but this is now deprecated. If
+ bit 3 (interrupt based delivery) is not set APF events are not delivered.
If APF is disabled while there are outstanding APFs, they will
not be delivered.
- Currently type 2 APF will be always delivered on the same vcpu as
- type 1 was, but guest should not rely on that.
+ Currently 'page ready' APF events will be always delivered on the
+ same vcpu as 'page not present' event was, but guest should not rely on
+ that.
MSR_KVM_STEAL_TIME:
0x4b564d03
@@ -319,3 +350,42 @@ data:
KVM guests can request the host not to poll on HLT, for example if
they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+ 0x4b564d06
+
+data:
+ Second asynchronous page fault (APF) control MSR.
+
+ Bits 0-7: APIC vector for delivery of 'page ready' APF events.
+ Bits 8-63: Reserved
+
+ Interrupt vector for asynchnonous 'page ready' notifications delivery.
+ The vector has to be set up before asynchronous page fault mechanism
+ is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
+ KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_ASYNC_PF_ACK:
+ 0x4b564d07
+
+data:
+ Asynchronous page fault (APF) acknowledgment.
+
+ When the guest is done processing 'page ready' APF event and 'token'
+ field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
+ write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
+ and check if there are more notifications pending. The MSR is available
+ if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+ 0x4b564d08
+
+data:
+ This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
+ CPUID. Bit 0 represents whether live migration of the guest is allowed.
+
+ When a guest is started, bit 0 will be 0 if the guest has encrypted
+ memory and 1 if the guest does not have encrypted memory. If the
+ guest is communicating page encryption status to the host using the
+ ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
+ allow live migration of the guest.
diff --git a/Documentation/virt/kvm/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index 592b0ab6970b..ac2095d41f02 100644
--- a/Documentation/virt/kvm/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -22,7 +22,7 @@ its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:
- http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+ https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
Terminology
@@ -37,8 +37,10 @@ call L2.
Running nested VMX
------------------
-The nested VMX feature is disabled by default. It can be enabled by giving
-the "nested=1" option to the kvm-intel module.
+The nested VMX feature is enabled by default since Linux kernel v4.20. For
+older Linux kernel, it can be enabled by giving the "nested=1" option to the
+kvm-intel module.
+
No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
@@ -116,10 +118,7 @@ struct shadow_vmcs is ever changed.
natural_width cr4_guest_host_mask;
natural_width cr0_read_shadow;
natural_width cr4_read_shadow;
- natural_width cr3_target_value0;
- natural_width cr3_target_value1;
- natural_width cr3_target_value2;
- natural_width cr3_target_value3;
+ natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
natural_width exit_qualification;
natural_width guest_linear_address;
natural_width guest_cr0;
diff --git a/Documentation/virt/kvm/x86/running-nested-guests.rst b/Documentation/virt/kvm/x86/running-nested-guests.rst
new file mode 100644
index 000000000000..a27e6768d900
--- /dev/null
+++ b/Documentation/virt/kvm/x86/running-nested-guests.rst
@@ -0,0 +1,278 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Running nested guests with KVM
+==============================
+
+A nested guest is the ability to run a guest inside another guest (it
+can be KVM-based or a different hypervisor). The straightforward
+example is a KVM guest that in turn runs on a KVM guest (the rest of
+this document is built on this example)::
+
+ .----------------. .----------------.
+ | | | |
+ | L2 | | L2 |
+ | (Nested Guest) | | (Nested Guest) |
+ | | | |
+ |----------------'--'----------------|
+ | |
+ | L1 (Guest Hypervisor) |
+ | KVM (/dev/kvm) |
+ | |
+ .------------------------------------------------------.
+ | L0 (Host Hypervisor) |
+ | KVM (/dev/kvm) |
+ |------------------------------------------------------|
+ | Hardware (with virtualization extensions) |
+ '------------------------------------------------------'
+
+Terminology:
+
+- L0 – level-0; the bare metal host, running KVM
+
+- L1 – level-1 guest; a VM running on L0; also called the "guest
+ hypervisor", as it itself is capable of running KVM.
+
+- L2 – level-2 guest; a VM running on L1, this is the "nested guest"
+
+.. note:: The above diagram is modelled after the x86 architecture;
+ s390x, ppc64 and other architectures are likely to have
+ a different design for nesting.
+
+ For example, s390x always has an LPAR (LogicalPARtition)
+ hypervisor running on bare metal, adding another layer and
+ resulting in at least four levels in a nested setup — L0 (bare
+ metal, running the LPAR hypervisor), L1 (host hypervisor), L2
+ (guest hypervisor), L3 (nested guest).
+
+ This document will stick with the three-level terminology (L0,
+ L1, and L2) for all architectures; and will largely focus on
+ x86.
+
+
+Use Cases
+---------
+
+There are several scenarios where nested KVM can be useful, to name a
+few:
+
+- As a developer, you want to test your software on different operating
+ systems (OSes). Instead of renting multiple VMs from a Cloud
+ Provider, using nested KVM lets you rent a large enough "guest
+ hypervisor" (level-1 guest). This in turn allows you to create
+ multiple nested guests (level-2 guests), running different OSes, on
+ which you can develop and test your software.
+
+- Live migration of "guest hypervisors" and their nested guests, for
+ load balancing, disaster recovery, etc.
+
+- VM image creation tools (e.g. ``virt-install``, etc) often run
+ their own VM, and users expect these to work inside a VM.
+
+- Some OSes use virtualization internally for security (e.g. to let
+ applications run safely in isolation).
+
+
+Enabling "nested" (x86)
+-----------------------
+
+From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
+by default for Intel and AMD. (Though your Linux distribution might
+override this default.)
+
+In case you are running a Linux kernel older than v4.19, to enable
+nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
+persist this setting across reboots, you can add it in a config file, as
+shown below:
+
+1. On the bare metal host (L0), list the kernel modules and ensure that
+ the KVM modules::
+
+ $ lsmod | grep -i kvm
+ kvm_intel 133627 0
+ kvm 435079 1 kvm_intel
+
+2. Show information for ``kvm_intel`` module::
+
+ $ modinfo kvm_intel | grep -i nested
+ parm: nested:bool
+
+3. For the nested KVM configuration to persist across reboots, place the
+ below in ``/etc/modprobed/kvm_intel.conf`` (create the file if it
+ doesn't exist)::
+
+ $ cat /etc/modprobe.d/kvm_intel.conf
+ options kvm-intel nested=y
+
+4. Unload and re-load the KVM Intel module::
+
+ $ sudo rmmod kvm-intel
+ $ sudo modprobe kvm-intel
+
+5. Verify if the ``nested`` parameter for KVM is enabled::
+
+ $ cat /sys/module/kvm_intel/parameters/nested
+ Y
+
+For AMD hosts, the process is the same as above, except that the module
+name is ``kvm-amd``.
+
+
+Additional nested-related kernel parameters (x86)
+-------------------------------------------------
+
+If your hardware is sufficiently advanced (Intel Haswell processor or
+higher, which has newer hardware virt extensions), the following
+additional features will also be enabled by default: "Shadow VMCS
+(Virtual Machine Control Structure)", APIC Virtualization on your bare
+metal host (L0). Parameters for Intel hosts::
+
+ $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/enable_apicv
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/ept
+ Y
+
+.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
+ ensure the above are enabled (particularly
+ ``enable_shadow_vmcs`` and ``ept``).
+
+
+Starting a nested guest (x86)
+-----------------------------
+
+Once your bare metal host (L0) is configured for nesting, you should be
+able to start an L1 guest with::
+
+ $ qemu-kvm -cpu host [...]
+
+The above will pass through the host CPU's capabilities as-is to the
+gues); or for better live migration compatibility, use a named CPU
+model supported by QEMU. e.g.::
+
+ $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on
+
+then the guest hypervisor will subsequently be capable of running a
+nested guest with accelerated KVM.
+
+
+Enabling "nested" (s390x)
+-------------------------
+
+1. On the host hypervisor (L0), enable the ``nested`` parameter on
+ s390x::
+
+ $ rmmod kvm
+ $ modprobe kvm nested=1
+
+.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
+ with the ``nested`` paramter — i.e. to be able to enable
+ ``nested``, the ``hpage`` parameter *must* be disabled.
+
+2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
+ feature — with QEMU, this can be done by using "host passthrough"
+ (via the command-line ``-cpu host``).
+
+3. Now the KVM module can be loaded in the L1 (guest hypervisor)::
+
+ $ modprobe kvm
+
+
+Live migration with nested KVM
+------------------------------
+
+Migrating an L1 guest, with a *live* nested guest in it, to another
+bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
+Intel x86 systems, and even on older versions for s390x.
+
+On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
+should no longer be migrated or saved (refer to QEMU documentation on
+"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
+or save-and-load an L1 guest while an L2 guest is running will result in
+undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
+kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
+guest can no longer be considered stable or secure, and must be restarted.
+Migrating an L1 guest merely configured to support nesting, while not
+actually running L2 guests, is expected to function normally even on AMD
+systems but may fail once guests are started.
+
+Migrating an L2 guest is always expected to succeed, so all the following
+scenarios should work even on AMD systems:
+
+- Migrating a nested guest (L2) to another L1 guest on the *same* bare
+ metal host.
+
+- Migrating a nested guest (L2) to another L1 guest on a *different*
+ bare metal host.
+
+- Migrating a nested guest (L2) to a bare metal host.
+
+Reporting bugs from nested setups
+-----------------------------------
+
+Debugging "nested" problems can involve sifting through log files across
+L0, L1 and L2; this can result in tedious back-n-forth between the bug
+reporter and the bug fixer.
+
+- Mention that you are in a "nested" setup. If you are running any kind
+ of "nesting" at all, say so. Unfortunately, this needs to be called
+ out because when reporting bugs, people tend to forget to even
+ *mention* that they're using nested virtualization.
+
+- Ensure you are actually running KVM on KVM. Sometimes people do not
+ have KVM enabled for their guest hypervisor (L1), which results in
+ them running with pure emulation or what QEMU calls it as "TCG", but
+ they think they're running nested KVM. Thus confusing "nested Virt"
+ (which could also mean, QEMU on KVM) with "nested KVM" (KVM on KVM).
+
+Information to collect (generic)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following is not an exhaustive list, but a very good starting point:
+
+ - Kernel, libvirt, and QEMU version from L0
+
+ - Kernel, libvirt and QEMU version from L1
+
+ - QEMU command-line of L1 -- when using libvirt, you'll find it here:
+ ``/var/log/libvirt/qemu/instance.log``
+
+ - QEMU command-line of L2 -- as above, when using libvirt, get the
+ complete libvirt-generated QEMU command-line
+
+ - ``cat /sys/cpuinfo`` from L0
+
+ - ``cat /sys/cpuinfo`` from L1
+
+ - ``lscpu`` from L0
+
+ - ``lscpu`` from L1
+
+ - Full ``dmesg`` output from L0
+
+ - Full ``dmesg`` output from L1
+
+x86-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Both the below commands, ``x86info`` and ``dmidecode``, should be
+available on most Linux distributions with the same name:
+
+ - Output of: ``x86info -a`` from L0
+
+ - Output of: ``x86info -a`` from L1
+
+ - Output of: ``dmidecode`` from L0
+
+ - Output of: ``dmidecode`` from L1
+
+s390x-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Along with the earlier mentioned generic details, the below is
+also recommended:
+
+ - ``/proc/sysinfo`` from L1; this will also include the info from L0
diff --git a/Documentation/virt/kvm/timekeeping.rst b/Documentation/virt/kvm/x86/timekeeping.rst
index 21ae7efa29ba..21ae7efa29ba 100644
--- a/Documentation/virt/kvm/timekeeping.rst
+++ b/Documentation/virt/kvm/x86/timekeeping.rst