Diffstat (limited to 'Documentation/virt/kvm/api.rst')
-rw-r--r-- Documentation/virt/kvm/api.rst 1470
1 files changed, 1172 insertions, 298 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index dca926762f1f..0b5a33ee71ee 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
The new VM has no virtual cpus and no memory.
You probably want to use 0 as machine type.
+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -272,18 +291,6 @@ the VCPU file descriptor can be mmap-ed, including:
KVM_CAP_DIRTY_LOG_RING, see section 8.3.
-4.6 KVM_SET_MEMORY_REGION
--------------------------
-
-:Capability: basic
-:Architectures: all
-:Type: vm ioctl
-:Parameters: struct kvm_memory_region (in)
-:Returns: 0 on success, -1 on error
-
-This ioctl is obsolete and has been removed.
-
-
4.7 KVM_CREATE_VCPU
-------------------
@@ -365,20 +372,9 @@ The bits in the dirty bitmap are cleared before the ioctl returns, unless
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information,
see the description of the capability.
-Note that the Xen shared info page, if configured, shall always be assumed
+Note that the Xen shared_info page, if configured, shall always be assumed
to be dirty. KVM will not explicitly mark it such.
-4.9 KVM_SET_MEMORY_ALIAS
-------------------------
-
-:Capability: basic
-:Architectures: x86
-:Type: vm ioctl
-:Parameters: struct kvm_memory_alias (in)
-:Returns: 0 (success), -1 (error)
-
-This ioctl is obsolete and has been removed.
-
4.10 KVM_RUN
------------
@@ -439,6 +435,13 @@ Reads the general purpose registers from the vcpu.
__u64 pc;
};
+ /* LoongArch */
+ struct kvm_regs {
+ /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+ unsigned long gpr[32];
+ unsigned long pc;
+ };
+
4.12 KVM_SET_REGS
-----------------
@@ -529,7 +532,7 @@ translation mode.
------------------
:Capability: basic
-:Architectures: x86, ppc, mips, riscv
+:Architectures: x86, ppc, mips, riscv, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_interrupt (in)
:Returns: 0 on success, negative on failure.
@@ -563,7 +566,7 @@ ioctl is useful if the in-kernel PIC is not used.
PPC:
^^^^
-Queues an external interrupt to be injected. This ioctl is overleaded
+Queues an external interrupt to be injected. This ioctl is overloaded
with 3 different irq values:
a) KVM_INTERRUPT_SET
@@ -601,7 +604,7 @@ This is an asynchronous vcpu ioctl and can be invoked from any thread.
RISC-V:
^^^^^^^
-Queues an external interrupt to be injected into the virutal CPU. This ioctl
+Queues an external interrupt to be injected into the virtual CPU. This ioctl
is overloaded with 2 different irq values:
a) KVM_INTERRUPT_SET
@@ -615,17 +618,13 @@ b) KVM_INTERRUPT_UNSET
This is an asynchronous vcpu ioctl and can be invoked from any thread.
+LOONGARCH:
+^^^^^^^^^^
-4.17 KVM_DEBUG_GUEST
---------------------
-
-:Capability: basic
-:Architectures: none
-:Type: vcpu ioctl
-:Parameters: none)
-:Returns: -1 on error
+Queues an external interrupt to be injected into the virtual CPU. A negative
+interrupt number dequeues the interrupt.
-Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.18 KVM_GET_MSRS
@@ -760,7 +759,7 @@ signal mask.
----------------
:Capability: basic
-:Architectures: x86
+:Architectures: x86, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_fpu (out)
:Returns: 0 on success, -1 on error
@@ -769,7 +768,7 @@ Reads the floating point state from the vcpu.
::
- /* for KVM_GET_FPU and KVM_SET_FPU */
+ /* x86: for KVM_GET_FPU and KVM_SET_FPU */
struct kvm_fpu {
__u8 fpr[8][16];
__u16 fcw;
@@ -784,12 +783,21 @@ Reads the floating point state from the vcpu.
__u32 pad2;
};
+ /* LoongArch: for KVM_GET_FPU and KVM_SET_FPU */
+ struct kvm_fpu {
+ __u32 fcsr;
+ __u64 fcc;
+ struct kvm_fpureg {
+ __u64 val64[4];
+ }fpr[32];
+ };
+
4.23 KVM_SET_FPU
----------------
:Capability: basic
-:Architectures: x86
+:Architectures: x86, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_fpu (in)
:Returns: 0 on success, -1 on error
@@ -798,7 +806,7 @@ Writes the floating point state to the vcpu.
::
- /* for KVM_GET_FPU and KVM_SET_FPU */
+ /* x86: for KVM_GET_FPU and KVM_SET_FPU */
struct kvm_fpu {
__u8 fpr[8][16];
__u16 fcw;
@@ -813,6 +821,15 @@ Writes the floating point state to the vcpu.
__u32 pad2;
};
+ /* LoongArch: for KVM_GET_FPU and KVM_SET_FPU */
+ struct kvm_fpu {
+ __u32 fcsr;
+ __u64 fcc;
+ struct kvm_fpureg {
+ __u64 val64[4];
+ }fpr[32];
+ };
+
4.24 KVM_CREATE_IRQCHIP
-----------------------
@@ -988,7 +1005,7 @@ be set in the flags field of this ioctl:
The KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag requests KVM to generate
the contents of the hypercall page automatically; hypercalls will be
intercepted and passed to userspace through KVM_EXIT_XEN. In this
-ase, all of the blob size and address fields must be zero.
+case, all of the blob size and address fields must be zero.
The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates to KVM that userspace
will always use the KVM_XEN_HVM_EVTCHN_SEND ioctl to deliver event
@@ -1093,7 +1110,7 @@ Other flags returned by ``KVM_GET_CLOCK`` are accepted but ignored.
:Extended by: KVM_CAP_INTR_SHADOW
:Architectures: x86, arm64
:Type: vcpu ioctl
-:Parameters: struct kvm_vcpu_event (out)
+:Parameters: struct kvm_vcpu_events (out)
:Returns: 0 on success, -1 on error
X86:
@@ -1150,6 +1167,10 @@ The following bits are defined in the flags field:
fields contain a valid state. This bit will be set whenever
KVM_CAP_EXCEPTION_PAYLOAD is enabled.
+- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
+ triple_fault_pending field contains a valid state. This bit will
+ be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
+
ARM64:
^^^^^^
@@ -1212,7 +1233,7 @@ directly to the virtual CPU).
:Extended by: KVM_CAP_INTR_SHADOW
:Architectures: x86, arm64
:Type: vcpu ioctl
-:Parameters: struct kvm_vcpu_event (in)
+:Parameters: struct kvm_vcpu_events (in)
:Returns: 0 on success, -1 on error
X86:
@@ -1245,6 +1266,10 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.
+If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+can be set in the flags field to signal that the triple_fault field contains
+a valid state and shall be written into the VCPU.
+
ARM64:
^^^^^^
@@ -1324,7 +1349,7 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
};
- /* for kvm_memory_region::flags */
+ /* for kvm_userspace_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
@@ -1369,10 +1394,14 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).
-It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
-The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
-allocation and is deprecated.
-
+Note: On arm64, a write generated by the page-table walker (to update
+the Access and Dirty flags, for example) never results in a
+KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
+is because KVM cannot provide the data that would be written by the
+page-table walker, making it impossible to emulate the access.
+Instead, an abort (data abort if the cause of the page-table update
+was a load or a store, instruction abort if it was an instruction
+fetch) is injected in the guest.
4.36 KVM_SET_TSS_ADDR
---------------------
@@ -1398,7 +1427,7 @@ documentation when it pops into existence).
-------------------
:Capability: KVM_CAP_ENABLE_CAP
-:Architectures: mips, ppc, s390, x86
+:Architectures: mips, ppc, s390, x86, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_enable_cap (in)
:Returns: 0 on success; -1 on error
@@ -1453,7 +1482,7 @@ for vm-wide capabilities.
---------------------
:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm64, riscv
+:Architectures: x86, s390, arm64, riscv, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (out)
:Returns: 0 on success; -1 on error
@@ -1471,7 +1500,7 @@ Possible values are:
========================== ===============================================
KVM_MP_STATE_RUNNABLE the vcpu is currently running
- [x86,arm64,riscv]
+ [x86,arm64,riscv,loongarch]
KVM_MP_STATE_UNINITIALIZED the vcpu is an application processor (AP)
which has not yet received an INIT signal [x86]
KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is
@@ -1527,11 +1556,14 @@ For riscv:
The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
+On LoongArch, only the KVM_MP_STATE_RUNNABLE state is used to reflect
+whether the vcpu is runnable.
+
4.39 KVM_SET_MP_STATE
---------------------
:Capability: KVM_CAP_MP_STATE
-:Architectures: x86, s390, arm64, riscv
+:Architectures: x86, s390, arm64, riscv, loongarch
:Type: vcpu ioctl
:Parameters: struct kvm_mp_state (in)
:Returns: 0 on success; -1 on error
@@ -1549,6 +1581,9 @@ For arm64/riscv:
The only states that are valid are KVM_MP_STATE_STOPPED and
KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not.
+On LoongArch, only the KVM_MP_STATE_RUNNABLE state is used to reflect
+whether the vcpu is runnable.
+
4.40 KVM_SET_IDENTITY_MAP_ADDR
------------------------------
@@ -2270,6 +2305,8 @@ Errors:
EINVAL invalid register ID, or no such register or used with VMs in
protected virtualization mode on s390
EPERM (arm64) register access not allowed before vcpu finalization
+ EBUSY (riscv) changing register value not allowed after the vcpu
+ has run at least once
====== ============================================================
(These error codes are indicative only: do not rely on a specific error
@@ -2624,7 +2661,7 @@ follows::
this vcpu, and determines which register slices are visible through
this ioctl interface.
-(See Documentation/arm64/sve.rst for an explanation of the "vq"
+(See Documentation/arch/arm64/sve.rst for an explanation of the "vq"
nomenclature.)
KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
@@ -2733,7 +2770,7 @@ The isa config register can be read anytime but can only be written before
a Guest VCPU runs. It will have ISA feature bits matching underlying host
set by default.
-RISC-V core registers represent the general excution state of a Guest VCPU
+RISC-V core registers represent the general execution state of a Guest VCPU
and it has the following id bit patterns::
0x8020 0000 02 <index into the kvm_riscv_core struct:24> (32bit Host)
@@ -2850,6 +2887,19 @@ Following are the RISC-V D-extension registers:
0x8020 0000 0600 0020 fcsr Floating point control and status register
======================= ========= =============================================
+LoongArch registers are mapped using the lower 32 bits. The upper 16 bits of
+those 32 bits hold the register group type.
+
+LoongArch CSR registers are used to control the guest CPU or to read its
+status, and they have the following id bit patterns::
+
+ 0x9030 0000 0001 00 <reg:5> <sel:3> (64-bit)
+
+LoongArch KVM control registers are used to implement newly defined functions
+such as setting the vcpu counter or resetting the vcpu, and they have the following id bit patterns::
+
+ 0x9030 0000 0002 <reg:16>
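
Reading aid, not part of the patch: a minimal sketch of how the CSR id pattern documented above could be composed. The macro and helper names are hypothetical; only the bit pattern ``0x9030 0000 0001 00 <reg:5> <sel:3>`` is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed prefix of the documented pattern; the low byte is <reg:5> <sel:3>. */
#define EXAMPLE_LOONGARCH_CSR_BASE 0x9030000000010000ULL

/* Hypothetical helper: pack a 5-bit reg and a 3-bit sel into the low byte. */
static uint64_t example_loongarch_csr_id(unsigned int reg, unsigned int sel)
{
    assert(reg < 32 && sel < 8);
    return EXAMPLE_LOONGARCH_CSR_BASE | ((uint64_t)reg << 3) | (uint64_t)sel;
}
```

Under this reading, ``example_loongarch_csr_id(1, 0)`` yields ``0x9030000000010008``. Real code would normally rely on the uapi headers rather than hand-built constants.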
+
4.69 KVM_GET_ONE_REG
--------------------
@@ -2998,7 +3048,9 @@ KVM_CREATE_PIT2. The state is returned in the following structure::
Valid flags are::
/* disable PIT in HPET legacy mode */
- #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ /* speaker port data bit enabled */
+ #define KVM_PIT_FLAGS_SPEAKER_DATA_ON 0x00000002
This IOCTL replaces the obsolete KVM_GET_PIT.
@@ -3070,7 +3122,7 @@ as follow::
};
An entry with a "page_shift" of 0 is unused. Because the array is
-organized in increasing order, a lookup can stop when encoutering
+organized in increasing order, a lookup can stop when encountering
such an entry.
The "slb_enc" field provides the encoding to use in the SLB for the
@@ -3283,6 +3335,7 @@ valid entries found.
----------------------
:Capability: KVM_CAP_DEVICE_CTRL
+:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_create_device (in/out)
:Returns: 0 on success, -1 on error
@@ -3323,6 +3376,7 @@ number.
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
KVM_CAP_VCPU_ATTRIBUTES for vcpu device
KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
+:Architectures: x86, arm64, s390
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error
@@ -3375,6 +3429,8 @@ return indicates the attribute is implemented. It does not necessarily
indicate that the attribute can be read or written in the device's
current state. "addr" is ignored.
+.. _KVM_ARM_VCPU_INIT:
+
4.82 KVM_ARM_VCPU_INIT
----------------------
@@ -3460,7 +3516,7 @@ Possible features:
- KVM_RUN and KVM_GET_REG_LIST are not available;
- KVM_GET_ONE_REG and KVM_SET_ONE_REG cannot be used to access
- the scalable archietctural SVE registers
+ the scalable architectural SVE registers
KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() or
KVM_REG_ARM64_SVE_FFR;
@@ -3506,7 +3562,7 @@ VCPU matching underlying host.
---------------------
:Capability: basic
-:Architectures: arm64, mips
+:Architectures: arm64, mips, riscv
:Type: vcpu ioctl
:Parameters: struct kvm_reg_list (in/out)
:Returns: 0 on success; -1 on error
@@ -3743,7 +3799,7 @@ The fields in each entry are defined as follows:
:Parameters: struct kvm_s390_mem_op (in)
:Returns: = 0 on success,
< 0 on generic error (e.g. -EFAULT or -ENOMEM),
- > 0 if an exception occurred while walking the page tables
+ 16 bit program exception code if the access causes such an exception
Read or write data from/to the VM's memory.
The KVM_CAP_S390_MEM_OP_EXTENSION capability specifies what functionality is
@@ -3761,6 +3817,8 @@ Parameters are specified via the following structure::
struct {
__u8 ar; /* the access register number */
__u8 key; /* access key, ignored if flag unset */
+ __u8 pad1[6]; /* ignored */
+ __u64 old_addr; /* ignored if flag unset */
};
__u32 sida_offset; /* offset into the sida */
__u8 reserved[32]; /* ignored */
@@ -3788,6 +3846,7 @@ Possible operations are:
* ``KVM_S390_MEMOP_ABSOLUTE_WRITE``
* ``KVM_S390_MEMOP_SIDA_READ``
* ``KVM_S390_MEMOP_SIDA_WRITE``
+ * ``KVM_S390_MEMOP_ABSOLUTE_CMPXCHG``
Logical read/write:
^^^^^^^^^^^^^^^^^^^
@@ -3836,7 +3895,7 @@ the checks required for storage key protection as one operation (as opposed to
user space getting the storage keys, performing the checks, and accessing
memory thereafter, which could lead to a delay between check and access).
Absolute accesses are permitted for the VM ioctl if KVM_CAP_S390_MEM_OP_EXTENSION
-is > 0.
+has the KVM_S390_MEMOP_EXTENSION_CAP_BASE bit set.
Currently absolute accesses are not permitted for VCPU ioctls.
Absolute accesses are permitted for non-protected guests only.
@@ -3844,7 +3903,26 @@ Supported flags:
* ``KVM_S390_MEMOP_F_CHECK_ONLY``
* ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
-The semantics of the flags are as for logical accesses.
+The semantics of the flags common with logical accesses are as for logical
+accesses.
+
+Absolute cmpxchg:
+^^^^^^^^^^^^^^^^^
+
+Perform cmpxchg on absolute guest memory. Intended for use with the
+KVM_S390_MEMOP_F_SKEY_PROTECTION flag.
+Instead of doing an unconditional write, the access occurs only if the target
+location contains the value pointed to by "old_addr".
+This is performed as an atomic cmpxchg with the length specified by the "size"
+parameter. "size" must be a power of two up to and including 16.
+If the exchange did not take place because the target value doesn't match the
+old value, the value "old_addr" points to is replaced by the target value.
+User space can tell if an exchange took place by checking if this replacement
+occurred. The cmpxchg op is permitted for the VM ioctl if
+KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.
+
+Supported flags:
+ * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
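
Reading aid, not part of the patch: a non-atomic model of the compare-and-exchange semantics described above, assuming plain byte buffers stand in for guest memory and for the buffer that "old_addr" points to. The function name is hypothetical, and the real operation is performed atomically by KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative model only (no atomicity, no storage key checks).
 * target:  stands in for the absolute guest memory location
 * old:     stands in for the buffer "old_addr" points to
 * new_val: the value that would be written on success
 * Returns true if the exchange took place.
 */
static bool example_cmpxchg_model(void *target, void *old,
                                  const void *new_val, size_t size)
{
    /* "size" must be a power of two up to and including 16. */
    assert(size != 0 && size <= 16 && (size & (size - 1)) == 0);

    if (memcmp(target, old, size) == 0) {
        memcpy(target, new_val, size);
        return true;
    }
    /* Mismatch: report the current target value back through "old". */
    memcpy(old, target, size);
    return false;
}
```

A caller that keeps its own copy of the expected value can detect a failed exchange by comparing that copy with the contents of "old" afterwards, which mirrors the replacement check described in the text.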
SIDA read/write:
^^^^^^^^^^^^^^^^
@@ -4064,7 +4142,7 @@ Queues an SMI on the thread's vcpu.
4.97 KVM_X86_SET_MSR_FILTER
----------------------------
-:Capability: KVM_X86_SET_MSR_FILTER
+:Capability: KVM_CAP_X86_MSR_FILTER
:Architectures: x86
:Type: vm ioctl
:Parameters: struct kvm_msr_filter
@@ -4094,77 +4172,70 @@ flags values for ``struct kvm_msr_filter_range``:
``KVM_MSR_FILTER_READ``
Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
- indicates that a read should immediately fail, while a 1 indicates that
- a read for a particular MSR should be handled regardless of the default
+ indicates that read accesses should be denied, while a 1 indicates that
+ a read for a particular MSR should be allowed regardless of the default
filter action.
``KVM_MSR_FILTER_WRITE``
Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
- indicates that a write should immediately fail, while a 1 indicates that
- a write for a particular MSR should be handled regardless of the default
+ indicates that write accesses should be denied, while a 1 indicates that
+ a write for a particular MSR should be allowed regardless of the default
filter action.
-``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
-
- Filter both read and write accesses to MSRs using the given bitmap. A 0
- in the bitmap indicates that both reads and writes should immediately fail,
- while a 1 indicates that reads and writes for a particular MSR are not
- filtered by this range.
-
flags values for ``struct kvm_msr_filter``:
``KVM_MSR_FILTER_DEFAULT_ALLOW``
If no filter range matches an MSR index that is getting accessed, KVM will
- fall back to allowing access to the MSR.
+ allow accesses to all MSRs by default.
``KVM_MSR_FILTER_DEFAULT_DENY``
If no filter range matches an MSR index that is getting accessed, KVM will
- fall back to rejecting access to the MSR. In this mode, all MSRs that should
- be processed by KVM need to explicitly be marked as allowed in the bitmaps.
+ deny accesses to all MSRs by default.
+
+This ioctl allows userspace to define up to 16 bitmaps of MSR ranges to deny
+guest MSR accesses that would normally be allowed by KVM. If an MSR is not
+covered by a specific range, the "default" filtering behavior applies. Each
+bitmap range covers MSRs from [base .. base+nmsrs).
+
+If an MSR access is denied by userspace, the resulting KVM behavior depends on
+whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
+enabled. If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
+on denied accesses, i.e. userspace effectively intercepts the MSR access. If
+KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
+on denied accesses.
-This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
-specify whether a certain MSR access should be explicitly filtered for or not.
+If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
+the access in accordance with the vCPU model. Note, KVM may still ultimately
+inject a #GP if an access is allowed by userspace, e.g. if KVM doesn't support
+the MSR, or to follow architectural behavior for the MSR.
-If this ioctl has never been invoked, MSR accesses are not guarded and the
-default KVM in-kernel emulation behavior is fully preserved.
+By default, KVM operates in KVM_MSR_FILTER_DEFAULT_ALLOW mode with no MSR range
+filters.
Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
an error.
-As soon as the filtering is in place, every MSR access is processed through
-the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
-x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
-and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
-register.
-
.. warning::
- MSR accesses coming from nested vmentry/vmexit are not filtered.
+ MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
This includes both writes to individual VMCS fields and reads/writes
through the MSR lists pointed to by the VMCS.
-If a bit is within one of the defined ranges, read and write accesses are
-guarded by the bitmap's value for the MSR index if the kind of access
-is included in the ``struct kvm_msr_filter_range`` flags. If no range
-cover this particular access, the behavior is determined by the flags
-field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
-and ``KVM_MSR_FILTER_DEFAULT_DENY``.
-
-Each bitmap range specifies a range of MSRs to potentially allow access on.
-The range goes from MSR index [base .. base+nmsrs]. The flags field
-indicates whether reads, writes or both reads and writes are filtered
-by setting a 1 bit in the bitmap for the corresponding MSR index.
+ x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
+ cover any x2APIC MSRs).
-If an MSR access is not permitted through the filtering, it generates a
-#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
-allows user space to deflect and potentially handle various MSR accesses
-into user space.
+Note, invoking this ioctl while a vCPU is running is inherently racy. However,
+KVM does guarantee that vCPUs will see either the previous filter or the new
+filter, e.g. MSRs with identical settings in both the old and new filter will
+have deterministic behavior.
-If a vCPU is in running state while this ioctl is invoked, the vCPU may
-experience inconsistent filtering behavior on MSR accesses.
+Similarly, if userspace wishes to intercept on denied accesses,
+KVM_MSR_EXIT_REASON_FILTER must be enabled before activating any filters, and
+left enabled until after all filters are deactivated. Failure to do so may
+result in KVM injecting a #GP instead of exiting to userspace.
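
Reading aid, not part of the patch: a minimal model of the allow/deny decision described above. The struct and macro names are hypothetical stand-ins that mirror the layout of ``struct kvm_msr_filter``. The first-match rule and the bit order within each bitmap byte are assumptions, and the x2APIC exception is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_FILTER_READ   (1u << 0)
#define EX_FILTER_WRITE  (1u << 1)
#define EX_DEFAULT_ALLOW (0u << 0)
#define EX_DEFAULT_DENY  (1u << 0)
#define EX_MAX_RANGES    16

struct ex_range {
    uint32_t flags;        /* EX_FILTER_READ and/or EX_FILTER_WRITE */
    uint32_t nmsrs;        /* number of MSRs covered by the bitmap */
    uint32_t base;         /* first MSR index covered */
    const uint8_t *bitmap; /* 1 = allow, 0 = deny */
};

struct ex_filter {
    uint32_t flags;        /* EX_DEFAULT_ALLOW or EX_DEFAULT_DENY */
    struct ex_range ranges[EX_MAX_RANGES];
};

/* Returns true if the access is allowed by the (modeled) filter. */
static bool ex_msr_allowed(const struct ex_filter *f, uint32_t msr,
                           uint32_t access)
{
    assert(access == EX_FILTER_READ || access == EX_FILTER_WRITE);

    for (size_t i = 0; i < EX_MAX_RANGES; i++) {
        const struct ex_range *r = &f->ranges[i];

        /* A range applies if it filters this access type and covers
         * the MSR, i.e. msr is in [base .. base+nmsrs). */
        if (!(r->flags & access) || msr < r->base ||
            msr - r->base >= r->nmsrs)
            continue;

        uint32_t bit = msr - r->base;
        return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
    }

    /* No range matched: the default action applies. */
    return !(f->flags & EX_DEFAULT_DENY);
}
```

A denied access in this model corresponds to either an exit to userspace or an injected #GP, depending on whether KVM_MSR_EXIT_REASON_FILTER is enabled, as described above.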
4.98 KVM_CREATE_SPAPR_TCE_64
----------------------------
@@ -4391,7 +4462,7 @@ This will have undefined effects on the guest if it has not already
placed itself in a quiescent state where no vcpu will make MMU enabled
memory accesses.
-On succsful completion, the pending HPT will become the guest's active
+On successful completion, the pending HPT will become the guest's active
HPT and the previous HPT will be discarded.
On failure, the guest will still be operating on its previous HPT.
@@ -4471,6 +4542,18 @@ not holding a previously reported uncorrected error).
:Parameters: struct kvm_s390_cmma_log (in, out)
:Returns: 0 on success, a negative value on error
+Errors:
+
+ ====== =============================================================
+ ENOMEM not enough memory can be allocated to complete the task
+ ENXIO if CMMA is not enabled
+ EINVAL if KVM_S390_CMMA_PEEK is not set but migration mode was not enabled
+ EINVAL if KVM_S390_CMMA_PEEK is not set but dirty tracking has been
+ disabled (and thus migration mode was automatically disabled)
+ EFAULT if the userspace address is invalid or if no page table is
+ present for the addresses (e.g. when using hugepages).
+ ====== =============================================================
+
This ioctl is used to get the values of the CMMA bits on the s390
architecture. It is meant to be used in two scenarios:
@@ -4551,12 +4634,6 @@ mask is unused.
values points to the userspace buffer where the result will be stored.
-This ioctl can fail with -ENOMEM if not enough memory can be allocated to
-complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
-KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
--EFAULT if the userspace address is invalid or if no page table is
-present for the addresses (e.g. when using hugepages).
-
4.108 KVM_S390_SET_CMMA_BITS
----------------------------
@@ -5000,7 +5077,7 @@ before the vcpu is fully usable.
Between KVM_ARM_VCPU_INIT and KVM_ARM_VCPU_FINALIZE, the feature may be
configured by use of ioctls such as KVM_SET_ONE_REG. The exact configuration
-that should be performaned and how to do it are feature-dependent.
+that should be performed and how to do it are feature-dependent.
Other calls that depend on a particular feature being finalized, such as
KVM_RUN, KVM_GET_REG_LIST, KVM_GET_ONE_REG and KVM_SET_ONE_REG, will fail with
@@ -5019,6 +5096,15 @@ using this ioctl.
:Parameters: struct kvm_pmu_event_filter (in)
:Returns: 0 on success, -1 on error
+Errors:
+
+ ====== ============================================================
+ EFAULT args[0] cannot be accessed
+ EINVAL args[0] contains invalid data in the filter or filter events
+ E2BIG nevents is too large
+ EBUSY not enough memory to allocate the filter
+ ====== ============================================================
+
::
struct kvm_pmu_event_filter {
@@ -5030,20 +5116,93 @@ using this ioctl.
__u64 events[0];
};
-This ioctl restricts the set of PMU events that the guest can program.
-The argument holds a list of events which will be allowed or denied.
-The eventsel+umask of each event the guest attempts to program is compared
-against the events field to determine whether the guest should have access.
-The events field only controls general purpose counters; fixed purpose
-counters are controlled by the fixed_counter_bitmap.
+This ioctl restricts the set of PMU events the guest can program by limiting
+which event select and unit mask combinations are permitted.
+
+The argument holds a list of filter events which will be allowed or denied.
+
+Filter events only control general purpose counters; fixed purpose counters
+are controlled by the fixed_counter_bitmap.
+
+Valid values for 'flags'::
+
+``0``
-No flags are defined yet, the field must be zero.
+To use this mode, clear the 'flags' field.
+
+In this mode each event will contain an event select + unit mask.
+
+When the guest attempts to program the PMU the guest's event select +
+unit mask is compared against the filter events to determine whether the
+guest should have access.
+
+``KVM_PMU_EVENT_FLAG_MASKED_EVENTS``
+:Capability: KVM_CAP_PMU_EVENT_MASKED_EVENTS
+
+In this mode each filter event will contain an event select, mask, match, and
+exclude value. To encode a masked event use::
+
+ KVM_PMU_ENCODE_MASKED_ENTRY()
+
+An encoded event will follow this layout::
+
+ Bits Description
+ ---- -----------
+ 7:0 event select (low bits)
+ 15:8 umask match
+ 31:16 unused
+ 35:32 event select (high bits)
+ 54:36 unused
+ 55 exclude bit
+ 63:56 umask mask
+
+When the guest attempts to program the PMU, these steps are followed in
+determining if the guest should have access:
+
+ 1. Match the event select from the guest against the filter events.
+ 2. If a match is found, match the guest's unit mask to the mask and match
+ values of the included filter events.
+ I.e. (unit mask & mask) == match && !exclude.
+ 3. If a match is found, match the guest's unit mask to the mask and match
+ values of the excluded filter events.
+ I.e. (unit mask & mask) == match && exclude.
+ 4.
+ a. If an included match is found and an excluded match is not found, filter
+ the event.
+ b. For everything else, do not filter the event.
+ 5.
+ a. If the event is filtered and it's an allow list, allow the guest to
+ program the event.
+ b. If the event is filtered and it's a deny list, do not allow the guest to
+ program the event.
+
+When setting a new pmu event filter, -EINVAL will be returned if any of the
+unused fields are set or if any of the high bits (35:32) in the event
+select are set when called on Intel.
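
Reading aid, not part of the patch: a sketch of the masked-event encoding and of steps 1 to 5 above. All function names are hypothetical (the real encoder is the ``KVM_PMU_ENCODE_MASKED_ENTRY()`` macro). The handling of events that are not filtered (denied for an allow list, permitted for a deny list) is an assumption inferred from step 5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder following the documented bit layout. */
static uint64_t ex_encode_masked(uint16_t event_select, uint8_t mask,
                                 uint8_t match, bool exclude)
{
    assert(event_select <= 0xfff);          /* 8 low bits + 4 high bits */
    return (uint64_t)(event_select & 0xff) |        /* bits 7:0   */
           ((uint64_t)match << 8) |                 /* bits 15:8  */
           ((uint64_t)(event_select >> 8) << 32) |  /* bits 35:32 */
           ((uint64_t)exclude << 55) |              /* bit 55     */
           ((uint64_t)mask << 56);                  /* bits 63:56 */
}

static uint16_t ex_event_select(uint64_t e)
{
    return (uint16_t)((e & 0xff) | (((e >> 32) & 0xf) << 8));
}

/* Steps 1-4: filtered means an included entry matches and no excluded
 * entry matches. */
static bool ex_is_filtered(const uint64_t *events, size_t n,
                           uint16_t sel, uint8_t umask)
{
    bool included = false, excluded = false;

    for (size_t i = 0; i < n; i++) {
        uint8_t match = (events[i] >> 8) & 0xff;
        uint8_t mask = (events[i] >> 56) & 0xff;

        if (ex_event_select(events[i]) != sel || (umask & mask) != match)
            continue;
        if ((events[i] >> 55) & 1)
            excluded = true;
        else
            included = true;
    }
    return included && !excluded;
}

/* Step 5: an allow list permits filtered events, a deny list rejects them. */
static bool ex_guest_may_program(const uint64_t *events, size_t n,
                                 bool allow_list, uint16_t sel, uint8_t umask)
{
    bool filtered = ex_is_filtered(events, n, sel, umask);

    return allow_list ? filtered : !filtered;
}
```

For instance, an included entry with mask 0 and match 0 covers every unit mask of its event select, and a second entry with the exclude bit set can then carve out individual unit masks.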
Valid values for 'action'::
#define KVM_PMU_EVENT_ALLOW 0
#define KVM_PMU_EVENT_DENY 1
+Via this API, KVM userspace can also control the behavior of the VM's fixed
+counters (if any) by configuring the "action" and "fixed_counter_bitmap" fields.
+
+Specifically, KVM follows the following pseudo-code when determining whether to
+allow the guest FixCtr[i] to count its pre-defined fixed event::
+
+ FixCtr[i]_is_allowed = (action == ALLOW) && (bitmap & BIT(i)) ||
+ (action == DENY) && !(bitmap & BIT(i));
+ FixCtr[i]_is_denied = !FixCtr[i]_is_allowed;
+
+KVM always consumes fixed_counter_bitmap; it's userspace's responsibility to
+ensure fixed_counter_bitmap is set correctly, e.g. if userspace wants to define
+a filter that only affects general purpose counters.
+
+Note, the "events" field also applies to fixed counters' hardcoded event_select
+and unit_mask values. "fixed_counter_bitmap" has higher priority than "events"
+if there is a contradiction between the two.
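
Reading aid, not part of the patch: the fixed-counter pseudo-code above written as a small C function. The function and macro names are hypothetical. The action values mirror ``KVM_PMU_EVENT_ALLOW`` (0) and ``KVM_PMU_EVENT_DENY`` (1) from the text, and a 32-bit bitmap is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_PMU_EVENT_ALLOW 0
#define EX_PMU_EVENT_DENY  1

/* Returns true if fixed counter i may count its pre-defined event. */
static bool ex_fixed_counter_allowed(uint32_t action, uint32_t bitmap,
                                     unsigned int i)
{
    assert(i < 32);

    bool bit_set = (bitmap >> i) & 1u;

    return (action == EX_PMU_EVENT_ALLOW && bit_set) ||
           (action == EX_PMU_EVENT_DENY && !bit_set);
}
```

Because the bitmap is always consumed, a zero bitmap combined with ALLOW denies every fixed counter in this model, while a zero bitmap combined with DENY leaves all of them allowed.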
+
4.121 KVM_PPC_SVM_OFF
---------------------
@@ -5127,7 +5286,15 @@ into ESA mode. This reset is a superset of the initial reset.
__u32 reserved[3];
};
-cmd values:
+**Ultravisor return codes**
+The Ultravisor return (reason) codes are provided by the kernel if an
+Ultravisor call has been executed to achieve the results expected by
+the command. Therefore they are independent of the IOCTL return
+code. If KVM changes `rc`, its value will always be greater than 0,
+so setting it to 0 before issuing a PV command is advised in order to
+detect a change of `rc`.
+
+**cmd values:**
KVM_PV_ENABLE
Allocate memory and register the VM with the Ultravisor, thereby
@@ -5143,11 +5310,13 @@ KVM_PV_ENABLE
===== =============================
KVM_PV_DISABLE
-
- Deregister the VM from the Ultravisor and reclaim the memory that
- had been donated to the Ultravisor, making it usable by the kernel
- again. All registered VCPUs are converted back to non-protected
- ones.
+ Deregister the VM from the Ultravisor and reclaim the memory that had
+ been donated to the Ultravisor, making it usable by the kernel again.
+ All registered VCPUs are converted back to non-protected ones. If a
+ previous protected VM had been prepared for asynchronous teardown with
+ KVM_PV_ASYNC_CLEANUP_PREPARE and not subsequently torn down with
+ KVM_PV_ASYNC_CLEANUP_PERFORM, it will be torn down in this call
+ together with the current protected VM.
KVM_PV_VM_SET_SEC_PARMS
Pass the image header from VM memory to the Ultravisor in
@@ -5160,109 +5329,147 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
-4.126 KVM_X86_SET_MSR_FILTER
-----------------------------
-
-:Capability: KVM_CAP_X86_MSR_FILTER
-:Architectures: x86
-:Type: vm ioctl
-:Parameters: struct kvm_msr_filter
-:Returns: 0 on success, < 0 on error
-
-::
-
- struct kvm_msr_filter_range {
- #define KVM_MSR_FILTER_READ (1 << 0)
- #define KVM_MSR_FILTER_WRITE (1 << 1)
- __u32 flags;
- __u32 nmsrs; /* number of msrs in bitmap */
- __u32 base; /* MSR index the bitmap starts at */
- __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
- };
-
- #define KVM_MSR_FILTER_MAX_RANGES 16
- struct kvm_msr_filter {
- #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
- #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
- __u32 flags;
- struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
- };
+KVM_PV_INFO
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
-flags values for ``struct kvm_msr_filter_range``:
+ Presents an API that provides Ultravisor related data to userspace
+ via subcommands. len_max is the size of the userspace buffer and
+ len_written is KVM's indication of how many bytes of that buffer
+ were actually written to. len_written can be used to determine the
+ valid fields if more response fields are added in the future.
-``KVM_MSR_FILTER_READ``
+ ::
- Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
- indicates that a read should immediately fail, while a 1 indicates that
- a read for a particular MSR should be handled regardless of the default
- filter action.
+ enum pv_cmd_info_id {
+ KVM_PV_INFO_VM,
+ KVM_PV_INFO_DUMP,
+ };
-``KVM_MSR_FILTER_WRITE``
-
- Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
- indicates that a write should immediately fail, while a 1 indicates that
- a write for a particular MSR should be handled regardless of the default
- filter action.
+ struct kvm_s390_pv_info_header {
+ __u32 id;
+ __u32 len_max;
+ __u32 len_written;
+ __u32 reserved;
+ };
-``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
+ struct kvm_s390_pv_info {
+ struct kvm_s390_pv_info_header header;
+ struct kvm_s390_pv_info_dump dump;
+ struct kvm_s390_pv_info_vm vm;
+ };
- Filter both read and write accesses to MSRs using the given bitmap. A 0
- in the bitmap indicates that both reads and writes should immediately fail,
- while a 1 indicates that reads and writes for a particular MSR are not
- filtered by this range.
+**subcommands:**
-flags values for ``struct kvm_msr_filter``:
+ KVM_PV_INFO_VM
+ This subcommand provides basic Ultravisor information for PV
+ hosts. These values are likely also exported as files in the sysfs
+ firmware UV query interface but they are more easily available to
+ programs in this API.
-``KVM_MSR_FILTER_DEFAULT_ALLOW``
+ The installed calls and feature_indication members provide the
+ installed UV calls and the UV's other feature indications.
- If no filter range matches an MSR index that is getting accessed, KVM will
- fall back to allowing access to the MSR.
+ The max_* members provide information about the maximum number of PV
+ vcpus, PV guests and PV guest memory size.
-``KVM_MSR_FILTER_DEFAULT_DENY``
+ ::
- If no filter range matches an MSR index that is getting accessed, KVM will
- fall back to rejecting access to the MSR. In this mode, all MSRs that should
- be processed by KVM need to explicitly be marked as allowed in the bitmaps.
+ struct kvm_s390_pv_info_vm {
+ __u64 inst_calls_list[4];
+ __u64 max_cpus;
+ __u64 max_guests;
+ __u64 max_guest_addr;
+ __u64 feature_indication;
+ };
-This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
-specify whether a certain MSR access should be explicitly filtered for or not.
-If this ioctl has never been invoked, MSR accesses are not guarded and the
-default KVM in-kernel emulation behavior is fully preserved.
+ KVM_PV_INFO_DUMP
+ This subcommand provides information related to dumping PV guests.
-Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
-filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
-an error.
+ ::
-As soon as the filtering is in place, every MSR access is processed through
-the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
-x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
-and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
-register.
-
-If a bit is within one of the defined ranges, read and write accesses are
-guarded by the bitmap's value for the MSR index if the kind of access
-is included in the ``struct kvm_msr_filter_range`` flags. If no range
-cover this particular access, the behavior is determined by the flags
-field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
-and ``KVM_MSR_FILTER_DEFAULT_DENY``.
-
-Each bitmap range specifies a range of MSRs to potentially allow access on.
-The range goes from MSR index [base .. base+nmsrs]. The flags field
-indicates whether reads, writes or both reads and writes are filtered
-by setting a 1 bit in the bitmap for the corresponding MSR index.
-
-If an MSR access is not permitted through the filtering, it generates a
-#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
-allows user space to deflect and potentially handle various MSR accesses
-into user space.
-
-Note, invoking this ioctl with a vCPU is running is inherently racy. However,
-KVM does guarantee that vCPUs will see either the previous filter or the new
-filter, e.g. MSRs with identical settings in both the old and new filter will
-have deterministic behavior.
+ struct kvm_s390_pv_info_dump {
+ __u64 dump_cpu_buffer_len;
+ __u64 dump_config_mem_buffer_per_1m;
+ __u64 dump_config_finalize_len;
+ };
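
One way to use `len_written` for forward compatibility is to check, per field, whether the field lies inside the bytes KVM reported as written. The sketch below assumes that field offsets and `len_written` are both measured from the start of the buffer handed to KVM (header included); that convention and the helper name `pv_info_field_valid()` are assumptions, not taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the header documented above. */
struct pv_info_header_sketch {
    uint32_t id;
    uint32_t len_max;
    uint32_t len_written;
    uint32_t reserved;
};

/*
 * Returns 1 if a response field occupying [offset, offset + size) lies
 * entirely within the bytes KVM reported as written, 0 otherwise.
 */
static int pv_info_field_valid(const struct pv_info_header_sketch *h,
                               size_t offset, size_t size)
{
    assert(h->len_written <= h->len_max);
    return size <= h->len_written && offset <= h->len_written - size;
}
```

A caller built against a newer header would pass `offsetof()` and `sizeof` of the field it wants to read, and fall back to a default when the check fails.
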
-4.127 KVM_XEN_HVM_SET_ATTR
+KVM_PV_DUMP
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
+
+ Presents an API that provides calls which facilitate dumping a
+ protected VM.
+
+ ::
+
+ struct kvm_s390_pv_dmp {
+ __u64 subcmd;
+ __u64 buff_addr;
+ __u64 buff_len;
+ __u64 gaddr; /* For dump storage state */
+ };
+
+ **subcommands:**
+
+ KVM_PV_DUMP_INIT
+ Initializes the dump process of a protected VM. If this call does
+ not succeed all other subcommands will fail with -EINVAL. This
+ subcommand will return -EINVAL if a dump process has not yet been
+ completed.
+
+ Not all PV VMs can be dumped; the owner needs to set the `dump
+ allowed` PCF bit 34 in the SE header to allow dumping.
+
+ KVM_PV_DUMP_CONFIG_STOR_STATE
+ Stores `buff_len` bytes of tweak component values starting with
+ the 1MB block specified by the absolute guest address
+ (`gaddr`). `buff_len` needs to be `conf_dump_storage_state_len`
+ aligned and at least as large as the `conf_dump_storage_state_len`
+ value provided by the dump uv_info data. The buffer at `buff_addr`
+ might be written to even if an error rc is returned, for instance
+ if a fault is encountered after writing the first page of data.
+
+ KVM_PV_DUMP_COMPLETE
+ If the subcommand succeeds it completes the dump process and lets
+ KVM_PV_DUMP_INIT be called again.
+
+ On success `conf_dump_finalize_len` bytes of completion data will be
+ stored to the `buff_addr`. The completion data contains a key
+ derivation seed, IV, tweak nonce and encryption keys as well as an
+ authentication tag all of which are needed to decrypt the dump at a
+ later time.
+
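The length rule for KVM_PV_DUMP_CONFIG_STOR_STATE can be checked in userspace before issuing the subcommand. A minimal sketch, where `unit_len` stands for the `conf_dump_storage_state_len` value mentioned above and the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if buff_len is a nonzero multiple of unit_len, i.e. it is
 * both aligned to unit_len and at least unit_len bytes long.
 */
static int dump_stor_state_len_ok(uint64_t buff_len, uint64_t unit_len)
{
    if (unit_len == 0)
        return 0;
    return buff_len >= unit_len && buff_len % unit_len == 0;
}
```
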
+KVM_PV_ASYNC_CLEANUP_PREPARE
+ :Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
+
+ Prepare the current protected VM for asynchronous teardown. Most
+ resources used by the current protected VM will be set aside for a
+ subsequent asynchronous teardown. The current protected VM will then
+ resume execution immediately as non-protected. There can be at most
+ one protected VM prepared for asynchronous teardown at any time. If
+ a protected VM had already been prepared for teardown without
+ subsequently calling KVM_PV_ASYNC_CLEANUP_PERFORM, this call will
+ fail. In that case, the userspace process should issue a normal
+ KVM_PV_DISABLE. The resources set aside with this call will need to
+ be cleaned up with a subsequent call to KVM_PV_ASYNC_CLEANUP_PERFORM
+ or KVM_PV_DISABLE, otherwise they will be cleaned up when KVM
+ terminates. KVM_PV_ASYNC_CLEANUP_PREPARE can be called again as soon
+ as cleanup starts, i.e. before KVM_PV_ASYNC_CLEANUP_PERFORM finishes.
+
+KVM_PV_ASYNC_CLEANUP_PERFORM
+ :Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
+
+ Tear down the protected VM previously prepared for teardown with
+ KVM_PV_ASYNC_CLEANUP_PREPARE. The resources that had been set aside
+ will be freed during the execution of this command. This PV command
+ should ideally be issued by userspace from a separate thread. If a
+ fatal signal is received (or the process terminates naturally), the
+ command will terminate immediately without completing, and the normal
+ KVM shutdown procedure will take care of cleaning up all remaining
+ protected VMs, including the ones whose teardown was interrupted by
+ process termination.
+
+4.126 KVM_XEN_HVM_SET_ATTR
--------------------------
:Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
@@ -5279,8 +5486,10 @@ have deterministic behavior.
union {
__u8 long_mode;
__u8 vector;
- struct {
+ __u8 runstate_update_flag;
+ union {
__u64 gfn;
+ __u64 hva;
} shared_info;
struct {
__u32 send_port;
@@ -5308,19 +5517,20 @@ type values:
KVM_XEN_ATTR_TYPE_LONG_MODE
Sets the ABI mode of the VM to 32-bit or 64-bit (long mode). This
- determines the layout of the shared info pages exposed to the VM.
+ determines the layout of the shared_info page exposed to the VM.
KVM_XEN_ATTR_TYPE_SHARED_INFO
- Sets the guest physical frame number at which the Xen "shared info"
+ Sets the guest physical frame number at which the Xen shared_info
page resides. Note that although Xen places vcpu_info for the first
32 vCPUs in the shared_info page, KVM does not automatically do so
- and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
- explicitly even when the vcpu_info for a given vCPU resides at the
- "default" location in the shared_info page. This is because KVM is
- not aware of the Xen CPU id which is used as the index into the
- vcpu_info[] array, so cannot know the correct default location.
-
- Note that the shared info page may be constantly written to by KVM;
+ and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or
+ KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA be used explicitly even when
+ the vcpu_info for a given vCPU resides at the "default" location
+ in the shared_info page. This is because KVM may not be aware of
+ the Xen CPU id which is used as the index into the vcpu_info[]
+ array, so may not know the correct default location.
+
+ Note that the shared_info page may be constantly written to by KVM;
it contains the event channel bitmap used to deliver interrupts to
a Xen guest, amongst other things. It is exempt from dirty tracking
mechanisms — KVM will not explicitly mark the page as dirty each
@@ -5329,23 +5539,41 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO
any vCPU has been running or any event channel interrupts can be
routed to the guest.
+ Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared_info
+ page.
+
+KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA
+ If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+ Xen capabilities, then this attribute may be used to set the
+ userspace address at which the shared_info page resides, which
+ will always be fixed in the VMM regardless of where it is mapped
+ in guest physical address space. This attribute should be used in
+ preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids
+ unnecessary invalidation of an internal cache when the page is
+ re-mapped in guest physical address space.
+
+ Setting the hva to zero will disable the shared_info page.
+
KVM_XEN_ATTR_TYPE_UPCALL_VECTOR
Sets the exception vector used to deliver Xen event channel upcalls.
This is the HVM-wide vector injected directly by the hypervisor
(not through the local APIC), typically configured by a guest via
- HVM_PARAM_CALLBACK_IRQ.
+ HVM_PARAM_CALLBACK_IRQ. This can be disabled again (e.g. for guest
+ SHUTDOWN_soft_reset) by setting it to zero.
KVM_XEN_ATTR_TYPE_EVTCHN
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It configures
an outbound port number for interception of EVTCHNOP_send requests
- from the guest. A given sending port number may be directed back
- to a specified vCPU (by APIC ID) / port / priority on the guest,
- or to trigger events on an eventfd. The vCPU and priority can be
- changed by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call,
- but other fields cannot change for a given sending port. A port
- mapping is removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags
- field.
+ from the guest. A given sending port number may be directed back to
+ a specified vCPU (by APIC ID) / port / priority on the guest, or to
+ trigger events on an eventfd. The vCPU and priority can be changed
+ by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call, but other
+ fields cannot change for a given sending port. A port mapping is
+ removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags field. Passing
+ KVM_XEN_EVTCHN_RESET in the flags field removes all interception of
+ outbound event channels. The values of the flags field are mutually
+ exclusive and cannot be combined as a bitmask.
KVM_XEN_ATTR_TYPE_XEN_VERSION
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
@@ -5356,6 +5584,14 @@ KVM_XEN_ATTR_TYPE_XEN_VERSION
event channel delivery, so responding within the kernel without
exiting to userspace is beneficial.
+KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG
+ This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+ support for KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG. It enables the
+ XEN_RUNSTATE_UPDATE flag which allows guest vCPUs to safely read
+ other vCPUs' vcpu_runstate_info. Xen guests enable this feature via
+ the VMASST_TYPE_runstate_update_flag of the HYPERVISOR_vm_assist
+ hypercall.
+
4.127 KVM_XEN_HVM_GET_ATTR
--------------------------
@@ -5411,15 +5647,33 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
As with the shared_info page for the VM, the corresponding page may be
dirtied at any time if event channel interrupt delivery is enabled, so
userspace should always assume that the page is dirty without relying
- on dirty logging.
+ on dirty logging. Setting the gpa to KVM_XEN_INVALID_GPA will disable
+ the vcpu_info.
+
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA
+ If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+ Xen capabilities, then this attribute may be used to set the
+ userspace address of the vcpu_info for a given vCPU. It should
+ only be used when the vcpu_info resides at the "default" location
+ in the shared_info page. In this case it is safe to assume the
+ userspace address will not change, because the shared_info page is
+ an overlay on guest memory and remains at a fixed host address
+ regardless of where it is mapped in guest physical address space
+ and hence unnecessary invalidation of an internal cache may be
+ avoided if the guest memory layout is modified.
+ If the vcpu_info does not reside at the "default" location then
+ it is not guaranteed to remain at the same host address and
+ hence the aforementioned cache invalidation is required.
KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
Sets the guest physical address of an additional pvclock structure
for a given vCPU. This is typically used for guest vsyscall support.
+ Setting the gpa to KVM_XEN_INVALID_GPA will disable the structure.
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR
Sets the guest physical address of the vcpu_runstate_info for a given
vCPU. This is how a Xen guest tracks CPU state such as steal time.
+ Setting the gpa to KVM_XEN_INVALID_GPA will disable the runstate area.
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT
Sets the runstate (RUNSTATE_running/_runnable/_blocked/_offline) of
@@ -5452,7 +5706,8 @@ KVM_XEN_VCPU_ATTR_TYPE_TIMER
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
support for KVM_XEN_HVM_CONFIG_EVTCHN_SEND features. It sets the
event channel port/priority for the VIRQ_TIMER of the vCPU, as well
- as allowing a pending timer to be saved/restored.
+ as allowing a pending timer to be saved/restored. Setting the timer
+ port to zero disables kernel handling of the singleshot timer.
KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
@@ -5460,7 +5715,8 @@ KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR
per-vCPU local APIC upcall vector, configured by a Xen guest with
the HVMOP_set_evtchn_upcall_vector hypercall. This is typically
used by Windows guests, and is distinct from the HVM-wide upcall
- vector configured with HVM_PARAM_CALLBACK_IRQ.
+ vector configured with HVM_PARAM_CALLBACK_IRQ. It is disabled by
+ setting the vector to zero.
4.129 KVM_XEN_VCPU_GET_ATTR
@@ -5499,7 +5755,8 @@ with the KVM_XEN_VCPU_GET_ATTR ioctl.
};
Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
-``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned.
+``length`` must not be bigger than 2^31 - PAGE_SIZE bytes. The ``addr``
field must point to a buffer which the tags will be copied to or from.
``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
@@ -5545,7 +5802,7 @@ flags values for ``kvm_sregs2``:
``KVM_SREGS2_FLAGS_PDPTRS_VALID``
- Indicates thats the struct contain valid PDPTR values.
+ Indicates that the struct contains valid PDPTR values.
4.132 KVM_SET_SREGS2
@@ -5811,6 +6068,290 @@ of CPUID leaf 0xD on the host.
This ioctl injects an event channel interrupt directly to the guest vCPU.
+4.136 KVM_S390_PV_CPU_COMMAND
+-----------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, < 0 on error
+
+This ioctl closely mirrors `KVM_S390_PV_COMMAND` but handles requests
+for vcpus. It re-uses the kvm_s390_pv_dmp struct and hence also shares
+the command ids.
+
+**command:**
+
+KVM_PV_DUMP
+ Presents an API that provides calls which facilitate dumping a vcpu
+ of a protected VM.
+
+**subcommand:**
+
+KVM_PV_DUMP_CPU
+ Provides encrypted dump data like register values.
+ The length of the returned data is provided by uv_info.guest_cpu_stor_len.
+
+4.137 KVM_S390_ZPCI_OP
+----------------------
+
+:Capability: KVM_CAP_S390_ZPCI_OP
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_zpci_op (in)
+:Returns: 0 on success, <0 on error
+
+Used to manage hardware-assisted virtualization features for zPCI devices.
+
+Parameters are specified via the following structure::
+
+ struct kvm_s390_zpci_op {
+ /* in */
+ __u32 fh; /* target device */
+ __u8 op; /* operation to perform */
+ __u8 pad[3];
+ union {
+ /* for KVM_S390_ZPCIOP_REG_AEN */
+ struct {
+ __u64 ibv; /* Guest addr of interrupt bit vector */
+ __u64 sb; /* Guest addr of summary bit */
+ __u32 flags;
+ __u32 noi; /* Number of interrupts */
+ __u8 isc; /* Guest interrupt subclass */
+ __u8 sbo; /* Offset of guest summary bit vector */
+ __u16 pad;
+ } reg_aen;
+ __u64 reserved[8];
+ } u;
+ };
+
+The type of operation is specified in the "op" field.
+KVM_S390_ZPCIOP_REG_AEN is used to register the VM for adapter event
+notification interpretation, which will allow firmware delivery of adapter
+events directly to the vm, with KVM providing a backup delivery mechanism;
+KVM_S390_ZPCIOP_DEREG_AEN is used to subsequently disable interpretation of
+adapter event notifications.
+
+The target zPCI function must also be specified via the "fh" field. For the
+KVM_S390_ZPCIOP_REG_AEN operation, additional information to establish firmware
+delivery must be provided via the "reg_aen" struct.
+
+The "pad" and "reserved" fields may be used for future extensions and should be
+set to 0s by userspace.
+
+4.138 KVM_ARM_SET_COUNTER_OFFSET
+--------------------------------
+
+:Capability: KVM_CAP_COUNTER_OFFSET
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_counter_offset (in)
+:Returns: 0 on success, < 0 on error
+
+This capability indicates that userspace is able to apply a single VM-wide
+offset to both the virtual and physical counters as viewed by the guest
+using the KVM_ARM_SET_COUNTER_OFFSET ioctl and the following data structure:
+
+::
+
+ struct kvm_arm_counter_offset {
+ __u64 counter_offset;
+ __u64 reserved;
+ };
+
+The offset describes a number of counter cycles that are subtracted from
+both virtual and physical counter views (similar to the effects of the
+CNTVOFF_EL2 and CNTPOFF_EL2 system registers, but only global). The offset
+always applies to all vcpus (already created or created after this ioctl)
+for this VM.
+
+It is userspace's responsibility to compute the offset based, for example,
+on previous values of the guest counters.
+
+Any value other than 0 for the "reserved" field may result in an error
+(-EINVAL) being returned. This ioctl can also return -EBUSY if any vcpu
+ioctl is issued concurrently.
+
+Note that using this ioctl results in KVM ignoring subsequent userspace
+writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
+interface. No error will be returned, but the resulting offset will not be
+applied.
+
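Since the offset is subtracted from the host counter, a VMM restoring a guest can derive it from the current host counter and the counter value the guest should observe. A minimal sketch; the struct is a local mirror of the one above and `counter_offset_for()` / `guest_counter_view()` are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct kvm_arm_counter_offset as documented above. */
struct counter_offset_sketch {
    uint64_t counter_offset;
    uint64_t reserved;   /* must stay 0 */
};

/*
 * The guest view is (host counter - counter_offset), so making the guest
 * observe guest_target while the host counter reads host_now requires
 * counter_offset = host_now - guest_target (unsigned, modulo 2^64).
 */
static struct counter_offset_sketch
counter_offset_for(uint64_t host_now, uint64_t guest_target)
{
    struct counter_offset_sketch o = { host_now - guest_target, 0 };
    return o;
}

/* What the guest would read for a given host counter value. */
static uint64_t guest_counter_view(uint64_t host, struct counter_offset_sketch o)
{
    return host - o.counter_offset;
}
```

The unsigned arithmetic also covers a guest target larger than the host counter; the offset then wraps modulo 2^64 and the subtraction still yields the target.
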
+.. _KVM_ARM_GET_REG_WRITABLE_MASKS:
+
+4.139 KVM_ARM_GET_REG_WRITABLE_MASKS
+-------------------------------------------
+
+:Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct reg_mask_range (in/out)
+:Returns: 0 on success, < 0 on error
+
+
+::
+
+ #define KVM_ARM_FEATURE_ID_RANGE 0
+ #define KVM_ARM_FEATURE_ID_RANGE_SIZE (3 * 8 * 8)
+
+ struct reg_mask_range {
+ __u64 addr; /* Pointer to mask array */
+ __u32 range; /* Requested range */
+ __u32 reserved[13];
+ };
+
+This ioctl copies the writable masks for a selected range of registers to
+userspace.
+
+The ``addr`` field is a pointer to the destination array where KVM copies
+the writable masks.
+
+The ``range`` field indicates the requested range of registers.
+``KVM_CHECK_EXTENSION`` for the ``KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES``
+capability returns the supported ranges, expressed as a set of flags. Each
+flag's bit index represents a possible value for the ``range`` field.
+All other values are reserved for future use and KVM may return an error.
+
+The ``reserved[13]`` array is reserved for future use and should be 0, or
+KVM may return an error.
+
+KVM_ARM_FEATURE_ID_RANGE (0)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Feature ID range is defined as the AArch64 System register space with
+op0==3, op1=={0, 1, 3}, CRn==0, CRm=={0-7}, op2=={0-7}.
+
+The returned mask array pointed to by ``addr`` is indexed by the macro
+``ARM64_FEATURE_ID_RANGE_IDX(op0, op1, crn, crm, op2)``, allowing userspace
+to know what fields can be changed for the system register described by
+``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a
+superset of the features supported by the system.
+
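To illustrate how a 3 * 8 * 8 entry array can cover the Feature ID range, the sketch below packs the three legal op1 values {0, 1, 3} into slots {0, 1, 2} and then appends CRm and op2. This packing is an assumption chosen to match the documented range size; the authoritative index is whatever the macro in the kernel's uapi headers computes, and `feature_id_range_idx()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_ID_RANGE_SIZE_SKETCH (3 * 8 * 8)

/*
 * Returns the array index for a register in the Feature ID range, or -1
 * if the encoding is outside op0==3, op1=={0,1,3}, CRn==0, CRm<=7, op2<=7.
 */
static int feature_id_range_idx(unsigned op0, unsigned op1, unsigned crn,
                                unsigned crm, unsigned op2)
{
    unsigned op1_slot;

    if (op0 != 3 || crn != 0 || crm > 7 || op2 > 7)
        return -1;
    if (op1 != 0 && op1 != 1 && op1 != 3)
        return -1;

    op1_slot = (op1 == 3) ? 2 : op1;   /* pack {0, 1, 3} into {0, 1, 2} */
    return (int)(op1_slot << 6 | crm << 3 | op2);
}
```

With this packing, a register encoded as op0=3, op1=0, CRn=0, CRm=4, op2=0 lands at index 32, and the last entry of the range is index 191.
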
+4.140 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION behave identically. Userspace can set
+KVM_MEM_GUEST_MEMFD in flags to have KVM bind the memory region to a given
+guest_memfd range of [guest_memfd_offset, guest_memfd_offset + memory_size).
+The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
+::
+
+ struct kvm_userspace_memory_region2 {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+ __u64 userspace_addr; /* start of the userspace allocated memory */
+ __u64 guest_memfd_offset;
+ __u32 guest_memfd;
+ __u32 pad1;
+ __u64 pad2[14];
+ };
+
+A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory). However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address. The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e. consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns. Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+
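A region that carries both a shared and a private backing might be filled in as follows. This is a sketch only: the struct is a local mirror of the layout above, the flag value `(1u << 2)` for KVM_MEM_GUEST_MEMFD is an assumption, and `make_guest_memfd_region()` is a hypothetical helper. The resulting structure would then be passed to the ioctl on the VM file descriptor.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_GUEST_MEMFD_SKETCH (1u << 2)  /* assumed value of KVM_MEM_GUEST_MEMFD */

/* Local mirror of struct kvm_userspace_memory_region2 as documented above. */
struct userspace_memory_region2_sketch {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;        /* bytes */
    uint64_t userspace_addr;     /* shared backing */
    uint64_t guest_memfd_offset; /* private backing: offset into guest_memfd */
    uint32_t guest_memfd;        /* private backing: file descriptor */
    uint32_t pad1;
    uint64_t pad2[14];
};

/* Build a region with both backings; all padding is left zeroed. */
static struct userspace_memory_region2_sketch
make_guest_memfd_region(uint32_t slot, uint64_t gpa, uint64_t size,
                        uint64_t hva, uint32_t gmem_fd, uint64_t gmem_offset)
{
    struct userspace_memory_region2_sketch r;

    memset(&r, 0, sizeof(r));
    r.slot = slot;
    r.flags = MEM_GUEST_MEMFD_SKETCH;
    r.guest_phys_addr = gpa;
    r.memory_size = size;
    r.userspace_addr = hva;
    r.guest_memfd_offset = gmem_offset;
    r.guest_memfd = gmem_fd;
    return r;
}
```
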
+4.141 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes (in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+ struct kvm_memory_attributes {
+ __u64 address;
+ __u64 size;
+ __u64 attributes;
+ __u64 flags;
+ };
+
+ #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
+The address and size must be page aligned. The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM. The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API. Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
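Because address and size must be page aligned and "flags" must be zero, a VMM may want to validate a request before issuing the ioctl. A minimal sketch with a local mirror of the struct above; `make_memory_attrs()` is a hypothetical helper and the page size is supplied by the caller:

```c
#include <assert.h>
#include <stdint.h>

#define MEMORY_ATTRIBUTE_PRIVATE_SKETCH (1ULL << 3)  /* as defined above */

/* Local mirror of struct kvm_memory_attributes as documented above. */
struct memory_attributes_sketch {
    uint64_t address;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
};

/*
 * Fills *out for converting [gpa, gpa + size) to private or shared.
 * Returns 0 on success, -1 if gpa or size is not page aligned.
 * page_size must be a power of two.
 */
static int make_memory_attrs(uint64_t gpa, uint64_t size, uint64_t page_size,
                             int make_private,
                             struct memory_attributes_sketch *out)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if ((gpa & (page_size - 1)) || (size & (page_size - 1)))
        return -1;

    out->address = gpa;
    out->size = size;
    out->attributes = make_private ? MEMORY_ATTRIBUTE_PRIVATE_SKETCH : 0;
    out->flags = 0;   /* reserved, must be 0 */
    return 0;
}
```

Since there is no "get" API, the caller would also record the new state of the range in its own bookkeeping after a successful ioctl.
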
+4.142 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd (in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it. guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped. Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+ struct kvm_create_guest_memfd {
+ __u64 size;
+ __u64 flags;
+ __u64 reserved[6];
+ };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory. This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance. For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
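The "at most one mapping per page" rule amounts to a non-overlap check on the ranges bound to one guest_memfd file. A VMM could track its bindings and test a candidate before calling KVM_SET_USER_MEMORY_REGION2. This is a sketch with hypothetical names; ranges are treated as half-open and are assumed not to wrap around 2^64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bound range of a guest_memfd file: [offset, offset + size). */
struct gmem_binding {
    uint64_t offset;
    uint64_t size;
};

static int gmem_ranges_overlap(struct gmem_binding a, struct gmem_binding b)
{
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

/* Returns 1 if cand overlaps none of the n existing bindings. */
static int gmem_can_bind(const struct gmem_binding *bound, size_t n,
                         struct gmem_binding cand)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (gmem_ranges_overlap(bound[i], cand))
            return 0;
    }
    return 1;
}
```
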
5. The kvm_run structure
========================
@@ -6000,15 +6541,40 @@ to the byte array.
__u64 nr;
__u64 args[6];
__u64 ret;
- __u32 longmode;
- __u32 pad;
+ __u64 flags;
} hypercall;
-Unused. This was once used for 'hypercall to userspace'. To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+
+It is strongly recommended that userspace use ``KVM_EXIT_IO`` (x86) or
+``KVM_EXIT_MMIO`` (all except s390) to implement functionality that
+requires a guest to interact with host userspace.
.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
+For arm64:
+----------
+
+SMCCC exits can be enabled depending on the configuration of the SMCCC
+filter. See the ``KVM_ARM_SMCCC_FILTER`` section of
+Documentation/virt/kvm/devices/vm.rst for more details.
+
+``nr`` contains the function ID of the guest's SMCCC call. Userspace is
+expected to use the ``KVM_GET_ONE_REG`` ioctl to retrieve the call
+parameters from the vCPU's GPRs.
+
+Definition of ``flags``:
+ - ``KVM_HYPERCALL_EXIT_SMC``: Indicates that the guest used the SMC
+ conduit to initiate the SMCCC call. If this bit is 0 then the guest
+ used the HVC conduit for the SMCCC call.
+
+ - ``KVM_HYPERCALL_EXIT_16BIT``: Indicates that the guest used a 16bit
+ instruction to initiate the SMCCC call. If this bit is 0 then the
+ guest used a 32bit instruction. An AArch64 guest always has this
+ bit set to 0.
+
+At the point of exit, PC points to the instruction immediately following
+the trapping instruction.
+
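Decoding the arm64 hypercall exit in userspace could look like the sketch below. The bit positions used for the two flags are assumptions made for illustration (take the real values from the kernel headers), and `decode_smccc_exit()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; the real definitions live in the uapi headers. */
#define HYPERCALL_EXIT_SMC_SKETCH   (1ULL << 0)
#define HYPERCALL_EXIT_16BIT_SKETCH (1ULL << 1)

struct smccc_exit_info {
    uint64_t function_id;  /* from hypercall.nr */
    int via_smc;           /* 1: SMC conduit, 0: HVC conduit */
    int insn_16bit;        /* 1: 16bit instruction, 0: 32bit instruction */
};

/* Translate the nr and flags fields of the hypercall exit into a summary. */
static struct smccc_exit_info decode_smccc_exit(uint64_t nr, uint64_t flags)
{
    struct smccc_exit_info info;

    info.function_id = nr;
    info.via_smc = (flags & HYPERCALL_EXIT_SMC_SKETCH) != 0;
    info.insn_16bit = (flags & HYPERCALL_EXIT_16BIT_SKETCH) != 0;
    return info;
}
```

The call arguments themselves are not part of the exit structure; they would be read from the vCPU's GPRs with ``KVM_GET_ONE_REG`` as described above.
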
::
/* KVM_EXIT_TPR_ACCESS */
@@ -6054,7 +6620,7 @@ s390 specific.
} s390_ucontrol;
s390 specific. A page fault has occurred for a user controlled virtual
-machine (KVM_VM_S390_UNCONTROL) on it's host page table that cannot be
+machine (KVM_VM_S390_UCONTROL) on its host page table that cannot be
resolved by the kernel.
The program code and the translation exception code that were placed
in the cpu's lowcore are presented here as defined by the z Architecture
@@ -6341,31 +6907,35 @@ if it decides to decode and emulate the instruction.
Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
-will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
+may instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.
-The "reason" field specifies why the MSR trap occurred. User space will only
-receive MSR exit traps when a particular reason was requested during through
+The "reason" field specifies why the MSR interception occurred. Userspace will
+only receive MSR exits when a particular reason was requested through
ENABLE_CAP. Currently valid exit reasons are:
- KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
- KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
- KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER
+============================ ========================================
+ KVM_MSR_EXIT_REASON_UNKNOWN access to MSR that is unknown to KVM
+ KVM_MSR_EXIT_REASON_INVAL access to invalid MSRs or reserved bits
+ KVM_MSR_EXIT_REASON_FILTER access blocked by KVM_X86_SET_MSR_FILTER
+============================ ========================================
-For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
-wants to read. To respond to this request with a successful read, user space
+For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
+wants to read. To respond to this request with a successful read, userspace
writes the respective data into the "data" field and must continue guest
execution to ensure the read data is transferred into guest register state.
-If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+If the RDMSR request was unsuccessful, userspace indicates that with a "1" in
the "error" field. This will inject a #GP into the guest when the VCPU is
executed again.
-For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
-wants to write. Once finished processing the event, user space must continue
-vCPU execution. If the MSR write was unsuccessful, user space also sets the
+For KVM_EXIT_X86_WRMSR, the "index" field tells userspace which MSR the guest
+wants to write. Once finished processing the event, userspace must continue
+vCPU execution. If the MSR write was unsuccessful, userspace also sets the
"error" field to "1".
+See KVM_X86_SET_MSR_FILTER for details on the interaction with MSR filtering.
+
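The RDMSR/WRMSR protocol described above can be sketched as two completion helpers that userspace runs before re-entering the vCPU. The struct is a local stand-in for the "msr" member of kvm_run (layout assumed), and the MSR index `HYPOTHETICAL_SCRATCH_MSR` is invented purely for the example.

```c
#include <assert.h>
#include <stdint.h>

#define HYPOTHETICAL_SCRATCH_MSR 0x4b564dffu  /* made-up index for illustration */

/* Local stand-in for the msr exit data (layout assumed). */
struct msr_exit_sketch {
    uint8_t  error;   /* userspace -> kernel: 1 injects #GP on re-entry */
    uint8_t  pad[7];
    uint32_t reason;  /* kernel -> userspace */
    uint32_t index;   /* kernel -> userspace */
    uint64_t data;    /* read: userspace fills in; write: value from guest */
};

/* KVM_EXIT_X86_RDMSR: supply data for a known MSR, otherwise flag an error. */
static void complete_rdmsr(struct msr_exit_sketch *m, uint64_t scratch_value)
{
    if (m->index == HYPOTHETICAL_SCRATCH_MSR) {
        m->data = scratch_value;
        m->error = 0;
    } else {
        m->error = 1;
    }
}

/* KVM_EXIT_X86_WRMSR: store the guest's value, or flag an error. */
static void complete_wrmsr(struct msr_exit_sketch *m, uint64_t *scratch_value)
{
    if (m->index == HYPOTHETICAL_SCRATCH_MSR) {
        *scratch_value = m->data;
        m->error = 0;
    } else {
        m->error = 1;
    }
}
```

In both cases the result only takes effect once the vCPU is run again, which is when KVM either transfers the data into guest register state or injects the #GP.
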
::
@@ -6416,6 +6986,50 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
::
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+ __u64 flags;
+ __u64 gpa;
+ __u64 size;
+ } memory_fault;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+ on a private memory access. When clear, indicates the fault occurred on a
+ shared access.
+
+Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
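
The validity rule above can be sketched as a small predicate. This is only an illustration: the structure is a hypothetical local mirror of kvm_run.memory_fault, the helper names are invented, and the caller is assumed to pass the KVM_EXIT_MEMORY_FAULT constant from the KVM headers as `memory_fault_reason`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback for non-Linux builds; value is an assumption */
#endif

/* Same bit as KVM_MEMORY_EXIT_FLAG_PRIVATE in the structure above. */
#define SKETCH_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)

/* Hypothetical local mirror of kvm_run.memory_fault. */
struct memory_fault_sketch {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/*
 * memory_fault is meaningful only if KVM_RUN failed with EFAULT or
 * EHWPOISON *and* exit_reason is KVM_EXIT_MEMORY_FAULT.
 */
bool memory_fault_info_valid(int run_ret, int run_errno,
			     uint32_t exit_reason,
			     uint32_t memory_fault_reason)
{
	if (run_ret != -1)
		return false;
	if (run_errno != EFAULT && run_errno != EHWPOISON)
		return false;
	return exit_reason == memory_fault_reason;
}

/* True if the faulting access targeted private memory. */
bool memory_fault_is_private(const struct memory_fault_sketch *mf)
{
	return (mf->flags & SKETCH_MEMORY_EXIT_FLAG_PRIVATE) != 0;
}

/* Exclusive end of the faulting range [gpa, gpa + size). */
uint64_t memory_fault_end(const struct memory_fault_sketch *mf)
{
	return mf->gpa + mf->size;
}
```

For any other errno, the predicate returns false and exit_reason should not be interpreted at all.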
+
+::
+
+ /* KVM_EXIT_NOTIFY */
+ struct {
+ #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
+ __u32 flags;
+ } notify;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_NOTIFY_VMEXIT is
+enabled, a VM exit is generated if no event window occurs in VM non-root mode
+for a specified amount of time. If KVM_X86_NOTIFY_VMEXIT_USER was set when
+enabling the cap, KVM exits to userspace with the exit reason
+KVM_EXIT_NOTIFY for further handling. The "flags" field contains more
+detailed info.
+
+The valid value for 'flags' is:
+
+ - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
+ in VMCS. Resuming the target VM would lead to an unknown result.
+
+::
+
/* Fix the size of the union. */
char padding[256];
};
@@ -6446,11 +7060,6 @@ Please note that the kernel is allowed to use the kvm_run structure as the
primary storage for certain register types. Therefore, the kernel may use the
values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
-::
-
- };
-
-
6. Capabilities that can be enabled on vCPUs
============================================
@@ -7029,6 +7638,7 @@ and injected exceptions.
will clear DR6.RTM.
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
+--------------------------------------
:Architectures: x86, arm64, mips
:Parameters: args[0] whether feature should be enabled or not
@@ -7094,14 +7704,13 @@ veto the transition.
:Parameters: args[0] is the maximum poll time in nanoseconds
:Returns: 0 on success; -1 on error
-This capability overrides the kvm module parameter halt_poll_ns for the
-target VM.
+KVM_CAP_HALT_POLL overrides the kvm.halt_poll_ns module parameter to set the
+maximum halt-polling time for all vCPUs in the target VM. This capability can
+be invoked at any time and any number of times to dynamically change the
+maximum halt-polling time.
-VCPU polling allows a VCPU to poll for wakeup events instead of immediately
-scheduling during guest halts. The maximum time a VCPU can spend polling is
-controlled by the kvm module parameter halt_poll_ns. This capability allows
-the maximum halt time to specified on a per-VM basis, effectively overriding
-the module parameter for the target VM.
+See Documentation/virt/kvm/halt-polling.rst for more information on halt
+polling.
7.21 KVM_CAP_X86_USER_SPACE_MSR
-------------------------------
@@ -7111,19 +7720,29 @@ the module parameter for the target VM.
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error
-This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
-into user space.
+This capability allows userspace to intercept RDMSR and WRMSR instructions if
+access to an MSR is denied. By default, KVM injects #GP on denied accesses.
When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
CPU type.
-To allow more fine grained control over MSR handling, user space may enable
+To allow more fine grained control over MSR handling, userspace may enable
this capability. With it enabled, MSR accesses that match the mask specified in
-args[0] and trigger a #GP event inside the guest by KVM will instead trigger
-KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
-can then handle to implement model specific MSR handling and/or user notifications
-to inform a user that an MSR was not handled.
+args[0] and would trigger a #GP inside the guest will instead trigger
+KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications. Userspace
+can then implement model specific MSR handling and/or user notifications
+to inform a user that an MSR was not emulated/virtualized by KVM.
+
+The valid mask flags are:
+
+============================ ===============================================
+ KVM_MSR_EXIT_REASON_UNKNOWN intercept accesses to unknown (to KVM) MSRs
+ KVM_MSR_EXIT_REASON_INVAL intercept accesses that are architecturally
+ invalid according to the vCPU model and/or mode
+ KVM_MSR_EXIT_REASON_FILTER intercept accesses that are denied by userspace
+ via KVM_X86_SET_MSR_FILTER
+============================ ===============================================
7.22 KVM_CAP_X86_BUS_LOCK_EXIT
-------------------------------
@@ -7199,7 +7818,7 @@ APIC/MSRs/etc).
attribute is not supported by KVM.
KVM_CAP_SGX_ATTRIBUTE enables a userspace VMM to grant a VM access to one or
-more priveleged enclave attributes. args[0] must hold a file handle to a valid
+more privileged enclave attributes. args[0] must hold a file handle to a valid
SGX attribute file corresponding to an attribute that is supported/restricted
by KVM (currently only PROVISIONKEY).
@@ -7210,7 +7829,7 @@ system fingerprint. To prevent userspace from circumventing such restrictions
by running an enclave in a VM, KVM prevents access to privileged attributes by
default.
-See Documentation/x86/sgx.rst for more details.
+See Documentation/arch/x86/sgx.rst for more details.
7.26 KVM_CAP_PPC_RPT_INVALIDATE
-------------------------------
@@ -7266,8 +7885,9 @@ hibernation of the host; however the VMM needs to manually save/restore the
tags as appropriate if the VM is migrated.
When this capability is enabled all memory in memslots must be mapped as
-not-shareable (no MAP_SHARED), attempts to create a memslot with a
-MAP_SHARED mmap will result in an -EINVAL return.
+``MAP_ANONYMOUS`` or with a RAM-based file mapping (``tmpfs``, ``memfd``),
+attempts to create a memslot with an invalid mmap will result in an
+-EINVAL return.
When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
perform a bulk copy of tags to/from the guest.
@@ -7357,8 +7977,92 @@ The valid bits in cap.args[0] are:
hypercall instructions. Executing the
incorrect hypercall instruction will
generate a #UD within the guest.
+
+KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
+ they are intercepted) as NOPs regardless of
+ whether or not MONITOR/MWAIT are supported
+ according to guest CPUID. When this quirk
+ is disabled and KVM_X86_DISABLE_EXITS_MWAIT
+ is not set (MONITOR/MWAIT are intercepted),
+ KVM will inject a #UD on MONITOR/MWAIT if
+ they're unsupported per guest CPUID. Note,
+ KVM will modify MONITOR/MWAIT support in
+ guest CPUID on writes to MISC_ENABLE if
+ KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
+ disabled.
=================================== ============================================
+7.32 KVM_CAP_MAX_VCPU_ID
+------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] - maximum APIC ID value set for current VM
+:Returns: 0 on success, -EINVAL if args[0] is beyond KVM_MAX_VCPU_IDS
+ supported in KVM or if it has been set.
+
+This capability allows userspace to specify the maximum possible APIC ID
+assigned for the current VM session prior to the creation of vCPUs, saving
+memory for data structures indexed by the APIC ID. Userspace is able
+to calculate the limit to APIC ID values from the designated
+CPU topology.
+
+The value can be changed only until KVM_ENABLE_CAP is set to a nonzero
+value or until a vCPU is created. Upon creation of the first vCPU,
+if the value was set to zero or KVM_ENABLE_CAP was not invoked, KVM
+uses the return value of KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPU_ID) as
+the maximum APIC ID.
+
+7.33 KVM_CAP_X86_NOTIFY_VMEXIT
+------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the value of notify window as well as some flags
+:Returns: 0 on success, -EINVAL if args[0] contains invalid flags or notify
+ VM exit is unsupported.
+
+Bits 63:32 of args[0] are used for notify window.
+Bits 31:0 of args[0] are for some flags. Valid bits are::
+
+ #define KVM_X86_NOTIFY_VMEXIT_ENABLED (1 << 0)
+ #define KVM_X86_NOTIFY_VMEXIT_USER (1 << 1)
+
+This capability allows userspace to configure the notify VM exit on/off
+in per-VM scope during VM creation. Notify VM exit is disabled by default.
+When userspace sets the KVM_X86_NOTIFY_VMEXIT_ENABLED bit in args[0], KVM will
+enable this feature with the notify window provided, which will generate
+a VM exit if no event window occurs in VM non-root mode for a specified amount
+of time (notify window).
+
+If KVM_X86_NOTIFY_VMEXIT_USER is set in args[0], KVM exits to userspace for
+handling whenever a notify VM exit occurs.
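
The args[0] layout described above (notify window in bits 63:32, flags in bits 31:0) can be sketched as follows. The helper names are invented for this example, and the flag values simply repeat the definitions quoted above.

```c
#include <stdint.h>

#ifndef KVM_X86_NOTIFY_VMEXIT_ENABLED
#define KVM_X86_NOTIFY_VMEXIT_ENABLED (1 << 0)
#endif
#ifndef KVM_X86_NOTIFY_VMEXIT_USER
#define KVM_X86_NOTIFY_VMEXIT_USER    (1 << 1)
#endif

/* Pack the notify window (bits 63:32) and the flags (bits 31:0). */
uint64_t notify_vmexit_pack(uint32_t window, uint32_t flags)
{
	return ((uint64_t)window << 32) | flags;
}

/* Extract the notify window from a packed args[0] value. */
uint32_t notify_vmexit_window(uint64_t args0)
{
	return (uint32_t)(args0 >> 32);
}

/* Extract the flags from a packed args[0] value. */
uint32_t notify_vmexit_flags(uint64_t args0)
{
	return (uint32_t)(args0 & 0xffffffffu);
}
```

A VMM would place the packed value in args[0] of struct kvm_enable_cap before issuing KVM_ENABLE_CAP on the VM file descriptor.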
+
+This capability is intended to mitigate the threat of a malicious VM causing
+the CPU to get stuck (because event windows never open up), which would make
+the CPU unavailable to the host or other VMs.
+
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
8. Other capabilities.
======================
@@ -7553,7 +8257,7 @@ writing to the respective MSRs.
This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr. Its
value is used to denote the target vcpu for a SynIC interrupt. For
-compatibilty, KVM initializes this msr to KVM's internal vcpu index. When this
+compatibility, KVM initializes this msr to KVM's internal vcpu index. When this
capability is absent, userspace can still query this msr's value.
8.13 KVM_CAP_S390_AIS_MIGRATION
@@ -7720,7 +8424,7 @@ KVM_EXIT_X86_WRMSR exit notifications.
This capability indicates that KVM supports that accesses to user defined MSRs
may be rejected. With this capability exposed, KVM exports new VM ioctl
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
-ranges that KVM should reject access to.
+ranges that KVM should deny access to.
In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
@@ -7736,17 +8440,17 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
-8.29 KVM_CAP_DIRTY_LOG_RING
----------------------------
+8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+----------------------------------------------------------
-:Architectures: x86
+:Architectures: x86, arm64
:Parameters: args[0] - size of the dirty log ring
KVM is capable of tracking dirty memory using ring buffers that are
-mmaped into userspace; there is one dirty ring per vcpu.
+mmapped into userspace; there is one dirty ring per vcpu.
The dirty ring is available to userspace as an array of
-``struct kvm_dirty_gfn``. Each dirty entry it's defined as::
+``struct kvm_dirty_gfn``. Each dirty entry is defined as::
struct kvm_dirty_gfn {
__u32 flags;
@@ -7785,7 +8489,7 @@ state machine for the entry is as follows::
| |
+------------------------------------------+
-To harvest the dirty pages, userspace accesses the mmaped ring buffer
+To harvest the dirty pages, userspace accesses the mmapped ring buffer
to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
@@ -7795,6 +8499,11 @@ on to the next GFN. The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.
+Note that on weakly ordered architectures, userspace accesses to the
+ring buffer (and more specifically the 'flags' field) must be ordered,
+using load-acquire/store-release accessors when available, or any
+other memory barrier that will ensure this ordering.
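
A minimal sketch of an ordered harvesting loop using C11 atomics is shown below. It assumes a local mirror of struct kvm_dirty_gfn whose 'flags' field is declared atomic, and assumes the DIRTY and RESET flags occupy bits 0 and 1; the structure, macro and function names are hypothetical and not part of the KVM headers.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions for the DIRTY and RESET (harvested) flags. */
#define SKETCH_GFN_F_DIRTY (1u << 0)
#define SKETCH_GFN_F_RESET (1u << 1)

/* Hypothetical mirror of struct kvm_dirty_gfn with atomic flags. */
struct dirty_gfn_sketch {
	_Atomic uint32_t flags;
	uint32_t slot;
	uint64_t offset;
};

/*
 * Harvest dirty entries in sequence, starting at *next, until an entry
 * without the DIRTY bit is found or 'out' is full. Returns the number of
 * offsets written to 'out' and advances *next accordingly.
 */
size_t harvest_dirty_ring(struct dirty_gfn_sketch *ring, size_t ring_entries,
			  size_t *next, uint64_t *out, size_t out_cap)
{
	size_t harvested = 0;

	while (harvested < out_cap) {
		struct dirty_gfn_sketch *e = &ring[*next % ring_entries];
		/* Load-acquire: slot/offset are read only after seeing DIRTY. */
		uint32_t flags = atomic_load_explicit(&e->flags,
						      memory_order_acquire);

		if (!(flags & SKETCH_GFN_F_DIRTY))
			break;

		out[harvested++] = e->offset;

		/* Store-release: publish the harvested state after the reads. */
		atomic_store_explicit(&e->flags, SKETCH_GFN_F_RESET,
				      memory_order_release);
		(*next)++;
	}
	return harvested;
}
```

The sketch only records the offset of each entry; a real VMM would also use the slot field and would later ask KVM to recycle the harvested entries.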
+
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
@@ -7816,12 +8525,44 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
needs to kick the vcpu out of KVM_RUN using a signal. The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
+NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
+should be exposed by weakly ordered architectures, in order to indicate
+the additional memory ordering requirements imposed on userspace when
+reading the state of an entry and mutating it from DIRTY to HARVESTED.
+Architectures with TSO-like ordering (such as x86) are allowed to
+expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+to userspace.
+
+After enabling the dirty rings, the userspace needs to detect the
+capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the
+ring structures can be backed by per-slot bitmaps. With this capability
+advertised, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
+can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+hasn't been enabled, or if any memslot already exists.
+
+Note that the bitmap here is only a backup of the ring structure. The
+use of the ring and bitmap combination is only beneficial if there is
+only a very small amount of memory that is dirtied out of vcpu/ring
+context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
+be considered.
+
+To collect dirty bits in the backup bitmap, userspace can use the same
+KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
+the generation of the dirty bits is done in a single pass. Collecting
+the dirty bitmap should be the very last thing that the VMM does before
+considering the state as complete. VMM needs to ensure that the dirty
+state is final and avoid missing dirty pages from another ioctl ordered
+after the bitmap collection.
+
+NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its
+tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on
+KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through
+command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device
+"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save
+vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES}
+command on KVM device "kvm-arm-vgic-v3".
8.30 KVM_CAP_XEN_HVM
--------------------
@@ -7831,12 +8572,14 @@ KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
This capability indicates the features that Xen supports for hosting Xen
PVHVM guests. Valid flags are::
- #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
- #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
- #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
- #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
- #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
- #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
+ #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
+ #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
+ #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
+ #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
+ #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
+ #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
+ #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
+ #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7)
The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
ioctl is available, for the guest to set its hypercall page.
@@ -7868,6 +8611,23 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes.
related to event channel delivery, timers, and the XENVER_version
interception.
+The KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG flag indicates that KVM supports
+the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute in the KVM_XEN_SET_ATTR
+and KVM_XEN_GET_ATTR ioctls. This controls whether KVM will set the
+XEN_RUNSTATE_UPDATE flag in guest memory mapped vcpu_runstate_info during
+updates of the runstate information. Note that versions of KVM which support
+the RUNSTATE feature above, but not the RUNSTATE_UPDATE_FLAG feature, will
+always set the XEN_RUNSTATE_UPDATE flag when updating the guest structure,
+which is perhaps counterintuitive. When this flag is advertised, KVM will
+behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
+specifically enabled (by the guest making the hypercall, causing the VMM
+to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).
+
+The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports
+clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be
+done when the KVM_CAP_XEN_HVM ioctl sets the
+KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.
+
8.31 KVM_CAP_PPC_MULTITCE
-------------------------
@@ -7910,7 +8670,7 @@ Architectures: x86
When enabled, KVM will disable emulated Hyper-V features provided to the
guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all
-currently implmented Hyper-V features are provided unconditionally when
+currently implemented Hyper-V features are provided unconditionally when
Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
leaf.
@@ -7936,11 +8696,11 @@ ENOSYS for the others.
8.35 KVM_CAP_PMU_CAPABILITY
---------------------------
-:Capability KVM_CAP_PMU_CAPABILITY
+:Capability: KVM_CAP_PMU_CAPABILITY
:Architectures: x86
:Type: vm
:Parameters: arg[0] is bitmask of PMU virtualization capabilities.
-:Returns 0 on success, -EINVAL when arg[0] contains invalid bits
+:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits
This capability alters PMU virtualization in KVM.
@@ -7965,6 +8725,106 @@ should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
+8.37 KVM_CAP_S390_PROTECTED_DUMP
+--------------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM and the Ultravisor support dumping
+PV guests. The `KVM_PV_DUMP` command is available for the
+`KVM_S390_PV_COMMAND` ioctl and the `KVM_PV_INFO` command provides
+dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
+available and supports the `KVM_PV_DUMP_CPU` subcommand.
+
+8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+-------------------------------------
+
+:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+:Architectures: x86
+:Type: vm
+:Parameters: arg[0] must be 0.
+:Returns: 0 on success, -EPERM if the userspace process does not
+ have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
+ created.
+
+This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
+
+The capability has no effect if the nx_huge_pages module parameter is not set.
+
+This capability may only be set before any vCPUs are created.
+
+8.39 KVM_CAP_S390_CPU_TOPOLOGY
+------------------------------
+
+:Capability: KVM_CAP_S390_CPU_TOPOLOGY
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM will provide the S390 CPU Topology
+facility which consists of the interpretation of the PTF instruction for
+the function code 2 along with interception and forwarding of both the
+PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
+instruction to the userland hypervisor.
+
+The stfle facility 11, CPU Topology facility, should not be indicated
+to the guest without this capability.
+
+When this capability is present, KVM provides a new attribute group
+on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
+This new attribute allows userspace to get, set or clear the Modified Change
+Topology Report (MTCR) bit of the SCA through the kvm_device_attr
+structure.
+
+When getting the Modified Change Topology Report value, the attr->addr
+must point to a byte where the value will be stored or retrieved from.
+
+8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+---------------------------------------
+
+:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+:Architectures: arm64
+:Type: vm
+:Parameters: arg[0] is the new split chunk size.
+:Returns: 0 on success, -EINVAL if any memslot was already created.
+
+This capability sets the chunk size used in Eager Page Splitting.
+
+Eager Page Splitting improves the performance of dirty-logging (used
+in live migrations) when guest memory is backed by huge-pages. It
+avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
+it eagerly when enabling dirty logging (with the
+KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
+KVM_CLEAR_DIRTY_LOG.
+
+The chunk size specifies how many pages to break at a time, using a
+single allocation for each chunk. The bigger the chunk size, the more pages
+need to be allocated ahead of time.
+
+The chunk size needs to be a valid block size. The list of acceptable
+block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
+64-bit bitmap (each bit describing a block size). The default value is
+0, to disable the eager page splitting.
+
+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types. The 1-setting of bit @n
+means the VM type with value @n is supported. Possible values of @n are::
+
+ #define KVM_X86_DEFAULT_VM 0
+ #define KVM_X86_SW_PROTECTED_VM 1
+
+Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
+Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
+production. The behavior and effective ABI for software-protected VMs is
+unstable.
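
Interpreting the returned bitmap can be sketched as below. The helper name is invented for this example, and the bitmap is assumed to be the value returned by KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES); the type values repeat the definitions quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef KVM_X86_DEFAULT_VM
#define KVM_X86_DEFAULT_VM      0
#endif
#ifndef KVM_X86_SW_PROTECTED_VM
#define KVM_X86_SW_PROTECTED_VM 1
#endif

/* Bit @n set in the bitmap means the VM type with value @n is supported. */
bool vm_type_supported(uint64_t vm_types_bitmap, unsigned int type)
{
	return type < 64 && ((vm_types_bitmap >> type) & 1);
}
```

A VMM would typically perform this check before passing the type to KVM_CREATE_VM.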
+
9. Known KVM API problems
=========================
@@ -7999,6 +8859,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^