aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/Documentation/virt/kvm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm')
-rw-r--r--Documentation/virt/kvm/amd-memory-encryption.rst2
-rw-r--r--Documentation/virt/kvm/api.rst53
-rw-r--r--Documentation/virt/kvm/arm/pvtime.rst2
-rw-r--r--Documentation/virt/kvm/cpuid.rst8
-rw-r--r--Documentation/virt/kvm/devices/vcpu.rst2
-rw-r--r--Documentation/virt/kvm/hypercalls.rst4
-rw-r--r--Documentation/virt/kvm/index.rst2
-rw-r--r--Documentation/virt/kvm/mmu.rst2
-rw-r--r--Documentation/virt/kvm/msr.rst119
-rw-r--r--Documentation/virt/kvm/nested-vmx.rst5
-rw-r--r--Documentation/virt/kvm/review-checklist.rst2
-rw-r--r--Documentation/virt/kvm/running-nested-guests.rst276
12 files changed, 428 insertions, 49 deletions
diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index c3129b9ba5cb..57c01f531e61 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -74,7 +74,7 @@ should point to a file descriptor that is opened on the ``/dev/sev``
device, if needed (see individual commands).
On output, ``error`` is zero on success, or an error code. Error codes
-are defined in ``<linux/psp-dev.h>`.
+are defined in ``<linux/psp-dev.h>``.
KVM implements the following commands to support common lifecycle events of SEV
guests, such as launching, running, snapshotting, migrating and decommissioning.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..426f94582b7a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -2572,13 +2572,15 @@ list in 4.68.
:Parameters: None
:Returns: 0 on success, -1 on error
-This signals to the host kernel that the specified guest is being paused by
-userspace. The host will set a flag in the pvclock structure that is checked
-from the soft lockup watchdog. The flag is part of the pvclock structure that
-is shared between guest and host, specifically the second bit of the flags
+This ioctl sets a flag accessible to the guest indicating that the specified
+vCPU has been paused by the host userspace.
+
+The host will set a flag in the pvclock structure that is checked from the
+soft lockup watchdog. The flag is part of the pvclock structure that is
+shared between guest and host, specifically the second bit of the flags
field of the pvclock_vcpu_time_info structure. It will be set exclusively by
the host and read/cleared exclusively by the guest. The guest operation of
-checking and clearing the flag must an atomic operation so
+checking and clearing the flag must be an atomic operation so
load-link/store-conditional, or equivalent must be used. There are two cases
where the guest will clear the flag: when the soft lockup watchdog timer resets
itself or when a soft lockup is detected. This ioctl can be called any time
@@ -4334,9 +4336,13 @@ Errors:
#define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE 0x00000001
#define KVM_STATE_NESTED_VMX_SMM_VMXON 0x00000002
+#define KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE 0x00000001
+
struct kvm_vmx_nested_state_hdr {
+ __u32 flags;
__u64 vmxon_pa;
__u64 vmcs12_pa;
+ __u64 preemption_timer_deadline;
struct {
__u16 flags;
@@ -5066,10 +5072,13 @@ EOI was received.
struct kvm_hyperv_exit {
#define KVM_EXIT_HYPERV_SYNIC 1
#define KVM_EXIT_HYPERV_HCALL 2
+ #define KVM_EXIT_HYPERV_SYNDBG 3
__u32 type;
+ __u32 pad1;
union {
struct {
__u32 msr;
+ __u32 pad2;
__u64 control;
__u64 evt_page;
__u64 msg_page;
@@ -5079,6 +5088,15 @@ EOI was received.
__u64 result;
__u64 params[2];
} hcall;
+ struct {
+ __u32 msr;
+ __u32 pad2;
+ __u64 control;
+ __u64 status;
+ __u64 send_page;
+ __u64 recv_page;
+ __u64 pending_page;
+ } syndbg;
} u;
};
/* KVM_EXIT_HYPERV */
@@ -5095,6 +5113,12 @@ Hyper-V SynIC state change. Notification is used to remap SynIC
event/message pages and to enable/disable SynIC messages/events processing
in userspace.
+ - KVM_EXIT_HYPERV_SYNDBG -- synchronously notify user-space about
+
+Hyper-V Synthetic debugger state change. Notification is used to either update
+the pending_page location or to send a control command (send the buffer located
+in send_page or recv a buffer to recv_page).
+
::
/* KVM_EXIT_ARM_NISV */
@@ -5777,7 +5801,7 @@ will be initialized to 1 when created. This also improves performance because
dirty logging can be enabled gradually in small chunks on the first call
to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
-x86 for now).
+x86 and arm64 for now).
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
@@ -5802,6 +5826,23 @@ If present, this capability can be enabled for a VM, meaning that KVM
will allow the transition to secure guest mode. Otherwise KVM will
veto the transition.
+7.20 KVM_CAP_HALT_POLL
+----------------------
+
+:Architectures: all
+:Target: VM
+:Parameters: args[0] is the maximum poll time in nanoseconds
+:Returns: 0 on success; -1 on error
+
+This capability overrides the kvm module parameter halt_poll_ns for the
+target VM.
+
+VCPU polling allows a VCPU to poll for wakeup events instead of immediately
+scheduling during guest halts. The maximum time a VCPU can spend polling is
+controlled by the kvm module parameter halt_poll_ns. This capability allows
+the maximum halt time to specified on a per-VM basis, effectively overriding
+the module parameter for the target VM.
+
8. Other capabilities.
======================
diff --git a/Documentation/virt/kvm/arm/pvtime.rst b/Documentation/virt/kvm/arm/pvtime.rst
index 2357dd2d8655..687b60d76ca9 100644
--- a/Documentation/virt/kvm/arm/pvtime.rst
+++ b/Documentation/virt/kvm/arm/pvtime.rst
@@ -76,5 +76,5 @@ It is advisable that one or more 64k pages are set aside for the purpose of
these structures and not used for other purposes, this enables the guest to map
the region using 64k pages and avoids conflicting attributes with other memory.
-For the user space interface see Documentation/virt/kvm/devices/vcpu.txt
+For the user space interface see Documentation/virt/kvm/devices/vcpu.rst
section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL".
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index 01b081f6e7ea..a7dff9186bed 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -50,8 +50,8 @@ KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
KVM_FEATURE_MMU_OP 2 deprecated
KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
-
0x4b564d00 and 0x4b564d01
+
KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
writing to msr 0x4b564d02
@@ -86,6 +86,12 @@ KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
before using paravirtualized
sched yield.
+KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
+ before using the second async
+ pf control msr 0x4b564d06 and
+ async pf acknowledgment msr
+ 0x4b564d07.
+
KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side
per-cpu warps are expeced in
kvmclock
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 9963e680770a..ca374d3fe085 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -110,5 +110,5 @@ Returns:
Specifies the base address of the stolen time structure for this VCPU. The
base address must be 64 byte aligned and exist within a valid guest memory
-region. See Documentation/virt/kvm/arm/pvtime.txt for more information
+region. See Documentation/virt/kvm/arm/pvtime.rst for more information
including the layout of the stolen time structure.
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
index dbaf207e560d..ed4fddd364ea 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -22,7 +22,7 @@ S390:
number in R1.
For further information on the S390 diagnose call as supported by KVM,
- refer to Documentation/virt/kvm/s390-diag.txt.
+ refer to Documentation/virt/kvm/s390-diag.rst.
PowerPC:
It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
@@ -30,7 +30,7 @@ PowerPC:
KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions'
property inside the device tree's /hypervisor node.
- For more information refer to Documentation/virt/kvm/ppc-pv.txt
+ For more information refer to Documentation/virt/kvm/ppc-pv.rst
MIPS:
KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index dcc252634cf9..b6833c7bb474 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -28,3 +28,5 @@ KVM
arm/index
devices/index
+
+ running-nested-guests
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 60981887d20b..46126ecc70f7 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -319,7 +319,7 @@ Handling a page fault is performed as follows:
- If both P bit and R/W bit of error code are set, this could possibly
be handled as a "fast page fault" (fixed without taking the MMU lock). See
- the description in Documentation/virt/kvm/locking.txt.
+ the description in Documentation/virt/kvm/locking.rst.
- if needed, walk the guest page tables to determine the guest translation
(gva->gpa or ngpa->gpa)
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 33892036672d..e37a14c323d2 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -190,41 +190,72 @@ MSR_KVM_ASYNC_PF_EN:
0x4b564d02
data:
- Bits 63-6 hold 64-byte aligned physical address of a
- 64 byte memory area which must be in guest RAM and must be
- zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
- when asynchronous page faults are enabled on the vcpu 0 when
- disabled. Bit 1 is 1 if asynchronous page faults can be injected
- when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
- are delivered to L1 as #PF vmexits. Bit 2 can be set only if
- KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
-
- First 4 byte of 64 byte memory location will be written to by
- the hypervisor at the time of asynchronous page fault (APF)
- injection to indicate type of asynchronous page fault. Value
- of 1 means that the page referred to by the page fault is not
- present. Value 2 means that the page is now available. Disabling
- interrupt inhibits APFs. Guest must not enable interrupt
- before the reason is read, or it may be overwritten by another
- APF. Since APF uses the same exception vector as regular page
- fault guest must reset the reason to 0 before it does
- something that can generate normal page fault. If during page
- fault APF reason is 0 it means that this is regular page
- fault.
-
- During delivery of type 1 APF cr2 contains a token that will
- be used to notify a guest when missing page becomes
- available. When page becomes available type 2 APF is sent with
- cr2 set to the token associated with the page. There is special
- kind of token 0xffffffff which tells vcpu that it should wake
- up all processes waiting for APFs and no individual type 2 APFs
- will be sent.
+ Asynchronous page fault (APF) control MSR.
+
+ Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+ which must be in guest RAM and must be zeroed. This memory is expected
+ to hold a copy of the following structure::
+
+ struct kvm_vcpu_pv_apf_data {
+ /* Used for 'page not present' events delivered via #PF */
+ __u32 flags;
+
+ /* Used for 'page ready' events delivered via interrupt notification */
+ __u32 token;
+
+ __u8 pad[56];
+ __u32 enabled;
+ };
+
+ Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+ when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+ Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+ cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+ #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+ present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
+ events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
+ CPUID.
+
+ 'Page not present' events are currently always delivered as synthetic
+ #PF exception. During delivery of these events APF CR2 register contains
+ a token that will be used to notify the guest when missing page becomes
+ available. Also, to make it possible to distinguish between real #PF and
+ APF, first 4 bytes of 64 byte memory location ('flags') will be written
+ to by the hypervisor at the time of injection. Only first bit of 'flags'
+ is currently supported, when set, it indicates that the guest is dealing
+ with asynchronous 'page not present' event. If during a page fault APF
+ 'flags' is '0' it means that this is regular page fault. Guest is
+ supposed to clear 'flags' when it is done handling #PF exception so the
+ next event can be delivered.
+
+ Note, since APF 'page not present' events use the same exception vector
+ as regular page fault, guest must reset 'flags' to '0' before it does
+ something that can generate normal page fault.
+
+ Bytes 5-7 of 64 byte memory location ('token') will be written to by the
+ hypervisor at the time of APF 'page ready' event injection. The content
+ of these bytes is a token which was previously delivered as 'page not
+ present' event. The event indicates the page in now available. Guest is
+ supposed to write '0' to 'token' when it is done handling 'page ready'
+ event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
+ writing to the MSR forces KVM to re-scan its queue and deliver the next
+ pending notification.
+
+ Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
+ ready' APF delivery needs to be written to before enabling APF mechanism
+ in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
+ available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+ Note, previously, 'page ready' events were delivered via the same #PF
+ exception as 'page not present' events but this is now deprecated. If
+ bit 3 (interrupt based delivery) is not set APF events are not delivered.
If APF is disabled while there are outstanding APFs, they will
not be delivered.
- Currently type 2 APF will be always delivered on the same vcpu as
- type 1 was, but guest should not rely on that.
+ Currently 'page ready' APF events will be always delivered on the
+ same vcpu as 'page not present' event was, but guest should not rely on
+ that.
MSR_KVM_STEAL_TIME:
0x4b564d03
@@ -319,3 +350,29 @@ data:
KVM guests can request the host not to poll on HLT, for example if
they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+ 0x4b564d06
+
+data:
+ Second asynchronous page fault (APF) control MSR.
+
+ Bits 0-7: APIC vector for delivery of 'page ready' APF events.
+ Bits 8-63: Reserved
+
+ Interrupt vector for asynchnonous 'page ready' notifications delivery.
+ The vector has to be set up before asynchronous page fault mechanism
+ is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
+ KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_ASYNC_PF_ACK:
+ 0x4b564d07
+
+data:
+ Asynchronous page fault (APF) acknowledgment.
+
+ When the guest is done processing 'page ready' APF event and 'token'
+ field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
+ write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
+ and check if there are more notifications pending. The MSR is available
+ if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
diff --git a/Documentation/virt/kvm/nested-vmx.rst b/Documentation/virt/kvm/nested-vmx.rst
index 592b0ab6970b..89851cbb7df9 100644
--- a/Documentation/virt/kvm/nested-vmx.rst
+++ b/Documentation/virt/kvm/nested-vmx.rst
@@ -116,10 +116,7 @@ struct shadow_vmcs is ever changed.
natural_width cr4_guest_host_mask;
natural_width cr0_read_shadow;
natural_width cr4_read_shadow;
- natural_width cr3_target_value0;
- natural_width cr3_target_value1;
- natural_width cr3_target_value2;
- natural_width cr3_target_value3;
+ natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
natural_width exit_qualification;
natural_width guest_linear_address;
natural_width guest_cr0;
diff --git a/Documentation/virt/kvm/review-checklist.rst b/Documentation/virt/kvm/review-checklist.rst
index 1f86a9d3f705..dc01aea4057b 100644
--- a/Documentation/virt/kvm/review-checklist.rst
+++ b/Documentation/virt/kvm/review-checklist.rst
@@ -10,7 +10,7 @@ Review checklist for kvm patches
2. Patches should be against kvm.git master branch.
3. If the patch introduces or modifies a new userspace API:
- - the API must be documented in Documentation/virt/kvm/api.txt
+ - the API must be documented in Documentation/virt/kvm/api.rst
- the API must be discoverable using KVM_CHECK_EXTENSION
4. New state must include support for save/restore.
diff --git a/Documentation/virt/kvm/running-nested-guests.rst b/Documentation/virt/kvm/running-nested-guests.rst
new file mode 100644
index 000000000000..d0a1fc754c84
--- /dev/null
+++ b/Documentation/virt/kvm/running-nested-guests.rst
@@ -0,0 +1,276 @@
+==============================
+Running nested guests with KVM
+==============================
+
+A nested guest is the ability to run a guest inside another guest (it
+can be KVM-based or a different hypervisor). The straightforward
+example is a KVM guest that in turn runs on a KVM guest (the rest of
+this document is built on this example)::
+
+ .----------------. .----------------.
+ | | | |
+ | L2 | | L2 |
+ | (Nested Guest) | | (Nested Guest) |
+ | | | |
+ |----------------'--'----------------|
+ | |
+ | L1 (Guest Hypervisor) |
+ | KVM (/dev/kvm) |
+ | |
+ .------------------------------------------------------.
+ | L0 (Host Hypervisor) |
+ | KVM (/dev/kvm) |
+ |------------------------------------------------------|
+ | Hardware (with virtualization extensions) |
+ '------------------------------------------------------'
+
+Terminology:
+
+- L0 – level-0; the bare metal host, running KVM
+
+- L1 – level-1 guest; a VM running on L0; also called the "guest
+ hypervisor", as it itself is capable of running KVM.
+
+- L2 – level-2 guest; a VM running on L1, this is the "nested guest"
+
+.. note:: The above diagram is modelled after the x86 architecture;
+ s390x, ppc64 and other architectures are likely to have
+ a different design for nesting.
+
+ For example, s390x always has an LPAR (LogicalPARtition)
+ hypervisor running on bare metal, adding another layer and
+ resulting in at least four levels in a nested setup — L0 (bare
+ metal, running the LPAR hypervisor), L1 (host hypervisor), L2
+ (guest hypervisor), L3 (nested guest).
+
+ This document will stick with the three-level terminology (L0,
+ L1, and L2) for all architectures; and will largely focus on
+ x86.
+
+
+Use Cases
+---------
+
+There are several scenarios where nested KVM can be useful, to name a
+few:
+
+- As a developer, you want to test your software on different operating
+ systems (OSes). Instead of renting multiple VMs from a Cloud
+ Provider, using nested KVM lets you rent a large enough "guest
+ hypervisor" (level-1 guest). This in turn allows you to create
+ multiple nested guests (level-2 guests), running different OSes, on
+ which you can develop and test your software.
+
+- Live migration of "guest hypervisors" and their nested guests, for
+ load balancing, disaster recovery, etc.
+
+- VM image creation tools (e.g. ``virt-install``, etc) often run
+ their own VM, and users expect these to work inside a VM.
+
+- Some OSes use virtualization internally for security (e.g. to let
+ applications run safely in isolation).
+
+
+Enabling "nested" (x86)
+-----------------------
+
+From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
+by default for Intel and AMD. (Though your Linux distribution might
+override this default.)
+
+In case you are running a Linux kernel older than v4.19, to enable
+nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
+persist this setting across reboots, you can add it in a config file, as
+shown below:
+
+1. On the bare metal host (L0), list the kernel modules and ensure that
+ the KVM modules::
+
+ $ lsmod | grep -i kvm
+ kvm_intel 133627 0
+ kvm 435079 1 kvm_intel
+
+2. Show information for ``kvm_intel`` module::
+
+ $ modinfo kvm_intel | grep -i nested
+ parm: nested:bool
+
+3. For the nested KVM configuration to persist across reboots, place the
+ below in ``/etc/modprobed/kvm_intel.conf`` (create the file if it
+ doesn't exist)::
+
+ $ cat /etc/modprobe.d/kvm_intel.conf
+ options kvm-intel nested=y
+
+4. Unload and re-load the KVM Intel module::
+
+ $ sudo rmmod kvm-intel
+ $ sudo modprobe kvm-intel
+
+5. Verify if the ``nested`` parameter for KVM is enabled::
+
+ $ cat /sys/module/kvm_intel/parameters/nested
+ Y
+
+For AMD hosts, the process is the same as above, except that the module
+name is ``kvm-amd``.
+
+
+Additional nested-related kernel parameters (x86)
+-------------------------------------------------
+
+If your hardware is sufficiently advanced (Intel Haswell processor or
+higher, which has newer hardware virt extensions), the following
+additional features will also be enabled by default: "Shadow VMCS
+(Virtual Machine Control Structure)", APIC Virtualization on your bare
+metal host (L0). Parameters for Intel hosts::
+
+ $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/enable_apicv
+ Y
+
+ $ cat /sys/module/kvm_intel/parameters/ept
+ Y
+
+.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
+ ensure the above are enabled (particularly
+ ``enable_shadow_vmcs`` and ``ept``).
+
+
+Starting a nested guest (x86)
+-----------------------------
+
+Once your bare metal host (L0) is configured for nesting, you should be
+able to start an L1 guest with::
+
+ $ qemu-kvm -cpu host [...]
+
+The above will pass through the host CPU's capabilities as-is to the
+gues); or for better live migration compatibility, use a named CPU
+model supported by QEMU. e.g.::
+
+ $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on
+
+then the guest hypervisor will subsequently be capable of running a
+nested guest with accelerated KVM.
+
+
+Enabling "nested" (s390x)
+-------------------------
+
+1. On the host hypervisor (L0), enable the ``nested`` parameter on
+ s390x::
+
+ $ rmmod kvm
+ $ modprobe kvm nested=1
+
+.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
+ with the ``nested`` paramter — i.e. to be able to enable
+ ``nested``, the ``hpage`` parameter *must* be disabled.
+
+2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
+ feature — with QEMU, this can be done by using "host passthrough"
+ (via the command-line ``-cpu host``).
+
+3. Now the KVM module can be loaded in the L1 (guest hypervisor)::
+
+ $ modprobe kvm
+
+
+Live migration with nested KVM
+------------------------------
+
+Migrating an L1 guest, with a *live* nested guest in it, to another
+bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
+Intel x86 systems, and even on older versions for s390x.
+
+On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
+should no longer be migrated or saved (refer to QEMU documentation on
+"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
+or save-and-load an L1 guest while an L2 guest is running will result in
+undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
+kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
+guest can no longer be considered stable or secure, and must be restarted.
+Migrating an L1 guest merely configured to support nesting, while not
+actually running L2 guests, is expected to function normally even on AMD
+systems but may fail once guests are started.
+
+Migrating an L2 guest is always expected to succeed, so all the following
+scenarios should work even on AMD systems:
+
+- Migrating a nested guest (L2) to another L1 guest on the *same* bare
+ metal host.
+
+- Migrating a nested guest (L2) to another L1 guest on a *different*
+ bare metal host.
+
+- Migrating a nested guest (L2) to a bare metal host.
+
+Reporting bugs from nested setups
+-----------------------------------
+
+Debugging "nested" problems can involve sifting through log files across
+L0, L1 and L2; this can result in tedious back-n-forth between the bug
+reporter and the bug fixer.
+
+- Mention that you are in a "nested" setup. If you are running any kind
+ of "nesting" at all, say so. Unfortunately, this needs to be called
+ out because when reporting bugs, people tend to forget to even
+ *mention* that they're using nested virtualization.
+
+- Ensure you are actually running KVM on KVM. Sometimes people do not
+ have KVM enabled for their guest hypervisor (L1), which results in
+ them running with pure emulation or what QEMU calls it as "TCG", but
+ they think they're running nested KVM. Thus confusing "nested Virt"
+ (which could also mean, QEMU on KVM) with "nested KVM" (KVM on KVM).
+
+Information to collect (generic)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following is not an exhaustive list, but a very good starting point:
+
+ - Kernel, libvirt, and QEMU version from L0
+
+ - Kernel, libvirt and QEMU version from L1
+
+ - QEMU command-line of L1 -- when using libvirt, you'll find it here:
+ ``/var/log/libvirt/qemu/instance.log``
+
+ - QEMU command-line of L2 -- as above, when using libvirt, get the
+ complete libvirt-generated QEMU command-line
+
+ - ``cat /sys/cpuinfo`` from L0
+
+ - ``cat /sys/cpuinfo`` from L1
+
+ - ``lscpu`` from L0
+
+ - ``lscpu`` from L1
+
+ - Full ``dmesg`` output from L0
+
+ - Full ``dmesg`` output from L1
+
+x86-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Both the below commands, ``x86info`` and ``dmidecode``, should be
+available on most Linux distributions with the same name:
+
+ - Output of: ``x86info -a`` from L0
+
+ - Output of: ``x86info -a`` from L1
+
+ - Output of: ``dmidecode`` from L0
+
+ - Output of: ``dmidecode`` from L1
+
+s390x-specific info to collect
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Along with the earlier mentioned generic details, the below is
+also recommended:
+
+ - ``/proc/sysinfo`` from L1; this will also include the info from L0