aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/virtual
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virtual')
-rw-r--r--Documentation/virtual/kvm/api.txt281
-rw-r--r--Documentation/virtual/kvm/cpuid.txt6
-rw-r--r--Documentation/virtual/kvm/msr.txt4
-rw-r--r--Documentation/virtual/virtio-spec.txt1164
4 files changed, 1370 insertions, 85 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6386f8c0482e..930126698a0f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2,6 +2,7 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
===================================================================
1. General description
+----------------------
The kvm API is a set of ioctls that are issued to control various aspects
of a virtual machine. The ioctls belong to three classes
@@ -23,7 +24,9 @@ of a virtual machine. The ioctls belong to three classes
Only run vcpu ioctls from the same thread that was used to create the
vcpu.
+
2. File descriptors
+-------------------
The kvm API is centered around file descriptors. An initial
open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
@@ -41,7 +44,9 @@ not cause harm to the host, their actual behavior is not guaranteed by
the API. The only supported use is one virtual machine per process,
and one vcpu per thread.
+
3. Extensions
+-------------
As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
incompatible change are allowed. However, there is an extension
@@ -53,7 +58,9 @@ Instead, kvm defines extension identifiers and a facility to query
whether a particular extension identifier is available. If it is, a
set of ioctls is available for application use.
+
4. API description
+------------------
This section describes ioctls that can be used to control kvm guests.
For each ioctl, the following information is provided along with a
@@ -75,6 +82,7 @@ description:
Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
are not detailed, but errors with specific meanings are.
+
4.1 KVM_GET_API_VERSION
Capability: basic
@@ -90,6 +98,7 @@ supported. Applications should refuse to run if KVM_GET_API_VERSION
returns a value other than 12. If this check passes, all ioctls
described as 'basic' will be available.
+
4.2 KVM_CREATE_VM
Capability: basic
@@ -109,6 +118,7 @@ In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
+
4.3 KVM_GET_MSR_INDEX_LIST
Capability: basic
@@ -135,6 +145,7 @@ Note: if kvm indicates supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are
not returned in the MSR list, as different vcpus can have a different number
of banks, as set via the KVM_X86_SETUP_MCE ioctl.
+
4.4 KVM_CHECK_EXTENSION
Capability: basic
@@ -149,6 +160,7 @@ receives an integer that describes the extension availability.
Generally 0 means no and 1 means yes, but some extensions may report
additional information in the integer return value.
+
4.5 KVM_GET_VCPU_MMAP_SIZE
Capability: basic
@@ -161,6 +173,7 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
memory region. This ioctl returns the size of that region. See the
KVM_RUN documentation for details.
+
4.6 KVM_SET_MEMORY_REGION
Capability: basic
@@ -171,6 +184,7 @@ Returns: 0 on success, -1 on error
This ioctl is obsolete and has been removed.
+
4.7 KVM_CREATE_VCPU
Capability: basic
@@ -223,6 +237,7 @@ machines, the resulting vcpu fd can be memory mapped at page offset
KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
cpu's hardware control block.
+
4.8 KVM_GET_DIRTY_LOG (vm ioctl)
Capability: basic
@@ -246,6 +261,7 @@ since the last call to this ioctl. Bit 0 is the first page in the
memory slot. Ensure the entire structure is cleared to avoid padding
issues.
+
4.9 KVM_SET_MEMORY_ALIAS
Capability: basic
@@ -256,6 +272,7 @@ Returns: 0 (success), -1 (error)
This ioctl is obsolete and has been removed.
+
4.10 KVM_RUN
Capability: basic
@@ -272,6 +289,7 @@ obtained by mmap()ing the vcpu fd at offset 0, with the size given by
KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct
kvm_run' (see below).
+
4.11 KVM_GET_REGS
Capability: basic
@@ -292,6 +310,7 @@ struct kvm_regs {
__u64 rip, rflags;
};
+
4.12 KVM_SET_REGS
Capability: basic
@@ -304,6 +323,7 @@ Writes the general purpose registers into the vcpu.
See KVM_GET_REGS for the data structure.
+
4.13 KVM_GET_SREGS
Capability: basic
@@ -331,6 +351,7 @@ interrupt_bitmap is a bitmap of pending external interrupts. At most
one bit may be set. This interrupt has been acknowledged by the APIC
but not yet injected into the cpu core.
+
4.14 KVM_SET_SREGS
Capability: basic
@@ -342,6 +363,7 @@ Returns: 0 on success, -1 on error
Writes special registers into the vcpu. See KVM_GET_SREGS for the
data structures.
+
4.15 KVM_TRANSLATE
Capability: basic
@@ -365,6 +387,7 @@ struct kvm_translation {
__u8 pad[5];
};
+
4.16 KVM_INTERRUPT
Capability: basic
@@ -413,6 +436,7 @@ c) KVM_INTERRUPT_SET_LEVEL
Note that any value for 'irq' other than the ones stated above is invalid
and incurs unexpected behavior.
+
4.17 KVM_DEBUG_GUEST
Capability: basic
@@ -423,6 +447,7 @@ Returns: -1 on error
Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
+
4.18 KVM_GET_MSRS
Capability: basic
@@ -451,6 +476,7 @@ Application code should set the 'nmsrs' member (which indicates the
size of the entries array) and the 'index' member of each array entry.
kvm will fill in the 'data' member.
+
4.19 KVM_SET_MSRS
Capability: basic
@@ -466,6 +492,7 @@ Application code should set the 'nmsrs' member (which indicates the
size of the entries array), and the 'index' and 'data' members of each
array entry.
+
4.20 KVM_SET_CPUID
Capability: basic
@@ -494,6 +521,7 @@ struct kvm_cpuid {
struct kvm_cpuid_entry entries[0];
};
+
4.21 KVM_SET_SIGNAL_MASK
Capability: basic
@@ -516,6 +544,7 @@ struct kvm_signal_mask {
__u8 sigset[0];
};
+
4.22 KVM_GET_FPU
Capability: basic
@@ -541,6 +570,7 @@ struct kvm_fpu {
__u32 pad2;
};
+
4.23 KVM_SET_FPU
Capability: basic
@@ -566,6 +596,7 @@ struct kvm_fpu {
__u32 pad2;
};
+
4.24 KVM_CREATE_IRQCHIP
Capability: KVM_CAP_IRQCHIP
@@ -579,6 +610,7 @@ ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a
local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23
only go to the IOAPIC. On ia64, a IOSAPIC is created.
+
4.25 KVM_IRQ_LINE
Capability: KVM_CAP_IRQCHIP
@@ -600,6 +632,7 @@ struct kvm_irq_level {
__u32 level; /* 0 or 1 */
};
+
4.26 KVM_GET_IRQCHIP
Capability: KVM_CAP_IRQCHIP
@@ -621,6 +654,7 @@ struct kvm_irqchip {
} chip;
};
+
4.27 KVM_SET_IRQCHIP
Capability: KVM_CAP_IRQCHIP
@@ -642,6 +676,7 @@ struct kvm_irqchip {
} chip;
};
+
4.28 KVM_XEN_HVM_CONFIG
Capability: KVM_CAP_XEN_HVM
@@ -666,6 +701,7 @@ struct kvm_xen_hvm_config {
__u8 pad2[30];
};
+
4.29 KVM_GET_CLOCK
Capability: KVM_CAP_ADJUST_CLOCK
@@ -684,6 +720,7 @@ struct kvm_clock_data {
__u32 pad[9];
};
+
4.30 KVM_SET_CLOCK
Capability: KVM_CAP_ADJUST_CLOCK
@@ -702,6 +739,7 @@ struct kvm_clock_data {
__u32 pad[9];
};
+
4.31 KVM_GET_VCPU_EVENTS
Capability: KVM_CAP_VCPU_EVENTS
@@ -741,6 +779,7 @@ struct kvm_vcpu_events {
KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that
interrupt.shadow contains a valid state. Otherwise, this field is undefined.
+
4.32 KVM_SET_VCPU_EVENTS
Capability: KVM_CAP_VCPU_EVENTS
@@ -767,6 +806,7 @@ If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
the flags field to signal that interrupt.shadow contains a valid state and
shall be written into the VCPU.
+
4.33 KVM_GET_DEBUGREGS
Capability: KVM_CAP_DEBUGREGS
@@ -785,6 +825,7 @@ struct kvm_debugregs {
__u64 reserved[9];
};
+
4.34 KVM_SET_DEBUGREGS
Capability: KVM_CAP_DEBUGREGS
@@ -798,6 +839,7 @@ Writes debug registers into the vcpu.
See KVM_GET_DEBUGREGS for the data structure. The flags field is unused
yet and must be cleared on entry.
+
4.35 KVM_SET_USER_MEMORY_REGION
Capability: KVM_CAP_USER_MEM
@@ -844,6 +886,7 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
allocation and is deprecated.
+
4.36 KVM_SET_TSS_ADDR
Capability: KVM_CAP_SET_TSS_ADDR
@@ -862,6 +905,7 @@ This ioctl is required on Intel-based hosts. This is needed on Intel hardware
because of a quirk in the virtualization implementation (see the internals
documentation when it pops into existence).
+
4.37 KVM_ENABLE_CAP
Capability: KVM_CAP_ENABLE_CAP
@@ -897,6 +941,7 @@ function properly, this is the place to put them.
__u8 pad[64];
};
+
4.38 KVM_GET_MP_STATE
Capability: KVM_CAP_MP_STATE
@@ -927,6 +972,7 @@ Possible values are:
This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
irqchip, the multiprocessing state must be maintained by userspace.
+
4.39 KVM_SET_MP_STATE
Capability: KVM_CAP_MP_STATE
@@ -941,6 +987,7 @@ arguments.
This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
irqchip, the multiprocessing state must be maintained by userspace.
+
4.40 KVM_SET_IDENTITY_MAP_ADDR
Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
@@ -959,6 +1006,7 @@ This ioctl is required on Intel-based hosts. This is needed on Intel hardware
because of a quirk in the virtualization implementation (see the internals
documentation when it pops into existence).
+
4.41 KVM_SET_BOOT_CPU_ID
Capability: KVM_CAP_SET_BOOT_CPU_ID
@@ -971,6 +1019,7 @@ Define which vcpu is the Bootstrap Processor (BSP). Values are the same
as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
is vcpu 0.
+
4.42 KVM_GET_XSAVE
Capability: KVM_CAP_XSAVE
@@ -985,6 +1034,7 @@ struct kvm_xsave {
This ioctl would copy current vcpu's xsave struct to the userspace.
+
4.43 KVM_SET_XSAVE
Capability: KVM_CAP_XSAVE
@@ -999,6 +1049,7 @@ struct kvm_xsave {
This ioctl would copy userspace's xsave struct to the kernel.
+
4.44 KVM_GET_XCRS
Capability: KVM_CAP_XCRS
@@ -1022,6 +1073,7 @@ struct kvm_xcrs {
This ioctl would copy current vcpu's xcrs to the userspace.
+
4.45 KVM_SET_XCRS
Capability: KVM_CAP_XCRS
@@ -1045,6 +1097,7 @@ struct kvm_xcrs {
This ioctl would set vcpu's xcr to the value userspace specified.
+
4.46 KVM_GET_SUPPORTED_CPUID
Capability: KVM_CAP_EXT_CPUID
@@ -1119,6 +1172,7 @@ support. Instead it is reported via
if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
+
4.47 KVM_PPC_GET_PVINFO
Capability: KVM_CAP_PPC_GET_PVINFO
@@ -1142,6 +1196,7 @@ of 4 instructions that make up a hypercall.
If any additional field gets added to this structure later on, a bit for that
additional piece of information will be set in the flags bitmap.
+
4.48 KVM_ASSIGN_PCI_DEVICE
Capability: KVM_CAP_DEVICE_ASSIGNMENT
@@ -1185,6 +1240,7 @@ Only PCI header type 0 devices with PCI BAR resources are supported by
device assignment. The user requesting this ioctl must have read/write
access to the PCI sysfs resource files associated with the device.
+
4.49 KVM_DEASSIGN_PCI_DEVICE
Capability: KVM_CAP_DEVICE_DEASSIGNMENT
@@ -1198,6 +1254,7 @@ Ends PCI device assignment, releasing all associated resources.
See KVM_CAP_DEVICE_ASSIGNMENT for the data structure. Only assigned_dev_id is
used in kvm_assigned_pci_dev to identify the device.
+
4.50 KVM_ASSIGN_DEV_IRQ
Capability: KVM_CAP_ASSIGN_DEV_IRQ
@@ -1231,6 +1288,7 @@ The following flags are defined:
It is not valid to specify multiple types per host or guest IRQ. However, the
IRQ type of host and guest can differ or can even be null.
+
4.51 KVM_DEASSIGN_DEV_IRQ
Capability: KVM_CAP_ASSIGN_DEV_IRQ
@@ -1245,6 +1303,7 @@ See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
by assigned_dev_id, flags must correspond to the IRQ type specified on
KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest IRQ is allowed.
+
4.52 KVM_SET_GSI_ROUTING
Capability: KVM_CAP_IRQ_ROUTING
@@ -1293,6 +1352,7 @@ struct kvm_irq_routing_msi {
__u32 pad;
};
+
4.53 KVM_ASSIGN_SET_MSIX_NR
Capability: KVM_CAP_DEVICE_MSIX
@@ -1314,6 +1374,7 @@ struct kvm_assigned_msix_nr {
#define KVM_MAX_MSIX_PER_DEV 256
+
4.54 KVM_ASSIGN_SET_MSIX_ENTRY
Capability: KVM_CAP_DEVICE_MSIX
@@ -1332,7 +1393,8 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
};
-4.54 KVM_SET_TSC_KHZ
+
+4.55 KVM_SET_TSC_KHZ
Capability: KVM_CAP_TSC_CONTROL
Architectures: x86
@@ -1343,7 +1405,8 @@ Returns: 0 on success, -1 on error
Specifies the tsc frequency for the virtual machine. The unit of the
frequency is KHz.
-4.55 KVM_GET_TSC_KHZ
+
+4.56 KVM_GET_TSC_KHZ
Capability: KVM_CAP_GET_TSC_KHZ
Architectures: x86
@@ -1355,7 +1418,8 @@ Returns the tsc frequency of the guest. The unit of the return value is
KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
error.
-4.56 KVM_GET_LAPIC
+
+4.57 KVM_GET_LAPIC
Capability: KVM_CAP_IRQCHIP
Architectures: x86
@@ -1371,7 +1435,8 @@ struct kvm_lapic_state {
Reads the Local APIC registers and copies them into the input argument. The
data format and layout are the same as documented in the architecture manual.
-4.57 KVM_SET_LAPIC
+
+4.58 KVM_SET_LAPIC
Capability: KVM_CAP_IRQCHIP
Architectures: x86
@@ -1387,7 +1452,8 @@ struct kvm_lapic_state {
Copies the input argument into the the Local APIC registers. The data format
and layout are the same as documented in the architecture manual.
-4.58 KVM_IOEVENTFD
+
+4.59 KVM_IOEVENTFD
Capability: KVM_CAP_IOEVENTFD
Architectures: all
@@ -1417,7 +1483,8 @@ The following flags are defined:
If datamatch flag is set, the event will be signaled only if the written value
to the registered address is equal to datamatch in struct kvm_ioeventfd.
-4.59 KVM_DIRTY_TLB
+
+4.60 KVM_DIRTY_TLB
Capability: KVM_CAP_SW_TLB
Architectures: ppc
@@ -1449,7 +1516,8 @@ The "num_dirty" field is a performance hint for KVM to determine whether it
should skip processing the bitmap and just invalidate everything. It must
be set to the number of set bits in the bitmap.
-4.60 KVM_ASSIGN_SET_INTX_MASK
+
+4.61 KVM_ASSIGN_SET_INTX_MASK
Capability: KVM_CAP_PCI_2_3
Architectures: x86
@@ -1482,6 +1550,7 @@ See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
evaluated.
+
4.62 KVM_CREATE_SPAPR_TCE
Capability: KVM_CAP_SPAPR_TCE
@@ -1517,6 +1586,7 @@ the entries written by kernel-handled H_PUT_TCE calls, and also lets
userspace update the TCE table directly which is useful in some
circumstances.
+
4.63 KVM_ALLOCATE_RMA
Capability: KVM_CAP_PPC_RMA
@@ -1549,6 +1619,7 @@ is supported; 2 if the processor requires all virtual machines to have
an RMA, or 1 if the processor can use an RMA but doesn't require it,
because it supports the Virtual RMA (VRMA) facility.
+
4.64 KVM_NMI
Capability: KVM_CAP_USER_NMI
@@ -1574,6 +1645,7 @@ following algorithm:
Some guests configure the LINT1 NMI input to cause a panic, aiding in
debugging.
+
4.65 KVM_S390_UCAS_MAP
Capability: KVM_CAP_S390_UCONTROL
@@ -1593,6 +1665,7 @@ This ioctl maps the memory at "user_addr" with the length "length" to
the vcpu's address space starting at "vcpu_addr". All parameters need to
be alligned by 1 megabyte.
+
4.66 KVM_S390_UCAS_UNMAP
Capability: KVM_CAP_S390_UCONTROL
@@ -1612,6 +1685,7 @@ This ioctl unmaps the memory in the vcpu's address space starting at
"vcpu_addr" with the length "length". The field "user_addr" is ignored.
All parameters need to be alligned by 1 megabyte.
+
4.67 KVM_S390_VCPU_FAULT
Capability: KVM_CAP_S390_UCONTROL
@@ -1628,6 +1702,7 @@ table upfront. This is useful to handle validity intercepts for user
controlled virtual machines to fault in the virtual cpu's lowcore pages
prior to calling the KVM_RUN ioctl.
+
4.68 KVM_SET_ONE_REG
Capability: KVM_CAP_ONE_REG
@@ -1653,6 +1728,7 @@ registers, find a list below:
| |
PPC | KVM_REG_PPC_HIOR | 64
+
4.69 KVM_GET_ONE_REG
Capability: KVM_CAP_ONE_REG
@@ -1669,7 +1745,193 @@ at the memory location pointed to by "addr".
The list of registers accessible using this interface is identical to the
list in 4.64.
+
+4.70 KVM_KVMCLOCK_CTRL
+
+Capability: KVM_CAP_KVMCLOCK_CTRL
+Architectures: Any that implement pvclocks (currently x86 only)
+Type: vcpu ioctl
+Parameters: None
+Returns: 0 on success, -1 on error
+
+This signals to the host kernel that the specified guest is being paused by
+userspace. The host will set a flag in the pvclock structure that is checked
+from the soft lockup watchdog. The flag is part of the pvclock structure that
+is shared between guest and host, specifically the second bit of the flags
+field of the pvclock_vcpu_time_info structure. It will be set exclusively by
+the host and read/cleared exclusively by the guest. The guest operation of
+checking and clearing the flag must an atomic operation so
+load-link/store-conditional, or equivalent must be used. There are two cases
+where the guest will clear the flag: when the soft lockup watchdog timer resets
+itself or when a soft lockup is detected. This ioctl can be called any time
+after pausing the vcpu, but before it is resumed.
+
+
+4.71 KVM_SIGNAL_MSI
+
+Capability: KVM_CAP_SIGNAL_MSI
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_msi (in)
+Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error
+
+Directly inject a MSI message. Only valid with in-kernel irqchip that handles
+MSI messages.
+
+struct kvm_msi {
+ __u32 address_lo;
+ __u32 address_hi;
+ __u32 data;
+ __u32 flags;
+ __u8 pad[16];
+};
+
+No flags are defined so far. The corresponding field must be 0.
+
+
+4.71 KVM_CREATE_PIT2
+
+Capability: KVM_CAP_PIT2
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_pit_config (in)
+Returns: 0 on success, -1 on error
+
+Creates an in-kernel device model for the i8254 PIT. This call is only valid
+after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. The following
+parameters have to be passed:
+
+struct kvm_pit_config {
+ __u32 flags;
+ __u32 pad[15];
+};
+
+Valid flags are:
+
+#define KVM_PIT_SPEAKER_DUMMY 1 /* emulate speaker port stub */
+
+PIT timer interrupts may use a per-VM kernel thread for injection. If it
+exists, this thread will have a name of the following pattern:
+
+kvm-pit/<owner-process-pid>
+
+When running a guest with elevated priorities, the scheduling parameters of
+this thread may have to be adjusted accordingly.
+
+This IOCTL replaces the obsolete KVM_CREATE_PIT.
+
+
+4.72 KVM_GET_PIT2
+
+Capability: KVM_CAP_PIT_STATE2
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_pit_state2 (out)
+Returns: 0 on success, -1 on error
+
+Retrieves the state of the in-kernel PIT model. Only valid after
+KVM_CREATE_PIT2. The state is returned in the following structure:
+
+struct kvm_pit_state2 {
+ struct kvm_pit_channel_state channels[3];
+ __u32 flags;
+ __u32 reserved[9];
+};
+
+Valid flags are:
+
+/* disable PIT in HPET legacy mode */
+#define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+
+This IOCTL replaces the obsolete KVM_GET_PIT.
+
+
+4.73 KVM_SET_PIT2
+
+Capability: KVM_CAP_PIT_STATE2
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_pit_state2 (in)
+Returns: 0 on success, -1 on error
+
+Sets the state of the in-kernel PIT model. Only valid after KVM_CREATE_PIT2.
+See KVM_GET_PIT2 for details on struct kvm_pit_state2.
+
+This IOCTL replaces the obsolete KVM_SET_PIT.
+
+
+4.74 KVM_PPC_GET_SMMU_INFO
+
+Capability: KVM_CAP_PPC_GET_SMMU_INFO
+Architectures: powerpc
+Type: vm ioctl
+Parameters: None
+Returns: 0 on success, -1 on error
+
+This populates and returns a structure describing the features of
+the "Server" class MMU emulation supported by KVM.
+This can in turn be used by userspace to generate the appropariate
+device-tree properties for the guest operating system.
+
+The structure contains some global informations, followed by an
+array of supported segment page sizes:
+
+ struct kvm_ppc_smmu_info {
+ __u64 flags;
+ __u32 slb_size;
+ __u32 pad;
+ struct kvm_ppc_one_seg_page_size sps[KVM_PPC_PAGE_SIZES_MAX_SZ];
+ };
+
+The supported flags are:
+
+ - KVM_PPC_PAGE_SIZES_REAL:
+ When that flag is set, guest page sizes must "fit" the backing
+ store page sizes. When not set, any page size in the list can
+ be used regardless of how they are backed by userspace.
+
+ - KVM_PPC_1T_SEGMENTS
+ The emulated MMU supports 1T segments in addition to the
+ standard 256M ones.
+
+The "slb_size" field indicates how many SLB entries are supported
+
+The "sps" array contains 8 entries indicating the supported base
+page sizes for a segment in increasing order. Each entry is defined
+as follow:
+
+ struct kvm_ppc_one_seg_page_size {
+ __u32 page_shift; /* Base page shift of segment (or 0) */
+ __u32 slb_enc; /* SLB encoding for BookS */
+ struct kvm_ppc_one_page_size enc[KVM_PPC_PAGE_SIZES_MAX_SZ];
+ };
+
+An entry with a "page_shift" of 0 is unused. Because the array is
+organized in increasing order, a lookup can stop when encoutering
+such an entry.
+
+The "slb_enc" field provides the encoding to use in the SLB for the
+page size. The bits are in positions such as the value can directly
+be OR'ed into the "vsid" argument of the slbmte instruction.
+
+The "enc" array is a list which for each of those segment base page
+size provides the list of supported actual page sizes (which can be
+only larger or equal to the base page size), along with the
+corresponding encoding in the hash PTE. Similarily, the array is
+8 entries sorted by increasing sizes and an entry with a "0" shift
+is an empty entry and a terminator:
+
+ struct kvm_ppc_one_page_size {
+ __u32 page_shift; /* Page shift (or 0) */
+ __u32 pte_enc; /* Encoding in the HPTE (>>12) */
+ };
+
+The "pte_enc" field provides a value that can OR'ed into the hash
+PTE's RPN field (ie, it needs to be shifted left by 12 to OR it
+into the hash PTE second double word).
+
5. The kvm_run structure
+------------------------
Application code obtains a pointer to the kvm_run structure by
mmap()ing a vcpu fd. From that point, application code can control
@@ -1910,7 +2172,9 @@ and usually define the validity of a groups of registers. (e.g. one bit
};
+
6. Capabilities that can be enabled
+-----------------------------------
There are certain capabilities that change the behavior of the virtual CPU when
enabled. To enable them, please see section 4.37. Below you can find a list of
@@ -1926,6 +2190,7 @@ The following information is provided along with the description:
Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
are not detailed, but errors with specific meanings are.
+
6.1 KVM_CAP_PPC_OSI
Architectures: ppc
@@ -1939,6 +2204,7 @@ between the guest and the host.
When this capability is enabled, KVM_EXIT_OSI can occur.
+
6.2 KVM_CAP_PPC_PAPR
Architectures: ppc
@@ -1957,6 +2223,7 @@ HTAB invisible to the guest.
When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
+
6.3 KVM_CAP_SW_TLB
Architectures: ppc
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 882068538c9c..83afe65d4966 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -10,11 +10,15 @@ a guest.
KVM cpuid functions are:
function: KVM_CPUID_SIGNATURE (0x40000000)
-returns : eax = 0,
+returns : eax = 0x40000001,
ebx = 0x4b4d564b,
ecx = 0x564b4d56,
edx = 0x4d.
Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
+The value in eax corresponds to the maximum cpuid function present in this leaf,
+and will be updated if more functions are added in the future.
+Note also that old hosts set eax value to 0x0. This should
+be interpreted as if the value was 0x40000001.
This function queries the presence of KVM cpuid leafs.
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 50317809113d..96b41bd97523 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -109,6 +109,10 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
0 | 24 | multiple cpus are guaranteed to
| | be monotonic
-------------------------------------------------------------
+ | | guest vcpu has been paused by
+ 1 | N/A | the host
+ | | See 4.70 in api.txt
+ -------------------------------------------------------------
Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
leaf prior to usage.
diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt
index da094737e2f8..0d6ec85481cb 100644
--- a/Documentation/virtual/virtio-spec.txt
+++ b/Documentation/virtual/virtio-spec.txt
@@ -1,11 +1,11 @@
[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
Virtio PCI Card Specification
-v0.9.1 DRAFT
+v0.9.5 DRAFT
-
-Rusty Russell <rusty@rustcorp.com.au>IBM Corporation (Editor)
+Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor)
-2011 August 1.
+2012 May 7.
Purpose and Description
@@ -68,11 +68,11 @@ and consists of three parts:
+-------------------+-----------------------------------+-----------+
-When the driver wants to send buffers to the device, it puts them
-in one or more slots in the descriptor table, and writes the
-descriptor indices into the available ring. It then notifies the
-device. When the device has finished with the buffers, it writes
-the descriptors into the used ring, and sends an interrupt.
+When the driver wants to send a buffer to the device, it fills in
+a slot in the descriptor table (or chains several together), and
+writes the descriptor index into the available ring. It then
+notifies the device. When the device has finished a buffer, it
+writes the descriptor into the used ring, and sends an interrupt.
Specification
@@ -106,8 +106,14 @@ for informational purposes by the guest).
+----------------------+--------------------+---------------+
| 6 | ioMemory | - |
+----------------------+--------------------+---------------+
+| 7 | rpmsg | Appendix H |
++----------------------+--------------------+---------------+
+| 8 | SCSI host | Appendix I |
++----------------------+--------------------+---------------+
| 9 | 9P transport | - |
+----------------------+--------------------+---------------+
+| 10 | mac80211 wlan | - |
++----------------------+--------------------+---------------+
Device Configuration
@@ -127,7 +133,7 @@ Note that this is possible because while the virtio header is PCI
the native endian of the guest (where such distinction is
applicable).
- Device Initialization Sequence
+ Device Initialization Sequence<sub:Device-Initialization-Sequence>
We start with an overview of device initialization, then expand
on the details of the device and how each step is preformed.
@@ -177,7 +183,10 @@ The virtio header looks as follows:
If MSI-X is enabled for the device, two additional fields
-immediately follow this header:
+immediately follow this header:[footnote:
+ie. once you enable MSI-X on the device, the other fields move.
+If you turn it off again, they move back!
+]
+------------++----------------+--------+
@@ -191,20 +200,6 @@ immediately follow this header:
+------------++----------------+--------+
-Finally, if feature bits (VIRTIO_F_FEATURES_HI) this is
-immediately followed by two additional fields:
-
-
-+------------++----------------------+----------------------
-| Bits || 32 | 32
-+------------++----------------------+----------------------
-| Read/Write || R | R+W
-+------------++----------------------+----------------------
-| Purpose || Device | Guest
-| || Features bits 32:63 | Features bits 32:63
-+------------++----------------------+----------------------
-
-
Immediately following these general headers, there may be
device-specific headers:
@@ -238,31 +233,25 @@ at least one bit should be set:
may be a significant (or infinite) delay before setting this
bit.
- DRIVER_OK (3) Indicates that the driver is set up and ready to
+ DRIVER_OK (4) Indicates that the driver is set up and ready to
drive the device.
- FAILED (8) Indicates that something went wrong in the guest,
+ FAILED (128) Indicates that something went wrong in the guest,
and it has given up on the device. This could be an internal
error, or the driver didn't like the device for some reason, or
even a fatal error during device operation. The device must be
reset before attempting to re-initialize.
- Feature Bits
+ Feature Bits<sub:Feature-Bits>
-The least significant 31 bits of the first configuration field
-indicates the features that the device supports (the high bit is
-reserved, and will be used to indicate the presence of future
-feature bits elsewhere). If more than 31 feature bits are
-supported, the device indicates so by setting feature bit 31 (see
-[cha:Reserved-Feature-Bits]). The bits are allocated as follows:
+Thefirst configuration field indicates the features that the
+device supports. The bits are allocated as follows:
0 to 23 Feature bits for the specific device type
- 24 to 40 Feature bits reserved for extensions to the queue and
+ 24 to 32 Feature bits reserved for extensions to the queue and
feature negotiation mechanisms
- 41 to 63 Feature bits reserved for future extensions
-
For example, feature bit 0 for a network device (i.e. Subsystem
Device ID 1) indicates that the device supports checksumming of
packets.
@@ -286,10 +275,6 @@ will not see that feature bit in the Device Features field and
can go into backwards compatibility mode (or, for poor
implementations, set the FAILED Device Status bit).
-Access to feature bits 32 to 63 is enabled by Guest by setting
-feature bit 31. If this bit is unset, Device must assume that all
-feature bits > 31 are unset.
-
Configuration/Queue Vectors
When MSI-X capability is present and enabled in the device
@@ -324,7 +309,7 @@ success, the previously written value is returned, and on
failure, NO_VECTOR is returned. If a mapping failure is detected,
the driver can retry mapping with fewervectors, or disable MSI-X.
- Virtqueue Configuration
+ Virtqueue Configuration<sec:Virtqueue-Configuration>
As a device can have zero or more virtqueues for bulk data
transport (for example, the network driver has two), the driver
@@ -587,7 +572,7 @@ and Red Hat under the (3-clause) BSD license so that it can be
freely used by all other projects, and is reproduced (with slight
variation to remove Linux assumptions) in Appendix A.
- Device Operation
+ Device Operation<sec:Device-Operation>
There are two parts to device operation: supplying new buffers to
the device, and processing used buffers from the device. As an
@@ -813,7 +798,7 @@ vring.used->ring[vq->last_seen_used%vsz];
}
- Dealing With Configuration Changes
+ Dealing With Configuration Changes<sub:Dealing-With-Configuration>
Some virtio PCI devices can change the device configuration
state, as reflected in the virtio header in the PCI configuration
@@ -1260,18 +1245,6 @@ Currently there are five device-independent feature bits defined:
driver should ignore the used_event field; the device should
ignore the avail_event field; the flags field is used
- VIRTIO_F_BAD_FEATURE(30) This feature should never be
- negotiated by the guest; doing so is an indication that the
- guest is faulty[footnote:
-An experimental virtio PCI driver contained in Linux version
-2.6.25 had this problem, and this feature bit can be used to
-detect it.
-]
-
- VIRTIO_F_FEATURES_HIGH(31) This feature indicates that the
- device supports feature bits 32:63. If unset, feature bits
- 32:63 are unset.
-
Appendix C: Network Device
The virtio network device is a virtual ethernet card, and is the
@@ -1335,11 +1308,17 @@ were required.
VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.
+ VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous
+ packets.
+
Device configuration layout Two configuration fields are
currently defined. The mac address field always exists (though
is only valid if VIRTIO_NET_F_MAC is set), and the status field
- only exists if VIRTIO_NET_F_STATUS is set. Only one bit is
- currently defined for the status field: VIRTIO_NET_S_LINK_UP. #define VIRTIO_NET_S_LINK_UP 1
+ only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits
+ are currently defined for the status field:
+ VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE. #define VIRTIO_NET_S_LINK_UP 1
+
+#define VIRTIO_NET_S_ANNOUNCE 2
@@ -1377,12 +1356,19 @@ struct virtio_net_config {
packets by negotating the VIRTIO_NET_F_CSUM feature. This “
checksum offload” is a common feature on modern network cards.
- If that feature is negotiated, a driver can use TCP or UDP
- segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4
- (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and
- VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should
- not send TCP packets requiring segmentation offload which have
- the Explicit Congestion Notification bit set, unless the
+ If that feature is negotiated[footnote:
+ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are
+dependent on VIRTIO_NET_F_CSUM; a dvice which offers the offload
+features must offer the checksum feature, and a driver which
+accepts the offload features must accept the checksum feature.
+Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features
+depending on VIRTIO_NET_F_GUEST_CSUM.
+], a driver can use TCP or UDP segmentation offload by
+ negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP),
+ VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO
+ (UDP fragmentation) features. It should not send TCP packets
+ requiring segmentation offload which have the Explicit
+ Congestion Notification bit set, unless the
VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
This is a common restriction in real, older network cards.
]
@@ -1403,7 +1389,7 @@ segmentation, if both guests are amenable.
Packets are transmitted by placing them in the transmitq, and
buffers for incoming packets are placed in the receiveq. In each
-case, the packet itself is preceded by a header:
+case, the packet itself is preceeded by a header:
struct virtio_net_hdr {
@@ -1462,9 +1448,10 @@ It will have a 14 byte ethernet header and 20 byte IP header
followed by the TCP header (with the TCP checksum field 16 bytes
into that header). csum_start will be 14+20 = 34 (the TCP
checksum includes the header), and csum_offset will be 16. The
-value in the TCP checksum field will be the sum of the TCP pseudo
-header, so that replacing it by the ones' complement checksum of
-the TCP header and body will give the correct result.
+value in the TCP checksum field should be initialized to the sum
+of the TCP pseudo header, so that replacing it by the ones'
+complement checksum of the TCP header and body will give the
+correct result.
]
<enu:If-the-driver>If the driver negotiated
@@ -1483,8 +1470,8 @@ Due to various bugs in implementations, this field is not useful
as a guarantee of the transport header size.
]
- gso_size is the size of the packet beyond that header (ie.
- MSS).
+ gso_size is the maximum size of each packet beyond that header
+ (ie. MSS).
If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
@@ -1567,7 +1554,9 @@ Processing packet involves:
If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
negotiated, then the “gso_type” may be something other than
VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
- desired MSS (see [enu:If-the-driver]).Control Virtqueue
+ desired MSS (see [enu:If-the-driver]).
+
+ Control Virtqueue
The driver uses the control virtqueue (if VIRTIO_NET_F_VTRL_VQ is
negotiated) to send commands to manipulate various features of
@@ -1642,7 +1631,7 @@ struct virtio_net_ctrl_mac {
The device can filter incoming packets by any number of
destination MAC addresses.[footnote:
-Since there are no guarantees, it can use a hash filter
+Since there are no guarentees, it can use a hash filter
orsilently switch to allmulti or promiscuous mode if it is given
too many addresses.
] This table is set using the class VIRTIO_NET_CTRL_MAC and the
@@ -1665,6 +1654,38 @@ can control a VLAN filter table in the device.
Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
command take a 16-bit VLAN id as the command-specific-data.
+ Gratuitous Packet Sending
+
+If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE (depends
+on VIRTIO_NET_F_CTRL_VQ), it can ask the guest to send gratuitous
+packets; this is usually done after the guest has been physically
+migrated, and needs to announce its presence on the new network
+links. (As hypervisor does not have the knowledge of guest
+network configuration (eg. tagged vlan) it is simplest to prod
+the guest in this way).
+
+#define VIRTIO_NET_CTRL_ANNOUNCE 3
+
+ #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0
+
+The Guest needs to check VIRTIO_NET_S_ANNOUNCE bit in status
+field when it notices the changes of device configuration. The
+command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that
+driver has recevied the notification and device would clear the
+VIRTIO_NET_S_ANNOUNCE bit in the status filed after it received
+this command.
+
+Processing this notification involves:
+
+ Sending the gratuitous packets or marking there are pending
+ gratuitous packets to be sent and letting deferred routine to
+ send them.
+
+ Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control
+ vq.
+
+ .
+
Appendix D: Block Device
The virtio block device is a simple virtual block device (ie.
@@ -1699,8 +1720,6 @@ device except where noted.
VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
-
-
Device configuration layout The capacity of the device
(expressed in 512-byte sectors) is always present. The
availability of the others all depend on various feature bits
@@ -1743,8 +1762,6 @@ device except where noted.
If the VIRTIO_BLK_F_RO feature is set by the device, any write
requests will fail.
-
-
Device Operation
The driver queues requests to the virtqueue, and they are used by
@@ -1805,7 +1822,7 @@ the FLUSH and FLUSH_OUT types are equivalent, the device does not
distinguish between them
]). If the device has VIRTIO_BLK_F_BARRIER feature the high bit
(VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
-barrier and that all preceding requests must be complete before
+barrier and that all preceeding requests must be complete before
this one, and all following requests must not be started until
this is complete. Note that a barrier does not flush caches in
the underlying backend device in host, and thus does not serve as
@@ -2118,7 +2135,7 @@ This is historical, and independent of the guest page size
Otherwise, the guest may begin to re-use pages previously given
to the balloon before the device has acknowledged their
- withdrawal. [footnote:
+ withdrawl. [footnote:
In this case, deflation advice is merely a courtesy
]
@@ -2198,3 +2215,996 @@ as follows:
VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
(in bytes).
+Appendix H: Rpmsg: Remote Processor Messaging
+
+Virtio rpmsg devices represent remote processors on the system
+which run in asymmetric multi-processing (AMP) configuration, and
+which are usually used to offload cpu-intensive tasks from the
+main application processor (a typical SoC methodology).
+
+Virtio is being used to communicate with those remote processors;
+empty buffers are placed in one virtqueue for receiving messages,
+and non-empty buffers, containing outbound messages, are enqueued
+in a second virtqueue for transmission.
+
+Numerous communication channels can be multiplexed over those two
+virtqueues, so different entities, running on the application and
+remote processor, can directly communicate in a point-to-point
+fashion.
+
+ Configuration
+
+ Subsystem Device ID 7
+
+ Virtqueues 0:receiveq. 1:transmitq.
+
+ Feature bits
+
+ VIRTIO_RPMSG_F_NS (0) Device sends (and capable of receiving)
+ name service messages announcing the creation (or
+ destruction) of a channel:/**
+
+ * struct rpmsg_ns_msg - dynamic name service announcement
+message
+
+ * @name: name of remote service that is published
+
+ * @addr: address of remote service that is published
+
+ * @flags: indicates whether service is created or destroyed
+
+ *
+
+ * This message is sent across to publish a new service (or
+announce
+
+ * about its removal). When we receives these messages, an
+appropriate
+
+ * rpmsg channel (i.e device) is created/destroyed.
+
+ */
+
+struct rpmsg_ns_msgoon_config {
+
+ char name[RPMSG_NAME_SIZE];
+
+ u32 addr;
+
+ u32 flags;
+
+} __packed;
+
+
+
+/**
+
+ * enum rpmsg_ns_flags - dynamic name service announcement flags
+
+ *
+
+ * @RPMSG_NS_CREATE: a new remote service was just created
+
+ * @RPMSG_NS_DESTROY: a remote service was just destroyed
+
+ */
+
+enum rpmsg_ns_flags {
+
+ RPMSG_NS_CREATE = 0,
+
+ RPMSG_NS_DESTROY = 1,
+
+};
+
+ Device configuration layout
+
+At his point none currently defined.
+
+ Device Initialization
+
+ The initialization routine should identify the receive and
+ transmission virtqueues.
+
+ The receive virtqueue should be filled with receive buffers.
+
+ Device Operation
+
+Messages are transmitted by placing them in the transmitq, and
+buffers for inbound messages are placed in the receiveq. In any
+case, messages are always preceded by the following header: /**
+
+ * struct rpmsg_hdr - common header for all rpmsg messages
+
+ * @src: source address
+
+ * @dst: destination address
+
+ * @reserved: reserved for future use
+
+ * @len: length of payload (in bytes)
+
+ * @flags: message flags
+
+ * @data: @len bytes of message payload data
+
+ *
+
+ * Every message sent(/received) on the rpmsg bus begins with
+this header.
+
+ */
+
+struct rpmsg_hdr {
+
+ u32 src;
+
+ u32 dst;
+
+ u32 reserved;
+
+ u16 len;
+
+ u16 flags;
+
+ u8 data[0];
+
+} __packed;
+
+Appendix I: SCSI Host Device
+
+The virtio SCSI host device groups together one or more virtual
+logical units (such as disks), and allows communicating to them
+using the SCSI protocol. An instance of the device represents a
+SCSI host to which many targets and LUNs are attached.
+
+The virtio SCSI device services two kinds of requests:
+
+ command requests for a logical unit;
+
+ task management functions related to a logical unit, target or
+ command.
+
+The device is also able to send out notifications about added and
+removed logical units. Together, these capabilities provide a
+SCSI transport protocol that uses virtqueues as the transfer
+medium. In the transport protocol, the virtio driver acts as the
+initiator, while the virtio SCSI host provides one or more
+targets that receive and process the requests.
+
+ Configuration
+
+ Subsystem Device ID 8
+
+ Virtqueues 0:controlq; 1:eventq; 2..n:request queues.
+
+ Feature bits
+
+ VIRTIO_SCSI_F_INOUT (0) A single request can include both
+ read-only and write-only data buffers.
+
+ VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
+ hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.
+
+ Device configuration layout All fields of this configuration
+ are always available. sense_size and cdb_size are writable by
+ the guest.struct virtio_scsi_config {
+
+ u32 num_queues;
+
+ u32 seg_max;
+
+ u32 max_sectors;
+
+ u32 cmd_per_lun;
+
+ u32 event_info_size;
+
+ u32 sense_size;
+
+ u32 cdb_size;
+
+ u16 max_channel;
+
+ u16 max_target;
+
+ u32 max_lun;
+
+};
+
+ num_queues is the total number of request virtqueues exposed by
+ the device. The driver is free to use only one request queue,
+ or it can use more to achieve better performance.
+
+ seg_max is the maximum number of segments that can be in a
+ command. A bidirectional command can include seg_max input
+ segments and seg_max output segments.
+
+ max_sectors is a hint to the guest about the maximum transfer
+ size it should use.
+
+ cmd_per_lun is a hint to the guest about the maximum number of
+ linked commands it should send to one LUN. The actual value
+ to be used is the minimum of cmd_per_lun and the virtqueue
+ size.
+
+ event_info_size is the maximum size that the device will fill
+ for buffers that the driver places in the eventq. The driver
+ should always put buffers at least of this size. It is
+ written by the device depending on the set of negotated
+ features.
+
+ sense_size is the maximum size of the sense data that the
+ device will write. The default value is written by the device
+ and will always be 96, but the driver can modify it. It is
+ restored to the default when the device is reset.
+
+ cdb_size is the maximum size of the CDB that the driver will
+ write. The default value is written by the device and will
+ always be 32, but the driver can likewise modify it. It is
+ restored to the default when the device is reset.
+
+ max_channel, max_target and max_lun can be used by the driver
+ as hints to constrain scanning the logical units on the
+ host.h
+
+ Device Initialization
+
+The initialization routine should first of all discover the
+device's virtqueues.
+
+If the driver uses the eventq, it should then place at least a
+buffer in the eventq.
+
+The driver can immediately issue requests (for example, INQUIRY
+or REPORT LUNS) or task management functions (for example, I_T
+RESET).
+
+ Device Operation: request queues
+
+The driver queues requests to an arbitrary request queue, and
+they are used by the device on that same queue. It is the
+responsibility of the driver to ensure strict request ordering
+for commands placed on different queues, because they will be
+consumed with no order constraints.
+
+Requests have the following format:
+
+struct virtio_scsi_req_cmd {
+
+ // Read-only
+
+ u8 lun[8];
+
+ u64 id;
+
+ u8 task_attr;
+
+ u8 prio;
+
+ u8 crn;
+
+ char cdb[cdb_size];
+
+ char dataout[];
+
+ // Write-only part
+
+ u32 sense_len;
+
+ u32 residual;
+
+ u16 status_qualifier;
+
+ u8 status;
+
+ u8 response;
+
+ u8 sense[sense_size];
+
+ char datain[];
+
+};
+
+
+
+/* command-specific response values */
+
+#define VIRTIO_SCSI_S_OK 0
+
+#define VIRTIO_SCSI_S_OVERRUN 1
+
+#define VIRTIO_SCSI_S_ABORTED 2
+
+#define VIRTIO_SCSI_S_BAD_TARGET 3
+
+#define VIRTIO_SCSI_S_RESET 4
+
+#define VIRTIO_SCSI_S_BUSY 5
+
+#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+
+#define VIRTIO_SCSI_S_TARGET_FAILURE 7
+
+#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+
+#define VIRTIO_SCSI_S_FAILURE 9
+
+
+
+/* task_attr */
+
+#define VIRTIO_SCSI_S_SIMPLE 0
+
+#define VIRTIO_SCSI_S_ORDERED 1
+
+#define VIRTIO_SCSI_S_HEAD 2
+
+#define VIRTIO_SCSI_S_ACA 3
+
+The lun field addresses a target and logical unit in the
+virtio-scsi device's SCSI domain. The only supported format for
+the LUN field is: first byte set to 1, second byte set to target,
+third and fourth byte representing a single level LUN structure,
+followed by four zero bytes. With this representation, a
+virtio-scsi device can serve up to 256 targets and 16384 LUNs per
+target.
+
+The id field is the command identifier (“tag”).
+
+task_attr, prio and crn should be left to zero. task_attr defines
+the task attribute as in the table above, but all task attributes
+may be mapped to SIMPLE by the device; crn may also be provided
+by clients, but is generally expected to be 0. The maximum CRN
+value defined by the protocol is 255, since CRN is stored in an
+8-bit integer.
+
+All of these fields are defined in SAM. They are always
+read-only, as are the cdb and dataout field. The cdb_size is
+taken from the configuration space.
+
+sense and subsequent fields are always write-only. The sense_len
+field indicates the number of bytes actually written to the sense
+buffer. The residual field indicates the residual size,
+calculated as “data_length - number_of_transferred_bytes”, for
+read or write operations. For bidirectional commands, the
+number_of_transferred_bytes includes both read and written bytes.
+A residual field that is less than the size of datain means that
+the dataout field was processed entirely. A residual field that
+exceeds the size of datain means that the dataout field was
+processed partially and the datain field was not processed at
+all.
+
+The status byte is written by the device to be the status code as
+defined in SAM.
+
+The response byte is written by the device to be one of the
+following:
+
+ VIRTIO_SCSI_S_OK when the request was completed and the status
+ byte is filled with a SCSI status code (not necessarily
+ "GOOD").
+
+ VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires
+ transferring more data than is available in the data buffers.
+
+ VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
+ ABORT TASK or ABORT TASK SET task management function.
+
+ VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
+ because the target indicated by the lun field does not exist.
+
+ VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
+ or device reset (including a task management function).
+
+ VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
+ problem in the connection between the host and the target
+ (severed link).
+
+ VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a
+ failure and the guest should not retry on other paths.
+
+ VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
+ but retrying on other paths might yield a different result.
+
+ VIRTIO_SCSI_S_BUSY if the request failed but retrying on the
+ same path should work.
+
+ VIRTIO_SCSI_S_FAILURE for other host or guest error. In
+ particular, if neither dataout nor datain is empty, and the
+ VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
+ request will be immediately returned with a response equal to
+ VIRTIO_SCSI_S_FAILURE.
+
+ Device Operation: controlq
+
+The controlq is used for other SCSI transport operations.
+Requests have the following format:
+
+struct virtio_scsi_ctrl {
+
+ u32 type;
+
+ ...
+
+ u8 response;
+
+};
+
+
+
+/* response values valid for all commands */
+
+#define VIRTIO_SCSI_S_OK 0
+
+#define VIRTIO_SCSI_S_BAD_TARGET 3
+
+#define VIRTIO_SCSI_S_BUSY 5
+
+#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
+
+#define VIRTIO_SCSI_S_TARGET_FAILURE 7
+
+#define VIRTIO_SCSI_S_NEXUS_FAILURE 8
+
+#define VIRTIO_SCSI_S_FAILURE 9
+
+#define VIRTIO_SCSI_S_INCORRECT_LUN 12
+
+The type identifies the remaining fields.
+
+The following commands are defined:
+
+ Task management function
+#define VIRTIO_SCSI_T_TMF 0
+
+
+
+#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0
+
+#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1
+
+#define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2
+
+#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3
+
+#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4
+
+#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5
+
+#define VIRTIO_SCSI_T_TMF_QUERY_TASK 6
+
+#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7
+
+
+
+struct virtio_scsi_ctrl_tmf
+
+{
+
+ // Read-only part
+
+ u32 type;
+
+ u32 subtype;
+
+ u8 lun[8];
+
+ u64 id;
+
+ // Write-only part
+
+ u8 response;
+
+}
+
+
+
+/* command-specific response values */
+
+#define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0
+
+#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10
+
+#define VIRTIO_SCSI_S_FUNCTION_REJECTED 11
+
+ The type is VIRTIO_SCSI_T_TMF; the subtype field defines. All
+ fields except response are filled by the driver. The subtype
+ field must always be specified and identifies the requested
+ task management function.
+
+ Other fields may be irrelevant for the requested TMF; if so,
+ they are ignored but they should still be present. The lun
+ field is in the same format specified for request queues; the
+ single level LUN is ignored when the task management function
+ addresses a whole I_T nexus. When relevant, the value of the id
+ field is matched against the id values passed on the requestq.
+
+ The outcome of the task management function is written by the
+ device in the response field. The command-specific response
+ values map 1-to-1 with those defined in SAM.
+
+ Asynchronous notification query
+#define VIRTIO_SCSI_T_AN_QUERY 1
+
+
+
+struct virtio_scsi_ctrl_an {
+
+ // Read-only part
+
+ u32 type;
+
+ u8 lun[8];
+
+ u32 event_requested;
+
+ // Write-only part
+
+ u32 event_actual;
+
+ u8 response;
+
+}
+
+
+
+#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2
+
+#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4
+
+#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8
+
+#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16
+
+#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32
+
+#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64
+
+ By sending this command, the driver asks the device which
+ events the given LUN can report, as described in paragraphs 6.6
+ and A.6 of the SCSI MMC specification. The driver writes the
+ events it is interested in into the event_requested; the device
+ responds by writing the events that it supports into
+ event_actual.
+
+ The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested
+ fields are written by the driver. The event_actual and response
+ fields are written by the device.
+
+ No command-specific values are defined for the response byte.
+
+ Asynchronous notification subscription
+#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2
+
+
+
+struct virtio_scsi_ctrl_an {
+
+ // Read-only part
+
+ u32 type;
+
+ u8 lun[8];
+
+ u32 event_requested;
+
+ // Write-only part
+
+ u32 event_actual;
+
+ u8 response;
+
+}
+
+ By sending this command, the driver asks the specified LUN to
+ report events for its physical interface, again as described in
+ the SCSI MMC specification. The driver writes the events it is
+ interested in into the event_requested; the device responds by
+ writing the events that it supports into event_actual.
+
+ Event types are the same as for the asynchronous notification
+ query message.
+
+ The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and
+ event_requested fields are written by the driver. The
+ event_actual and response fields are written by the device.
+
+ No command-specific values are defined for the response byte.
+
+ Device Operation: eventq
+
+The eventq is used by the device to report information on logical
+units that are attached to it. The driver should always leave a
+few buffers ready in the eventq. In general, the device will not
+queue events to cope with an empty eventq, and will end up
+dropping events if it finds no buffer ready. However, when
+reporting events for many LUNs (e.g. when a whole target
+disappears), the device can throttle events to avoid dropping
+them. For this reason, placing 10-15 buffers on the event queue
+should be enough.
+
+Buffers are placed in the eventq and filled by the device when
+interesting events occur. The buffers should be strictly
+write-only (device-filled) and the size of the buffers should be
+at least the value given in the device's configuration
+information.
+
+Buffers returned by the device on the eventq will be referred to
+as "events" in the rest of this section. Events have the
+following format:
+
+#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000
+
+
+
+struct virtio_scsi_event {
+
+ // Write-only part
+
+ u32 event;
+
+ ...
+
+}
+
+If bit 31 is set in the event field, the device failed to report
+an event due to missing buffers. In this case, the driver should
+poll the logical units for unit attention conditions, and/or do
+whatever form of bus scan is appropriate for the guest operating
+system.
+
+Other data that the device writes to the buffer depends on the
+contents of the event field. The following events are defined:
+
+ No event
+#define VIRTIO_SCSI_T_NO_EVENT 0
+
+ This event is fired in the following cases:
+
+ When the device detects in the eventq a buffer that is shorter
+ than what is indicated in the configuration field, it might
+ use it immediately and put this dummy value in the event
+ field. A well-written driver will never observe this
+ situation.
+
+ When events are dropped, the device may signal this event as
+ soon as the drivers makes a buffer available, in order to
+ request action from the driver. In this case, of course, this
+ event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED
+ flag.
+
+ Transport reset
+#define VIRTIO_SCSI_T_TRANSPORT_RESET 1
+
+
+
+struct virtio_scsi_event_reset {
+
+ // Write-only part
+
+ u32 event;
+
+ u8 lun[8];
+
+ u32 reason;
+
+}
+
+
+
+#define VIRTIO_SCSI_EVT_RESET_HARD 0
+
+#define VIRTIO_SCSI_EVT_RESET_RESCAN 1
+
+#define VIRTIO_SCSI_EVT_RESET_REMOVED 2
+
+ By sending this event, the device signals that a logical unit
+ on a target has been reset, including the case of a new device
+ appearing or disappearing on the bus.The device fills in all
+ fields. The event field is set to
+ VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a
+ logical unit in the SCSI host.
+
+ The reason value is one of the three #define values appearing
+ above:
+
+ VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used if
+ the target or logical unit is no longer able to receive
+ commands.
+
+ VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the
+ logical unit has been reset, but is still present.
+
+ VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a
+ target or logical unit has just appeared on the device.
+
+ The “removed” and “rescan” events, when sent for LUN 0, may
+ apply to the entire target. After receiving them the driver
+ should ask the initiator to rescan the target, in order to
+ detect the case when an entire target has appeared or
+ disappeared. These two events will never be reported unless the
+ VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host
+ and the guest.
+
+ Events will also be reported via sense codes (this obviously
+ does not apply to newly appeared buses or targets, since the
+ application has never discovered them):
+
+ “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc
+ 0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED)
+
+ “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29
+ (POWER ON, RESET OR BUS DEVICE RESET OCCURRED)
+
+ “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f,
+ ascq 0x0e (REPORTED LUNS DATA HAS CHANGED)
+
+ The preferred way to detect transport reset is always to use
+ events, because sense codes are only seen by the driver when it
+ sends a SCSI command to the logical unit or target. However, in
+ case events are dropped, the initiator will still be able to
+ synchronize with the actual state of the controller if the
+ driver asks the initiator to rescan of the SCSI bus. During the
+ rescan, the initiator will be able to observe the above sense
+ codes, and it will process them as if it the driver had
+ received the equivalent event.
+
+ Asynchronous notification
+#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2
+
+
+
+struct virtio_scsi_event_an {
+
+ // Write-only part
+
+ u32 event;
+
+ u8 lun[8];
+
+ u32 reason;
+
+}
+
+ By sending this event, the device signals that an asynchronous
+ event was fired from a physical interface.
+
+ All fields are written by the device. The event field is set to
+ VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical
+ unit in the SCSI host. The reason field is a subset of the
+ events that the driver has subscribed to via the "Asynchronous
+ notification subscription" command.
+
+ When dropped events are reported, the driver should poll for
+ asynchronous events manually using SCSI commands.
+
+Appendix X: virtio-mmio
+
+Virtual environments without PCI support (a common situation in
+embedded devices models) might use simple memory mapped device (“
+virtio-mmio”) instead of the PCI device.
+
+The memory mapped virtio device behaviour is based on the PCI
+device specification. Therefore most of operations like device
+initialization, queues configuration and buffer transfers are
+nearly identical. Existing differences are described in the
+following sections.
+
+ Device Initialization
+
+Instead of using the PCI IO space for virtio header, the “
+virtio-mmio” device provides a set of memory mapped control
+registers, all 32 bits wide, followed by device-specific
+configuration space. The following list presents their layout:
+
+ Offset from the device base address | Direction | Name
+ Description
+
+ 0x000 | R | MagicValue
+ “virt” string.
+
+ 0x004 | R | Version
+ Device version number. Currently must be 1.
+
+ 0x008 | R | DeviceID
+ Virtio Subsystem Device ID (ie. 1 for network card).
+
+ 0x00c | R | VendorID
+ Virtio Subsystem Vendor ID.
+
+ 0x010 | R | HostFeatures
+ Flags representing features the device supports.
+ Reading from this register returns 32 consecutive flag bits,
+ first bit depending on the last value written to
+ HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32
+
+ to (HostFeaturesSel*32)+31
+, eg. feature bits 0 to 31 if
+ HostFeaturesSel is set to 0 and features bits 32 to 63 if
+ HostFeaturesSel is set to 1. Also see [sub:Feature-Bits]
+
+ 0x014 | W | HostFeaturesSel
+ Device (Host) features word selection.
+ Writing to this register selects a set of 32 device feature bits
+ accessible by reading from HostFeatures register. Device driver
+ must write a value to the HostFeaturesSel register before
+ reading from the HostFeatures register.
+
+ 0x020 | W | GuestFeatures
+ Flags representing device features understood and activated by
+ the driver.
+ Writing to this register sets 32 consecutive flag bits, first
+ bit depending on the last value written to GuestFeaturesSel
+ register. Access to this register sets bits GuestFeaturesSel*32
+
+ to (GuestFeaturesSel*32)+31
+, eg. feature bits 0 to 31 if
+ GuestFeaturesSel is set to 0 and features bits 32 to 63 if
+ GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits]
+
+ 0x024 | W | GuestFeaturesSel
+ Activated (Guest) features word selection.
+ Writing to this register selects a set of 32 activated feature
+ bits accessible by writing to the GuestFeatures register.
+ Device driver must write a value to the GuestFeaturesSel
+ register before writing to the GuestFeatures register.
+
+ 0x028 | W | GuestPageSize
+ Guest page size.
+ Device driver must write the guest page size in bytes to the
+ register during initialization, before any queues are used.
+ This value must be a power of 2 and is used by the Host to
+ calculate Guest address of the first queue page (see QueuePFN).
+
+ 0x030 | W | QueueSel
+ Virtual queue index (first queue is 0).
+ Writing to this register selects the virtual queue that the
+ following operations on QueueNum, QueueAlign and QueuePFN apply
+ to.
+
+ 0x034 | R | QueueNumMax
+ Maximum virtual queue size.
+ Reading from the register returns the maximum size of the queue
+ the Host is ready to process or zero (0x0) if the queue is not
+ available. This applies to the queue selected by writing to
+ QueueSel and is allowed only when QueuePFN is set to zero
+ (0x0), so when the queue is not actively used.
+
+ 0x038 | W | QueueNum
+ Virtual queue size.
+ Queue size is a number of elements in the queue, therefore size
+ of the descriptor table and both available and used rings.
+ Writing to this register notifies the Host what size of the
+ queue the Guest will use. This applies to the queue selected by
+ writing to QueueSel.
+
+ 0x03c | W | QueueAlign
+ Used Ring alignment in the virtual queue.
+ Writing to this register notifies the Host about alignment
+ boundary of the Used Ring in bytes. This value must be a power
+ of 2 and applies to the queue selected by writing to QueueSel.
+
+ 0x040 | RW | QueuePFN
+ Guest physical page number of the virtual queue.
+ Writing to this register notifies the host about location of the
+ virtual queue in the Guest's physical address space. This value
+ is the index number of a page starting with the queue
+ Descriptor Table. Value zero (0x0) means physical address zero
+ (0x00000000) and is illegal. When the Guest stops using the
+ queue it must write zero (0x0) to this register.
+ Reading from this register returns the currently used page
+ number of the queue, therefore a value other than zero (0x0)
+ means that the queue is in use.
+ Both read and write accesses apply to the queue selected by
+ writing to QueueSel.
+
+ 0x050 | W | QueueNotify
+ Queue notifier.
+ Writing a queue index to this register notifies the Host that
+ there are new buffers to process in the queue.
+
+ 0x60 | R | InterruptStatus
+Interrupt status.
+Reading from this register returns a bit mask of interrupts
+ asserted by the device. An interrupt is asserted if the
+ corresponding bit is set, ie. equals one (1).
+
+ Bit 0 | Used Ring Update
+This interrupt is asserted when the Host has updated the Used
+ Ring in at least one of the active virtual queues.
+
+ Bit 1 | Configuration change
+This interrupt is asserted when configuration of the device has
+ changed.
+
+ 0x064 | W | InterruptACK
+ Interrupt acknowledge.
+ Writing to this register notifies the Host that the Guest
+ finished handling interrupts. Set bits in the value clear the
+ corresponding bits of the InterruptStatus register.
+
+ 0x070 | RW | Status
+ Device status.
+ Reading from this register returns the current device status
+ flags.
+ Writing non-zero values to this register sets the status flags,
+ indicating the Guest progress. Writing zero (0x0) to this
+ register triggers a device reset.
+ Also see [sub:Device-Initialization-Sequence]
+
+ 0x100+ | RW | Config
+ Device-specific configuration space starts at an offset 0x100
+ and is accessed with byte alignment. Its meaning and size
+ depends on the device and the driver.
+
+Virtual queue size is a number of elements in the queue,
+therefore size of the descriptor table and both available and
+used rings.
+
+The endianness of the registers follows the native endianness of
+the Guest. Writing to registers described as “R” and reading from
+registers described as “W” is not permitted and can cause
+undefined behavior.
+
+The device initialization is performed as described in [sub:Device-Initialization-Sequence]
+ with one exception: the Guest must notify the Host about its
+page size, writing the size in bytes to GuestPageSize register
+before the initialization is finished.
+
+The memory mapped virtio devices generate single interrupt only,
+therefore no special configuration is required.
+
+ Virtqueue Configuration
+
+The virtual queue configuration is performed in a similar way to
+the one described in [sec:Virtqueue-Configuration] with a few
+additional operations:
+
+ Select the queue writing its index (first queue is 0) to the
+ QueueSel register.
+
+ Check if the queue is not already in use: read QueuePFN
+ register, returned value should be zero (0x0).
+
+ Read maximum queue size (number of elements) from the
+ QueueNumMax register. If the returned value is zero (0x0) the
+ queue is not available.
+
+ Allocate and zero the queue pages in contiguous virtual memory,
+ aligning the Used Ring to an optimal boundary (usually page
+ size). Size of the allocated queue may be smaller than or equal
+ to the maximum size returned by the Host.
+
+ Notify the Host about the queue size by writing the size to
+ QueueNum register.
+
+ Notify the Host about the used alignment by writing its value
+ in bytes to QueueAlign register.
+
+ Write the physical number of the first page of the queue to the
+ QueuePFN register.
+
+The queue and the device are ready to begin normal operations
+now.
+
+ Device Operation
+
+The memory mapped virtio device behaves in the same way as
+described in [sec:Device-Operation], with the following
+exceptions:
+
+ The device is notified about new buffers available in a queue
+ by writing the queue index to register QueueNum instead of the
+ virtio header in PCI I/O space ([sub:Notifying-The-Device]).
+
+ The memory mapped virtio device is using single, dedicated
+ interrupt signal, which is raised when at least one of the
+ interrupts described in the InterruptStatus register
+ description is asserted. After receiving an interrupt, the
+ driver must read the InterruptStatus register to check what
+ caused the interrupt (see the register description). After the
+ interrupt is handled, the driver must acknowledge it by writing
+ a bit mask corresponding to the serviced interrupt to the
+ InterruptACK register.
+