Diffstat (limited to 'Documentation/virt/hyperv'):

 Documentation/virt/hyperv/clocks.rst      |  82 +
 Documentation/virt/hyperv/coco.rst        | 260 +
 Documentation/virt/hyperv/hibernation.rst | 336 +
 Documentation/virt/hyperv/index.rst       |  15 +
 Documentation/virt/hyperv/overview.rst    | 207 +
 Documentation/virt/hyperv/vmbus.rst       | 346 +
 Documentation/virt/hyperv/vpci.rst        | 316 +
 7 files changed, 1562 insertions(+), 0 deletions(-)
diff --git a/Documentation/virt/hyperv/clocks.rst b/Documentation/virt/hyperv/clocks.rst new file mode 100644 index 000000000000..176043265803 --- /dev/null +++ b/Documentation/virt/hyperv/clocks.rst @@ -0,0 +1,82 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Clocks and Timers +================= + +arm64 +----- +On arm64, Hyper-V virtualizes the ARMv8 architectural system counter +and timer. Guest VMs use this virtualized hardware as the Linux +clocksource and clockevents via the standard arm_arch_timer.c +driver, just as they would on bare metal. Linux vDSO support for the +architectural system counter is functional in guest VMs on Hyper-V. +While Hyper-V also provides a synthetic system clock and four synthetic +per-CPU timers as described in the TLFS, they are not used by the +Linux kernel in a Hyper-V guest on arm64. However, older versions +of Hyper-V for arm64 only partially virtualize the ARMv8 +architectural timer, such that the timer does not generate +interrupts in the VM. Because of this limitation, running current +Linux kernel versions on these older Hyper-V versions requires an +out-of-tree patch to use the Hyper-V synthetic clocks/timers instead. + +x86/x64 +------- +On x86/x64, Hyper-V provides guest VMs with a synthetic system clock +and four synthetic per-CPU timers as described in the TLFS. Hyper-V +also provides access to the virtualized TSC via the RDTSC and +related instructions. These TSC instructions do not trap to +the hypervisor and so provide excellent performance in a VM. +Hyper-V performs TSC calibration, and provides the TSC frequency +to the guest VM via a synthetic MSR. Hyper-V initialization code +in Linux reads this MSR to get the frequency, so it skips TSC +calibration and sets tsc_reliable. Hyper-V provides virtualized +versions of the PIT (in Hyper-V Generation 1 VMs only), local +APIC timer, and RTC. Hyper-V does not provide a virtualized HPET in +guest VMs. + +The Hyper-V synthetic system clock can be read via a synthetic MSR, +but this access traps to the hypervisor. As a faster alternative, +the guest can configure a memory page to be shared between the guest +and the hypervisor. Hyper-V populates this memory page with a +64-bit scale value and offset value. To read the synthetic clock +value, the guest reads the TSC and then applies the scale and offset +as described in the Hyper-V TLFS. The resulting value advances +at a constant 10 MHz frequency. In the case of a live migration +to a host with a different TSC frequency, Hyper-V adjusts the +scale and offset values in the shared page so that the 10 MHz +frequency is maintained. + +Starting with Windows Server 2022 Hyper-V, Hyper-V uses hardware +support for TSC frequency scaling to enable live migration of VMs +across Hyper-V hosts where the TSC frequency may be different. +When a Linux guest detects that this Hyper-V functionality is +available, it prefers to use Linux's standard TSC-based clocksource. +Otherwise, it uses the clocksource for the Hyper-V synthetic system +clock implemented via the shared page (identified as +"hyperv_clocksource_tsc_page"). + +The Hyper-V synthetic system clock is available to user space via +vDSO, and gettimeofday() and related system calls can execute +entirely in user space. The vDSO is implemented by mapping the +shared page with scale and offset values into user space. User +space code performs the same algorithm of reading the TSC and +applying the scale and offset to get the constant 10 MHz clock. 
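For illustration, the read algorithm has roughly the following shape. This is a simplified sketch, not the kernel's implementation (which is in drivers/clocksource/hyperv_timer.c); the structure and field names are abbreviated from the TLFS definition::

  struct hv_ref_tsc_page {            /* simplified */
          u32 tsc_sequence;           /* 0 means the page is not valid */
          u32 reserved;
          u64 tsc_scale;
          s64 tsc_offset;
  };

  /* Returns time in 100 ns units (the 10 MHz clock), or 0 if the page
   * is invalid and the caller must fall back to the synthetic MSR. */
  static u64 read_hv_reference_time(const struct hv_ref_tsc_page *tp)
  {
          u32 seq;
          u64 time;

          do {
                  seq = READ_ONCE(tp->tsc_sequence);
                  if (seq == 0)
                          return 0;
                  /* High 64 bits of the 128-bit product, plus the offset */
                  time = mul_u64_u64_shr(rdtsc(), tp->tsc_scale, 64) +
                         tp->tsc_offset;
          } while (READ_ONCE(tp->tsc_sequence) != seq);

          return time;
  }

The sequence re-check guards against Hyper-V updating the scale and offset (for example, during a live migration) in the middle of a read.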
+ +Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0). +While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses +timer 0. In older versions of Hyper-V, an interrupt from stimer0 +results in a VMBus control message that is demultiplexed by +vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst +documentation. In newer versions of Hyper-V, stimer0 interrupts can +be mapped to an architectural interrupt, which is referred to as +"Direct Mode". Linux prefers to use Direct Mode when available. Since +x86/x64 doesn't support per-CPU interrupts, Direct Mode statically +allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs +and explicitly codes it to call the stimer0 interrupt handler. Hence +interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts +rather than being associated with a Linux IRQ. Clockevents based on the +virtualized PIT and local APIC timer also work, but Hyper-V stimer0 +is preferred. + +The driver for the Hyper-V synthetic system clock and timers is +drivers/clocksource/hyperv_timer.c. diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst new file mode 100644 index 000000000000..c15d6fe34b4e --- /dev/null +++ b/Documentation/virt/hyperv/coco.rst @@ -0,0 +1,260 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Confidential Computing VMs +========================== +Hyper-V can create and run Linux guests that are Confidential Computing +(CoCo) VMs. Such VMs cooperate with the physical processor to better protect +the confidentiality and integrity of data in the VM's memory, even in the +face of a hypervisor/VMM that has been compromised and may behave maliciously. +CoCo VMs on Hyper-V share the generic CoCo VM threat model and security +objectives described in Documentation/security/snp-tdx-threat-model.rst. Note +that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or +"isolation VMs". + +A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the +following: + +* Physical hardware with a processor that supports CoCo VMs + +* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs + +* The VM runs a version of Linux that supports being a CoCo VM + +The physical hardware requirements are as follows: + +* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME, + SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo + VM on Hyper-V. + +* Intel processor with TDX + +To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V +when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM, +or vice versa, after it is created. + +Operational Modes +----------------- +Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is +created and cannot be changed during the life of the VM. + +* Fully-enlightened mode. In this mode, the guest operating system is + enlightened to understand and manage all aspects of running as a CoCo VM. + +* Paravisor mode. In this mode, a paravisor layer between the guest and the + host provides some operations needed to run as a CoCo VM. The guest operating + system can have fewer CoCo enlightenments than is required in the + fully-enlightened case. + +Conceptually, fully-enlightened mode and paravisor mode may be treated as +points on a spectrum spanning the degree of guest enlightenment needed to run +as a CoCo VM. Fully-enlightened mode is one end of the spectrum. 
A full implementation of paravisor mode is the other end of the spectrum, where all aspects of running as a CoCo VM are handled by the paravisor, and a normal guest OS with no knowledge of memory encryption or other aspects of CoCo VMs can run successfully. However, the Hyper-V implementation of paravisor mode does not go this far, and is somewhere in the middle of the spectrum. Some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects. Unfortunately, there is no standardized enumeration of the features/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the features/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS.

Paravisor mode has similarities to the `Coconut project`_, which aims to provide a limited paravisor offering services to the guest such as a virtual TPM. However, the Hyper-V paravisor generally handles more aspects of CoCo VMs than is currently envisioned for Coconut, and so is further toward the "no guest enlightenments required" end of the spectrum.

.. _Coconut project: https://github.com/coconut-svsm/svsm

In the CoCo VM threat model, the paravisor is in the guest security domain and must be trusted by the guest OS. By implication, the hypervisor/VMM must protect itself against a potentially malicious paravisor just like it protects against a potentially malicious guest.

The hardware architectural approach to fully-enlightened vs. paravisor mode varies depending on the underlying processor.

* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in VMPL 0 and has full control of the guest context. In paravisor mode, the guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have. Certain operations require the guest to invoke the paravisor. Furthermore, in paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode as defined by the SEV-SNP architecture. This mode simplifies guest management of memory encryption when a paravisor is used.

* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the L1 VM, and the guest OS runs in a nested L2 VM.

Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This MSR indicates whether the underlying processor uses AMD SEV-SNP or Intel TDX, and whether a paravisor is being used. It is straightforward to build a single kernel image that can boot and run properly on either architecture, and in either mode.

Paravisor Effects
-----------------
Running in paravisor mode affects the following areas of generic Linux kernel CoCo VM functionality:

* Initial guest memory setup. When a new VM is created in paravisor mode, the paravisor runs first and sets up the guest physical memory as encrypted. The guest Linux does normal memory initialization, except for explicitly marking appropriate ranges as decrypted (shared). In paravisor mode, Linux does not perform the early boot memory setup steps that are particularly tricky with AMD SEV-SNP in fully-enlightened mode.

* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM, respectively, and not the guest Linux. Consequently, these exception handlers do not run in the guest Linux and are not a required enlightenment for a Linux guest in paravisor mode.
* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the guest indicating that the VM is operating with the respective hardware support. While these CPUID flags are visible in fully-enlightened CoCo VMs, the paravisor filters out these flags and the guest Linux does not see them. Throughout the Linux kernel, explicit testing of these flags has mostly been eliminated in favor of the cc_platform_has() function, with the goal of abstracting the differences between SEV-SNP and TDX (a sketch of this pattern follows the "Hyper-V Hypercalls" section below). The cc_platform_has() abstraction also allows the Hyper-V paravisor configuration to selectively enable aspects of CoCo VM functionality even when the CPUID flags are not set. The exception is early boot memory setup on SEV-SNP, which tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor mode VM achieves the desired effect of not running SEV-SNP specific early boot memory setup.

* Device emulation. In paravisor mode, the Hyper-V paravisor provides emulation of devices such as the IO-APIC and TPM. Because the emulation happens in the paravisor in the guest context (instead of the hypervisor/VMM context), MMIO accesses to these devices must be encrypted references instead of the decrypted references that would be used in a fully-enlightened CoCo VM. The __ioremap_caller() function has been enhanced to make a callback to check whether a particular address range should be treated as encrypted (private). See the "is_private_mmio" callback.

* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest memory between encrypted and decrypted requires coordinating with the hypervisor/VMM. This is done via callbacks invoked from __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V specific set of callbacks is used. These callbacks invoke the paravisor so that the paravisor can coordinate the transitions and inform the hypervisor as necessary. See hv_vtom_init() where these callbacks are set up.

* Interrupt injection. In fully-enlightened mode, a malicious hypervisor could inject interrupts into the guest OS at times that violate x86/x64 architectural rules. For full protection, the guest OS should include enlightenments that use the interrupt injection management features provided by CoCo-capable processors. In paravisor mode, the paravisor mediates interrupt injection into the guest OS, and ensures that the guest OS only sees interrupts that are "legal". The paravisor uses the interrupt injection management features provided by the CoCo-capable physical processor, thereby masking these complexities from the guest OS.

Hyper-V Hypercalls
------------------
When in fully-enlightened mode, hypercalls made by the Linux guest are routed directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode, normal hypercalls trap to the paravisor first, which may in turn invoke the hypervisor. The paravisor is idiosyncratic in this regard, and a few hypercalls made by the Linux guest must always be routed directly to the hypervisor. These hypercall sites test for a paravisor being present, and use a special invocation sequence. See hv_post_message(), for example.
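Referring back to the CPUID flags discussion above, kernel code gates CoCo-specific behavior on cc_platform_has() attribute queries rather than on the vendor CPUID flags. A minimal sketch of the pattern (the attribute chosen and the surrounding function are illustrative)::

  #include <linux/cc_platform.h>

  static void example_coco_setup(void)
  {
          /* True in a CoCo VM even when a paravisor has filtered out
           * the SEV-SNP/TDX CPUID flags; hv_vtom_init() arranges for
           * the right attributes to be reported in paravisor mode. */
          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
                  /* Guest memory is private (encrypted) by default;
                   * anything shared with the host must be explicitly
                   * transitioned to decrypted. */
          }
  }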
+ +Guest communication with Hyper-V +-------------------------------- +Separate from the generic Linux kernel handling of memory encryption in Linux +CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory +shared between the Linux guest and the host. This shared memory must be +marked decrypted to enable communication. Furthermore, since the threat model +includes a compromised and potentially malicious host, the guest must guard +against leaking any unintended data to the host through this shared memory. + +These Hyper-V and VMBus memory pages are marked as decrypted: + +* VMBus monitor pages + +* Synthetic interrupt controller (synic) related pages (unless supplied by + the paravisor) + +* Per-cpu hypercall input and output pages (unless running with a paravisor) + +* VMBus ring buffers. The direct mapping is marked decrypted in + __vmbus_establish_gpadl(). The secondary mapping created in + hv_ringbuffer_init() must also include the "decrypted" attribute. + +When the guest writes data to memory that is shared with the host, it must +ensure that only the intended data is written. Padding or unused fields must +be initialized to zeros before copying into the shared memory so that random +kernel data is not inadvertently given to the host. + +Similarly, when the guest reads memory that is shared with the host, it must +validate the data before acting on it so that a malicious host cannot induce +the guest to expose unintended data. Doing such validation can be tricky +because the host can modify the shared memory areas even while or after +validation is performed. For messages passed from the host to the guest in a +VMBus ring buffer, the length of the message is validated, and the message is +copied into a temporary (encrypted) buffer for further validation and +processing. The copying adds a small amount of overhead, but is the only way +to protect against a malicious host. See hv_pkt_iter_first(). + +Many drivers for VMBus devices have been "hardened" by adding code to fully +validate messages received over VMBus, instead of assuming that Hyper-V is +acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the +vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a +CoCo VM have not been hardened, and they are not allowed to load in a CoCo +VM. See vmbus_is_valid_offer() where such devices are excluded. + +Two VMBus devices depend on the Hyper-V host to do DMA data transfers: +storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal +Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb +memory is done implicitly. netvsc has two modes for data transfers. The first +mode goes through send and receive buffer space that is explicitly allocated +by the netvsc driver, and is used for most smaller packets. These send and +receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because +the netvsc driver explicitly copies packets to/from these buffers, the +equivalent of bounce buffering between encrypted and decrypted memory is +already part of the data path. The second mode uses the normal Linux kernel +DMA APIs, and is bounce buffered through swiotlb memory implicitly like in +storvsc. + +Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM. +Linux PCI device drivers access PCI config space using standard APIs provided +by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO +space, and the access traps to Hyper-V for emulation. 
But in CoCo VMs, memory encryption prevents Hyper-V from reading the guest instruction stream to emulate the access. So in a CoCo VM, these functions must make a hypercall with arguments explicitly describing the access. See _hv_pcifront_read_config() and _hv_pcifront_write_config() and the "use_calls" flag indicating to use hypercalls.

load_unaligned_zeropad()
------------------------
When transitioning memory between encrypted and decrypted, the caller of set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring the memory isn't in use and isn't referenced while the transition is in progress. The transition has multiple steps, and includes interaction with the Hyper-V host. The memory is in an inconsistent state until all steps are complete. A reference while the state is inconsistent could result in an exception that can't be cleanly fixed up.

However, the kernel load_unaligned_zeropad() mechanism may make stray references that can't be prevented by the caller of set_memory_encrypted() or set_memory_decrypted(), so there's specific code in the #VC or #VE exception handler to fix up this case. But a CoCo VM running on Hyper-V may be configured to run with a paravisor, with the #VC or #VE exception routed to the paravisor. There's no architectural way to forward the exceptions back to the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code in the #VC/#VE handlers doesn't run.

To avoid this problem, the Hyper-V specific functions for notifying the hypervisor of the transition mark pages as "not present" while a transition is in progress. If load_unaligned_zeropad() causes a stray reference, a normal page fault is generated instead of #VC or #VE, and the page-fault-based handlers for load_unaligned_zeropad() fix up the reference. When the encrypted/decrypted transition is complete, the pages are marked as "present" again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().

diff --git a/Documentation/virt/hyperv/hibernation.rst b/Documentation/virt/hyperv/hibernation.rst
new file mode 100644
index 000000000000..4ff27f4a317a
--- /dev/null
+++ b/Documentation/virt/hyperv/hibernation.rst
@@ -0,0 +1,336 @@
.. SPDX-License-Identifier: GPL-2.0

Hibernating Guest VMs
=====================

Background
----------
Linux supports the ability to hibernate itself in order to save power. Hibernation is sometimes called suspend-to-disk, as it writes a memory image to disk and puts the hardware into the lowest possible power state. Upon resume from hibernation, the hardware is restarted and the memory image is restored from disk so that it can resume execution where it left off. See the "Hibernation" section of Documentation/admin-guide/pm/sleep-states.rst.

Hibernation is usually done on devices with a single user, such as a personal laptop. For example, the laptop goes into hibernation when the cover is closed, and resumes when the cover is opened again. Hibernation and resume happen on the same hardware, and the Linux kernel code orchestrating the hibernation steps assumes that the hardware configuration is not changed while in the hibernated state.

Hibernation can be initiated within Linux by writing "disk" to /sys/power/state or by invoking the reboot system call with the appropriate arguments. This functionality may be wrapped by user space commands such as "systemctl hibernate" that are run directly from a command line or in response to events such as the laptop lid closing.
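As an illustration of these two entry points, hibernation could be initiated programmatically with a user-space sketch like the following (must run as root; error handling is minimal)::

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/reboot.h>

  int main(void)
  {
          /* Preferred: write "disk" to /sys/power/state. */
          int fd = open("/sys/power/state", O_WRONLY);

          if (fd >= 0 && write(fd, "disk", 4) == 4)
                  return 0;

          /* Alternative: the reboot() system call with the
           * software-suspend command (LINUX_REBOOT_CMD_SW_SUSPEND,
           * exposed by glibc as RB_SW_SUSPEND). */
          return reboot(RB_SW_SUSPEND);
  }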
Considerations for Guest VM Hibernation
---------------------------------------
Linux guests on Hyper-V can also be hibernated, in which case the hardware is the virtual hardware provided by Hyper-V to the guest VM. Only the targeted guest VM is hibernated, while other guest VMs and the underlying Hyper-V host continue to run normally. While the underlying Windows Hyper-V and the physical hardware on which it is running might also be hibernated using hibernation functionality in the Windows host, host hibernation and its impact on guest VMs is not in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with physical hardware because VMs make it very easy to change the hardware configuration between hibernation and resume. Even when the resume is done on the same VM that hibernated, the memory size might be changed, or virtual NICs or SCSI controllers might be added or removed. Virtual PCI devices assigned to the VM might be added or removed. Most such changes cause the resume steps to fail, though adding a new virtual NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM can be moved to another newly created VM that otherwise has the same virtual hardware configuration. While it is desirable for resume from hibernation to succeed after such a move, there are challenges. See details on this scenario and its limitations in the "Resuming on a Different VM" section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to another. Hyper-V tries to ensure processor model and Hyper-V version compatibility using VM Configuration Versions, and prevents moves to a host that isn't compatible. Linux adapts to host and processor differences by detecting them at boot time, but such detection is not done when resuming execution in the hibernation image. If a VM is hibernated on one host, then resumed on a host with a different processor model or Hyper-V version, settings recorded in the hibernation image may not match the new host. Because Linux does not detect such mismatches when resuming the hibernation image, undefined behavior and failures could result.

Enabling Guest VM Hibernation
-----------------------------
Hibernation of a Hyper-V guest VM is disabled by default because hibernation is incompatible with memory hot-add, as provided by the Hyper-V balloon driver. If hot-add is used and the VM hibernates, it hibernates with more memory than it started with. But when the VM resumes from hibernation, Hyper-V gives the VM only the originally assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must enable the ACPI virtual S4 sleep state in the ACPI configuration that Hyper-V provides to the guest VM. Such enablement is accomplished by modifying a WMI property of the VM, the steps for which are outside the scope of this documentation but are available on the web. Enablement is treated as the indicator that the administrator prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V balloon driver in Linux disables hot-add. Enablement is indicated if the contents of /sys/power/disk contain "platform" as an option. The enablement is also visible in /sys/bus/vmbus/hibernation. See function hv_is_hibernation_supported().

Linux supports ACPI sleep states on x86, but not on arm64.
So Linux +guest VM hibernation is not available on Hyper-V for arm64. + +Initiating Guest VM Hibernation +------------------------------- +Guest VMs can self-initiate hibernation using the standard Linux +methods of writing "disk" to /sys/power/state or the reboot system +call. As an additional layer, Linux guests on Hyper-V support the +"Shutdown" integration service, via which a Hyper-V administrator can +tell a Linux VM to hibernate using a command outside the VM. The +command generates a request to the Hyper-V shutdown driver in Linux, +which sends the uevent "EVENT=hibernate". See kernel functions +shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule +must be provided in the VM that handles this event and initiates +hibernation. + +Handling VMBus Devices During Hibernation & Resume +-------------------------------------------------- +The VMBus bus driver, and the individual VMBus device drivers, +implement suspend and resume functions that are called as part of the +Linux orchestration of hibernation and of resuming from hibernation. +The overall approach is to leave in place the data structures for the +primary VMBus channels and their associated Linux devices, such as +SCSI controllers and others, so that they are captured in the +hibernation image. This approach allows any state associated with the +device to be persisted across the hibernation/resume. When the VM +resumes, the devices are re-offered by Hyper-V and are connected to +the data structures that already exist in the resumed hibernation +image. + +VMBus devices are identified by class and instance GUID. (See section +"VMBus device creation/deletion" in +Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation, +the resume functions expect that the devices offered by Hyper-V have +the same class/instance GUIDs as the devices present at the time of +hibernation. Having the same class/instance GUIDs allows the offered +devices to be matched to the primary VMBus channel data structures in +the memory of the now resumed hibernation image. If any devices are +offered that don't match primary VMBus channel data structures that +already exist, they are processed normally as newly added devices. If +primary VMBus channels that exist in the resumed hibernation image are +not matched with a device offered in the resumed VM, the resume +sequence waits for 10 seconds, then proceeds. But the unmatched device +is likely to cause errors in the resumed VM. + +When resuming existing primary VMBus channels, the newly offered +relids might be different because relids can change on each VM boot, +even if the VM configuration hasn't changed. The VMBus bus driver +resume function matches the class/instance GUIDs, and updates the +relids in case they have changed. + +VMBus sub-channels are not persisted in the hibernation image. Each +VMBus device driver's suspend function must close any sub-channels +prior to hibernation. Closing a sub-channel causes Hyper-V to send a +RESCIND_CHANNELOFFER message, which Linux processes by freeing the +channel data structures so that all vestiges of the sub-channel are +removed. By contrast, primary channels are marked closed and their +ring buffers are freed, but Hyper-V does not send a rescind message, +so the channel data structure continues to exist. Upon resume, the +device driver's resume function re-allocates the ring buffer and +re-opens the existing channel. It then communicates with Hyper-V to +re-open sub-channels from scratch. 
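The overall shape of a VMBus device driver's hibernation hooks, wired up through the .suspend and .resume members of the driver's struct hv_driver, is sketched below. All "example" names are hypothetical, the ring buffer size is arbitrary, and the real storvsc/netvsc implementations differ in detail::

  static void examplevsc_on_chan_callback(void *context)
  {
          /* Read and process packets from the "in" ring buffer. */
  }

  static int examplevsc_suspend(struct hv_device *dev)
  {
          /* First, close and free all sub-channels (a driver-specific
           * walk of its sub-channel list). Hyper-V rescinds each one,
           * so no trace of them lands in the hibernation image. */

          /* Then close the primary channel. Its channel data structure
           * is kept, and is captured in the image; only the ring
           * buffer is freed. */
          vmbus_close(dev->channel);
          return 0;
  }

  static int examplevsc_resume(struct hv_device *dev)
  {
          /* Re-allocate the ring buffer and re-open the primary
           * channel that VMBus resume matched by class/instance GUID.
           * The driver then renegotiates with the VSP and re-creates
           * any sub-channels from scratch. */
          return vmbus_open(dev->channel, 4 * PAGE_SIZE, 4 * PAGE_SIZE,
                            NULL, 0, examplevsc_on_chan_callback, dev);
  }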
The Linux ends of Hyper-V sockets are forced closed at the time of hibernation. The guest can't force closing the host end of the socket, but any host-side actions on the host end will produce an error.

VMBus devices use the same suspend function for the "freeze" and the "poweroff" phases, and the same resume function for the "thaw" and "restore" phases. See the "Entering Hibernation" section of Documentation/driver-api/pm/devices.rst for the sequencing of the phases.

Detailed Hibernation Sequence
-----------------------------
1. The Linux power management (PM) subsystem prepares for hibernation by freezing user space processes and allocating memory to hold the hibernation image.
2. As part of the "freeze" phase, Linux PM calls the "suspend" function for each VMBus device in turn. As described above, this function removes sub-channels, and leaves the primary channel in a closed state.
3. Linux PM calls the "suspend" function for the VMBus bus, which closes any Hyper-V socket channels and unloads the top-level VMBus connection with the Hyper-V host.
4. Linux PM disables non-boot CPUs, creates the hibernation image in the previously allocated memory, then re-enables non-boot CPUs. The hibernation image contains the memory data structures for the closed primary channels, but no sub-channels.
5. As part of the "thaw" phase, Linux PM calls the "resume" function for the VMBus bus, which re-establishes the top-level VMBus connection and requests that Hyper-V re-offer the VMBus devices. As offers are received for the primary channels, the relids are updated as previously described.
6. Linux PM calls the "resume" function for each VMBus device. Each device re-opens its primary channel, and communicates with Hyper-V to re-establish sub-channels if appropriate. The sub-channels are re-created as new channels since they were previously removed entirely in Step 2.
7. With VMBus devices now working again, Linux PM writes the hibernation image from memory to disk.
8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff" phase. VMBus channels are closed and the top-level VMBus connection is unloaded.
9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state S4. Hibernation is now complete.

Detailed Resume Sequence
------------------------
1. The guest VM boots into a fresh Linux OS instance. During boot, the top-level VMBus connection is established, and synthetic devices are enabled. This happens via the normal paths that don't involve hibernation.
2. Linux PM hibernation code reads swap space to find and read the hibernation image into memory. If there is no hibernation image, then this boot becomes a normal boot.
3. If this is a resume from hibernation, the "freeze" phase is used to shut down VMBus devices and unload the top-level VMBus connection in the running fresh OS instance, just like Steps 2 and 3 in the hibernation sequence.
4. Linux PM disables non-boot CPUs, and transfers control to the read-in hibernation image. In the now-running hibernation image, non-boot CPUs are restarted.
5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6 from the hibernation sequence. The top-level VMBus connection is re-established, and offers are received and matched to primary channels in the image. Relids are updated. VMBus device resume functions re-open primary channels and re-create sub-channels.
6.
Linux PM exits the hibernation resume sequence and the VM is now + running normally from the hibernation image. + +Key-Value Pair (KVP) Pseudo-Device Anomalies +-------------------------------------------- +The VMBus KVP device behaves differently from other pseudo-devices +offered by Hyper-V. When the KVP primary channel is closed, Hyper-V +sends a rescind message, which causes all vestiges of the device to be +removed. But Hyper-V then re-offers the device, causing it to be newly +re-created. The removal and re-creation occurs during the "freeze" +phase of hibernation, so the hibernation image contains the re-created +KVP device. Similar behavior occurs during the "freeze" phase of the +resume sequence while still in the fresh OS instance. But in both +cases, the top-level VMBus connection is subsequently unloaded, which +causes the device to be discarded on the Hyper-V side. So no harm is +done and everything still works. + +Virtual PCI devices +------------------- +Virtual PCI devices are physical PCI devices that are mapped directly +into the VM's physical address space so the VM can interact directly +with the hardware. vPCI devices include those accessed via what Hyper-V +calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC +Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst. + +Hyper-V DDA devices are offered to guest VMs after the top-level VMBus +connection is established, just like VMBus synthetic devices. They are +statically assigned to the VM, and their instance GUIDs don't change +unless the Hyper-V administrator makes changes to the configuration. +DDA devices are represented in Linux as virtual PCI devices that have +a VMBus identity as well as a PCI identity. Consequently, Linux guest +hibernation first handles DDA devices as VMBus devices in order to +manage the VMBus channel. But then they are also handled as PCI +devices using the hibernation functions implemented by their native +PCI driver. + +SR-IOV NIC VFs also have a VMBus identity as well as a PCI +identity, and overall are processed similarly to DDA devices. A +difference is that VFs are not offered to the VM during initial boot +of the VM. Instead, the VMBus synthetic NIC driver first starts +operating and communicates to Hyper-V that it is prepared to accept a +VF, and then the VF offer is made. However, the VMBus connection +might later be unloaded and then re-established without the VM being +rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation +Sequence above and in the Detailed Resume Sequence. In such a case, +the VFs likely became part of the VM during initial boot, so when the +VMBus connection is re-established, the VFs are offered on the +re-established connection without intervention by the synthetic NIC driver. + +UIO Devices +----------- +A VMBus device can be exposed to user space using the Hyper-V UIO +driver (uio_hv_generic.c) so that a user space driver can control and +operate the device. However, the VMBus UIO driver does not support the +suspend and resume operations needed for hibernation. If a VMBus +device is configured to use the UIO driver, hibernating the VM fails +and Linux continues to run normally. The most common use of the Hyper-V +UIO driver is for DPDK networking, but there are other uses as well. + +Resuming on a Different VM +-------------------------- +This scenario occurs in the Azure public cloud in that a hibernated +customer VM only exists as saved configuration and disks -- the VM no +longer exists on any Hyper-V host. 
When the customer VM is resumed, a new Hyper-V VM with identical configuration is created, likely on a different Hyper-V host. That new Hyper-V VM becomes the resumed customer VM, and the steps the Linux kernel takes to resume from the hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM, the Hyper-V-provided VMBus instance GUIDs of the disk controllers and other synthetic devices would typically be different. The difference would cause the resume from hibernation to fail, so several things are done to solve this problem:

* For VMBus synthetic devices that support only a single instance, Hyper-V always assigns the same instance GUIDs. For example, the Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo-device, etc., always have the same instance GUID, both for local Hyper-V installs as well as in the Azure cloud.

* VMBus synthetic SCSI controllers may have multiple instances in a VM, and in the general case instance GUIDs vary from VM to VM. However, Azure VMs always have exactly two synthetic SCSI controllers, and Azure code overrides the normal Hyper-V behavior so these controllers are always assigned the same two instance GUIDs. Consequently, when a customer VM is resumed on a newly created VM, the instance GUIDs match. But this guarantee does not hold for local Hyper-V installs.

* Similarly, VMBus synthetic NICs may have multiple instances in a VM, and the instance GUIDs vary from VM to VM. Again, Azure code overrides the normal Hyper-V behavior so that the instance GUID of a synthetic NIC in a customer VM does not change, even if the customer VM is deallocated or hibernated, and then re-constituted on a newly created VM. As with SCSI controllers, this behavior does not hold for local Hyper-V installs.

* vPCI devices do not have the same instance GUIDs when resuming from hibernation on a newly created VM. Consequently, Azure does not support hibernation for VMs that have DDA devices such as NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the VF from the VM before it hibernates so that the hibernation image does not contain a VF device. When the VM is resumed it instantiates a new VF, rather than trying to match against a VF that is present in the hibernation image. Because Azure must remove any VFs before initiating hibernation, Azure VM hibernation must be initiated externally from the Azure Portal or Azure CLI, which in turn uses the Shutdown integration service to tell Linux to do the hibernation. If hibernation is self-initiated within the Azure VM, VFs remain in the hibernation image, and are not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure that VMBus device instance GUIDs match on a new/different VM, allowing hibernation to work for most general-purpose Azure VM sizes. While similar special actions could be taken when resuming on a different VM on a local Hyper-V install, orchestrating such actions is not provided out-of-the-box by local Hyper-V and so requires custom scripting.

diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
new file mode 100644
index 000000000000..c84c40fd61c9
--- /dev/null
+++ b/Documentation/virt/hyperv/index.rst
@@ -0,0 +1,15 @@
.. SPDX-License-Identifier: GPL-2.0

======================
Hyper-V Enlightenments
======================

.. toctree::
   :maxdepth: 1

   overview
   vmbus
   clocks
   vpci
   hibernation
   coco

diff --git a/Documentation/virt/hyperv/overview.rst b/Documentation/virt/hyperv/overview.rst
new file mode 100644
index 000000000000..77408a89d1a4
--- /dev/null
+++ b/Documentation/virt/hyperv/overview.rst
@@ -0,0 +1,207 @@
.. SPDX-License-Identifier: GPL-2.0

Overview
========
The Linux kernel contains a variety of code for running as a fully enlightened guest on Microsoft's Hyper-V hypervisor. Hyper-V consists primarily of a bare-metal hypervisor plus a virtual machine management service running in the parent partition (roughly equivalent to KVM and QEMU, for example). Guest VMs run in child partitions. In this documentation, references to Hyper-V usually encompass both the hypervisor and the VMM service without making a distinction about which functionality is provided by which component.

Hyper-V runs on x86/x64 and arm64 architectures, and Linux guests are supported on both. The functionality and behavior of Hyper-V is generally the same on both architectures unless noted otherwise.

Linux Guest Communication with Hyper-V
--------------------------------------
Linux guests communicate with Hyper-V in four different ways:

* Implicit traps: As defined by the x86/x64 or arm64 architecture, some guest actions trap to Hyper-V. Hyper-V emulates the action and returns control to the guest. This behavior is generally invisible to the Linux kernel.

* Explicit hypercalls: Linux makes an explicit function call to Hyper-V, passing parameters. Hyper-V performs the requested action and returns control to the caller. Parameters are passed in processor registers or in memory shared between the Linux guest and Hyper-V. On x86/x64, hypercalls use a Hyper-V specific calling sequence. On arm64, hypercalls use the ARM standard SMCCC calling sequence.

* Synthetic register access: Hyper-V implements a variety of synthetic registers. On x86/x64 these registers appear as MSRs in the guest, and the Linux kernel can read or write these MSRs using the normal mechanisms defined by the x86/x64 architecture. On arm64, these synthetic registers must be accessed using explicit hypercalls.

* VMBus: VMBus is a higher-level software construct that is built on the other three mechanisms. It is a message passing interface between the Hyper-V host and the Linux guest. It uses memory that is shared between Hyper-V and the guest, along with various signaling mechanisms.

The first three communication mechanisms are documented in the `Hyper-V Top Level Functional Spec (TLFS)`_. The TLFS describes general Hyper-V functionality and provides details on the hypercalls and synthetic registers. The TLFS is currently written for the x86/x64 architecture only.

.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs

VMBus is not documented. This documentation provides a high-level overview of VMBus and how it works, but the details can be discerned only from the code.

Sharing Memory
--------------
Many aspects of communication between Hyper-V and Linux are based on sharing memory. Such sharing is generally accomplished as follows:

* Linux allocates memory from its physical address space using standard Linux mechanisms.

* Linux tells Hyper-V the guest physical address (GPA) of the allocated memory. Many shared areas are kept to 1 page so that a single GPA is sufficient.
  Larger shared areas require a list of GPAs, which usually do not need to be contiguous in the guest physical address space. How Hyper-V is told about the GPA or list of GPAs varies. In some cases, a single GPA is written to a synthetic register. In other cases, a GPA or list of GPAs is sent in a VMBus message.

* Hyper-V translates the GPAs into "real" physical memory addresses, and creates a virtual mapping that it can use to access the memory.

* Linux can later revoke sharing it has previously established by telling Hyper-V to set the shared GPA to zero.

Hyper-V operates with a page size of 4 Kbytes. GPAs communicated to Hyper-V may be in the form of page numbers, and always describe a range of 4 Kbytes. Since the Linux guest page size on x86/x64 is also 4 Kbytes, the mapping from guest page to Hyper-V page is 1-to-1. On arm64, Hyper-V supports guests with 4/16/64 Kbyte pages as defined by the arm64 architecture. If Linux is using 16 or 64 Kbyte pages, Linux code must be careful to communicate with Hyper-V only in terms of 4 Kbyte pages. HV_HYP_PAGE_SIZE and related macros are used in code that communicates with Hyper-V so that it works correctly in all configurations.

As described in the TLFS, a few memory pages shared between Hyper-V and the Linux guest are "overlay" pages. With overlay pages, Linux uses the usual approach of allocating guest memory and telling Hyper-V the GPA of the allocated memory. But Hyper-V then replaces that physical memory page with a page it has allocated, and the original physical memory page is no longer accessible in the guest VM. Linux may access the memory normally as if it were the memory that it originally allocated. The "overlay" behavior is visible only because the contents of the page (as seen by Linux) change at the time that Linux originally establishes the sharing and the overlay page is inserted. Similarly, the contents change if Linux revokes the sharing, in which case Hyper-V removes the overlay page, and the guest page originally allocated by Linux becomes visible again.

Before Linux does a kexec to a kdump kernel or any other kernel, memory shared with Hyper-V should be revoked. Hyper-V could modify a shared page or remove an overlay page after the new kernel is using the page for a different purpose, corrupting the new kernel. Hyper-V does not provide a single "set everything" operation to guest VMs, so Linux code must individually revoke all sharing before doing kexec. See hv_kexec_handler() and hv_crash_handler(). But the crash/panic path still has holes in cleanup because some shared pages are set using per-CPU synthetic registers and there's no mechanism to revoke the shared pages for CPUs other than the CPU running the panic path.

CPU Management
--------------
Hyper-V does not have the ability to hot-add or hot-remove a CPU from a running VM. However, Windows Server 2019 Hyper-V and earlier versions may provide guests with ACPI tables that indicate more CPUs than are actually present in the VM. As is normal, Linux treats these additional CPUs as potential hot-add CPUs, and reports them as such even though Hyper-V will never actually hot-add them. Starting in Windows Server 2022 Hyper-V, the ACPI tables reflect only the CPUs actually present in the VM, so Linux does not report any hot-add CPUs.

A Linux guest CPU may be taken offline using the normal Linux mechanisms, provided no VMBus channel interrupts are assigned to the CPU.
See the section on VMBus Interrupts for more details +on how VMBus channel interrupts can be re-assigned to permit +taking a CPU offline. + +32-bit and 64-bit +----------------- +On x86/x64, Hyper-V supports 32-bit and 64-bit guests, and Linux +will build and run in either version. While the 32-bit version is +expected to work, it is used rarely and may suffer from undetected +regressions. + +On arm64, Hyper-V supports only 64-bit guests. + +Endian-ness +----------- +All communication between Hyper-V and guest VMs uses Little-Endian +format on both x86/x64 and arm64. Big-endian format on arm64 is not +supported by Hyper-V, and Linux code does not use endian-ness macros +when accessing data shared with Hyper-V. + +Versioning +---------- +Current Linux kernels operate correctly with older versions of +Hyper-V back to Windows Server 2012 Hyper-V. Support for running +on the original Hyper-V release in Windows Server 2008/2008 R2 +has been removed. + +A Linux guest on Hyper-V outputs in dmesg the version of Hyper-V +it is running on. This version is in the form of a Windows build +number and is for display purposes only. Linux code does not +test this version number at runtime to determine available features +and functionality. Hyper-V indicates feature/function availability +via flags in synthetic MSRs that Hyper-V provides to the guest, +and the guest code tests these flags. + +VMBus has its own protocol version that is negotiated during the +initial VMBus connection from the guest to Hyper-V. This version +number is also output to dmesg during boot. This version number +is checked in a few places in the code to determine if specific +functionality is present. + +Furthermore, each synthetic device on VMBus also has a protocol +version that is separate from the VMBus protocol version. Device +drivers for these synthetic devices typically negotiate the device +protocol version, and may test that protocol version to determine +if specific device functionality is present. + +Code Packaging +-------------- +Hyper-V related code appears in the Linux kernel code tree in three +main areas: + +1. drivers/hv + +2. arch/x86/hyperv and arch/arm64/hyperv + +3. individual device driver areas such as drivers/scsi, drivers/net, + drivers/clocksource, etc. + +A few miscellaneous files appear elsewhere. See the full list under +"Hyper-V/Azure CORE AND DRIVERS" and "DRM DRIVER FOR HYPERV +SYNTHETIC VIDEO DEVICE" in the MAINTAINERS file. + +The code in #1 and #2 is built only when CONFIG_HYPERV is set. +Similarly, the code for most Hyper-V related drivers is built only +when CONFIG_HYPERV is set. + +Most Hyper-V related code in #1 and #3 can be built as a module. +The architecture specific code in #2 must be built-in. Also, +drivers/hv/hv_common.c is low-level code that is common across +architectures and must be built-in. diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst new file mode 100644 index 000000000000..654bb4849972 --- /dev/null +++ b/Documentation/virt/hyperv/vmbus.rst @@ -0,0 +1,346 @@ +.. SPDX-License-Identifier: GPL-2.0 + +VMBus +===== +VMBus is a software construct provided by Hyper-V to guest VMs. It +consists of a control path and common facilities used by synthetic +devices that Hyper-V presents to guest VMs. The control path is +used to offer synthetic devices to the guest VM and, in some cases, +to rescind those devices. 
The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other.

VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c) establishes the VMBus control path with the Hyper-V host, then registers itself as a Linux bus driver. It implements the standard bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux device driver. These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI controller, synthetic NIC, and PCI pass-thru devices. Other synthetic devices are limited to a single instance per VM. Not listed above are a small number of synthetic devices offered by Hyper-V that are used only by Windows guests and for which Linux does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic devices. "VSP" refers to the Hyper-V code that implements a particular synthetic device, while "VSC" refers to the driver for the device in the guest VM. For example, the Linux driver for the synthetic NIC is referred to as "netvsc" and the Linux driver for the synthetic SCSI controller is "storvsc". These drivers contain functions with names like "storvsc_connect_to_vsp".

VMBus channels
--------------
An instance of a synthetic device uses VMBus channels to communicate between the VSP and the VSC. Channels are bi-directional and used for passing messages. Most synthetic devices use a single channel, but the synthetic SCSI controller and synthetic NIC may use multiple channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring buffers from a university data structures textbook. If the read and write pointers are equal, the ring buffer is considered to be empty, so a full ring buffer always has at least one byte unused. The "in" ring buffer is for messages from the Hyper-V host to the guest, and the "out" ring buffer is for messages from the guest to the Hyper-V host. In Linux, the "in" and "out" designations are as viewed by the guest side. The ring buffers are memory that is shared between the guest and the host, and they follow the standard paradigm where the memory is allocated by the guest, with the list of GPAs that make up the ring buffer communicated to the host. Each ring buffer consists of a header page (4 Kbytes) with the read and write indices and some control flags, followed by the memory for the actual ring. The size of the ring is determined by the VSC in the guest and is specific to each synthetic device. The list of GPAs making up the ring is communicated to the Hyper-V host over the VMBus control path as a GPA Descriptor List (GPADL). See function vmbus_establish_gpadl().

Each ring buffer is mapped into contiguous Linux kernel virtual space in three parts: 1) the 4 Kbyte header page, 2) the memory that makes up the ring itself, and 3) a second mapping of the memory that makes up the ring itself. Because (2) and (3) are contiguous in kernel virtual space, the code that copies data to and from the ring buffer need not be concerned with ring buffer wrap-around. Once a copy operation has completed, the read or write index may need to be reset to point back into the first mapping, but the actual data copy does not need to be broken into two parts. This approach also allows complex data structures to be easily accessed directly in the ring without handling wrap-around.
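A sketch of why the second mapping helps: a copy that would cross the end of the ring can be done in one step, because the bytes at ring[ring_size + i] alias ring[i]. The helper below is illustrative, not a kernel API::

  /* Copy "len" bytes out of the ring starting at "read_index", where
   * the copy may extend past the end of the first mapping. */
  static void ring_copy_out(const u8 *ring, u32 ring_size,
                            u32 read_index, void *dst, u32 len)
  {
          memcpy(dst, ring + read_index, len);  /* no wrap-around split */
          /* The caller then advances the index back into the first
           * mapping: read_index = (read_index + len) % ring_size; */
  }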
On arm64 with page sizes > 4 Kbytes, the header page must still be passed to Hyper-V as a 4 Kbyte area. But the memory for the actual ring must be aligned to PAGE_SIZE and have a size that is a multiple of PAGE_SIZE so that the duplicate mapping trick can be done. Hence a portion of the header page is unused and not communicated to Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory that can be shared with the host via GPADLs. This limit ensures that a rogue guest can't force the consumption of excessive host resources. For Windows Server 2019 and later, this limit is approximately 1280 Mbytes. For versions prior to Windows Server 2019, the limit is approximately 384 Mbytes.

VMBus channel messages
----------------------
All messages sent in a VMBus channel have a standard header that includes the message length, the offset of the message payload, some flags, and a transactionID. The portion of the message after the header is unique to each VSP/VSC pair.

Messages follow one of two patterns:

* Unidirectional: Either side sends a message and does not expect a response message
* Request/response: One side (usually the guest) sends a message and expects a response

The transactionID (a.k.a. "requestID") is for matching requests and responses. Some synthetic devices allow multiple requests to be in-flight simultaneously, so the guest specifies a transactionID when sending a request. Hyper-V sends back the same transactionID in the matching response.

Messages passed between the VSP and VSC are control messages. For example, a message sent from the storvsc driver might be "execute this SCSI command". If a message also implies some data transfer between the guest and the Hyper-V host, the actual data to be transferred may be embedded with the control message, or it may be specified as a separate data buffer that the Hyper-V host will access as a DMA operation. The former case is used when the size of the data is small and the cost of copying the data to and from the ring buffer is minimal. For example, time sync messages from the Hyper-V host to the guest contain the actual time value. When the data is larger, a separate data buffer is used. In this case, the control message contains a list of GPAs that describe the data buffer. For example, the storvsc driver uses this approach to specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMBus channel messages:

1. vmbus_sendpacket(): Control-only messages and messages with embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with a list of GPAs identifying data to transfer. An offset and length are associated with each GPA so that multiple discontinuous areas of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with a list of GPAs identifying data to transfer. A single offset and length are associated with the list of GPAs. The GPAs must describe a single logical area of guest memory to be targeted.
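As an illustration of the first pattern, a request with a small payload embedded in the ring buffer might be sent as follows. The message layout and type value are hypothetical; the packet type and flag are from include/linux/hyperv.h::

  #include <linux/hyperv.h>

  /* Hypothetical control message; layouts are specific to each
   * VSP/VSC pair. Unused fields must be zeroed so no stray kernel
   * data leaks to the host through the shared ring buffer. */
  struct example_msg {
          u32 msg_type;
          u32 reserved;       /* zeroed by the designated initializer */
          u64 value;
  };

  static int example_send(struct vmbus_channel *chan, u64 value, u64 reqid)
  {
          struct example_msg msg = {
                  .msg_type = 1,          /* hypothetical type code */
                  .value    = value,
          };

          /* VM_PKT_DATA_INBAND embeds the payload in the "out" ring;
           * the completion flag asks the host for a response carrying
           * back the same requestid so it can be matched. */
          return vmbus_sendpacket(chan, &msg, sizeof(msg), reqid,
                                  VM_PKT_DATA_INBAND,
                                  VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
  }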
Historically, Linux guests have trusted Hyper-V to send well-formed and valid messages, and Linux drivers for synthetic devices did not fully validate messages. With the introduction of processor technologies that fully encrypt guest memory and that allow the guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting the Hyper-V host is no longer a valid assumption. The drivers for VMBus synthetic devices are being updated to fully validate any values read from memory that is shared with Hyper-V, which includes messages from VMBus devices. To facilitate such validation, messages read by the guest from the "in" ring buffer are copied to a temporary buffer that is not shared with Hyper-V. Validation is performed in this temporary buffer without the risk of Hyper-V maliciously modifying the message after it is validated but before it is used.

Synthetic Interrupt Controller (synic)
--------------------------------------
Hyper-V provides each guest CPU with a synthetic interrupt controller that is used by VMBus for host-guest communication. While each synic defines 16 synthetic interrupts (SINT), Linux uses only one of the 16 (VMBUS_MESSAGE_SINT). All interrupts related to communication between the Hyper-V host and a guest CPU use that SINT.

The SINT is mapped to a single per-CPU architectural interrupt (i.e., an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because each CPU in the guest has a synic and may receive VMBus interrupts, they are best modeled in Linux as per-CPU interrupts. This model works well on arm64 where a single per-CPU Linux IRQ is allocated for VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86 interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to call vmbus_isr(). In this case, there's no Linux IRQ, and the interrupts are visible in aggregate in /proc/interrupts on the "HYP" line.

The synic provides the means to demultiplex the architectural interrupt into one or more logical interrupts and route the logical interrupt to the proper VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and related functions that access synic data structures.

The synic is not modeled in Linux as an irq chip or irq domain, and the demultiplexed logical interrupts are not Linux IRQs. As such, they don't appear in /proc/interrupts or /proc/irq. The CPU affinity for one of these logical interrupts is controlled via an entry under /sys/bus/vmbus as described below.

VMBus interrupts
----------------
VMBus provides a mechanism for the guest to interrupt the host when the guest has queued new messages in a ring buffer. The host expects that the guest will send an interrupt only when an "out" ring buffer transitions from empty to non-empty. If the guest sends interrupts at other times, the host deems such interrupts to be unnecessary. If a guest sends an excessive number of unnecessary interrupts, the host may throttle that guest by suspending its execution for a few seconds to prevent a denial-of-service attack.

Similarly, the host will interrupt the guest via the synic when it sends a new message on the VMBus control path, or when a VMBus channel "in" ring buffer transitions from empty to non-empty due to the host inserting a new VMBus channel message.
+
+Synthetic Interrupt Controller (synic)
+--------------------------------------
+Hyper-V provides each guest CPU with a synthetic interrupt controller
+that is used by VMBus for host-guest communication. While each synic
+defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
+(VMBUS_MESSAGE_SINT). All interrupts related to communication between
+the Hyper-V host and a guest CPU use that SINT.
+
+The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
+an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
+each CPU in the guest has a synic and may receive VMBus interrupts,
+they are best modeled in Linux as per-CPU interrupts. This model works
+well on arm64 where a single per-CPU Linux IRQ is allocated for
+VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts with the
+label "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs,
+an x86 interrupt vector is statically allocated
+(HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
+call vmbus_isr(). In this case, there's no Linux IRQ, and the
+interrupts are visible in aggregate in /proc/interrupts on the "HYP"
+line.
+
+The synic provides the means to demultiplex the architectural
+interrupt into one or more logical interrupts and route each logical
+interrupt to the proper VMBus handler in Linux. This demultiplexing
+is done by vmbus_isr() and related functions that access synic data
+structures.
+
+The synic is not modeled in Linux as an irq chip or irq domain,
+and the demultiplexed logical interrupts are not Linux IRQs. As such,
+they don't appear in /proc/interrupts or /proc/irq. The CPU
+affinity for one of these logical interrupts is controlled via an
+entry under /sys/bus/vmbus as described below.
+
+VMBus interrupts
+----------------
+VMBus provides a mechanism for the guest to interrupt the host when
+the guest has queued new messages in a ring buffer. The host
+expects that the guest will send an interrupt only when an "out"
+ring buffer transitions from empty to non-empty. If the guest sends
+interrupts at other times, the host deems such interrupts to be
+unnecessary. If a guest sends an excessive number of unnecessary
+interrupts, the host may throttle that guest by suspending its
+execution for a few seconds to prevent a denial-of-service attack.
+
+Similarly, the host will interrupt the guest via the synic when
+it sends a new message on the VMBus control path, or when a VMBus
+channel "in" ring buffer transitions from empty to non-empty due to
+the host inserting a new VMBus channel message. The control message
+stream and each VMBus channel "in" ring buffer are separate logical
+interrupts that are demultiplexed by vmbus_isr(). vmbus_isr() first
+checks for channel interrupts by calling vmbus_chan_sched(), which
+looks at a synic bitmap to determine which channels have pending
+interrupts on this CPU. If multiple channels have pending interrupts
+for this CPU, they are processed sequentially. When all channel
+interrupts have been processed, vmbus_isr() checks for and processes
+any messages received on the VMBus control path.
+
+The guest CPU that a VMBus channel will interrupt is selected by the
+guest when the channel is created, and the host is informed of that
+selection. VMBus devices are broadly grouped into two categories:
+
+1. "Slow" devices that need only one VMBus channel. These devices
+   (such as keyboard, mouse, heartbeat, and timesync) generate
+   relatively few interrupts. Their VMBus channels are all
+   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
+   CPU 0.
+
+2. "High speed" devices that may use multiple VMBus channels for
+   higher parallelism and performance. These devices include the
+   synthetic SCSI controller and synthetic NIC. Their VMBus
+   channel interrupts are assigned to CPUs that are spread out
+   among the available CPUs in the VM so that interrupts on
+   multiple channels can be processed in parallel.
+
+The assignment of VMBus channel interrupts to CPUs is done in the
+function init_vp_index(). This assignment is done outside of the
+normal Linux interrupt affinity mechanism, so the interrupts are
+neither "unmanaged" nor "managed" interrupts.
+
+The CPU that a VMBus channel will interrupt can be seen in
+/sys/bus/vmbus/devices/<deviceGUID>/
+channels/<channelRelID>/cpu.
+When running on later versions of Hyper-V, the CPU can be changed
+by writing a new value to this sysfs entry. Because VMBus channel
+interrupts are not Linux IRQs, there are no entries in
+/proc/interrupts or /proc/irq corresponding to individual VMBus
+channel interrupts.
+
+Prior to kernel v6.15, an online CPU in a Linux guest could not be
+taken offline if it had VMBus channel interrupts assigned to it.
+Starting in kernel v6.15, any such interrupts are automatically
+reassigned to some other CPU at the time of offlining. The "other"
+CPU is chosen by the implementation and is not load balanced or
+otherwise intelligently determined. If the CPU is onlined again,
+channel interrupts previously assigned to it are not moved back. As
+a result, after multiple CPUs have been offlined, and perhaps
+onlined again, the interrupt-to-CPU mapping may be scrambled and
+non-optimal. In such a case, optimal assignments must be
+re-established manually. For kernels v6.14 and earlier, any
+conflicting channel interrupts must first be manually reassigned to
+another CPU as described above. Then when no channel interrupts are
+assigned to the CPU, it can be taken offline.
+
+The VMBus channel interrupt handling code is designed to work
+correctly even if an interrupt is received on a CPU other than the
+CPU assigned to the channel. Specifically, the code does not use
+CPU-based exclusion for correctness. In normal operation, Hyper-V
+will interrupt the assigned CPU. But when the CPU assigned to a
+channel is being changed via sysfs, the guest doesn't know exactly
+when Hyper-V will make the transition. The code must work correctly
+even if there is a time lag before Hyper-V starts interrupting the
+new CPU. See comments in target_cpu_store().
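+
+In simplified form, the demultiplexing done by vmbus_chan_sched(),
+described earlier in this section, amounts to scanning a per-CPU
+bitmap in the synic event page in which Hyper-V sets one bit per
+channel ("relid") that has a pending interrupt. The sketch below is
+illustrative C pseudocode, not the actual kernel code; the two
+helper functions are invented for the example::
+
+    /* Bitmap with one bit per channel relid; bits are set by
+     * Hyper-V and cleared by the guest as interrupts are claimed.
+     */
+    unsigned long *recv_int_page = this_cpu_event_flags(); /* invented */
+    u32 relid;
+
+    for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) {
+        /* Atomically claim this channel's pending interrupt */
+        if (!sync_test_and_clear_bit(relid, recv_int_page))
+            continue;
+        /* Map the relid to its channel and run the callback */
+        run_channel_callback(relid); /* invented */
+    }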
+
+VMBus device creation/deletion
+------------------------------
+Hyper-V and the Linux guest have a separate message-passing path
+that is used for synthetic device creation and deletion. This
+path does not use a VMBus channel. See vmbus_post_msg() and
+vmbus_on_msg_dpc().
+
+The first step is for the guest to connect to the generic
+Hyper-V VMBus mechanism. As part of establishing this connection,
+the guest and Hyper-V agree on a VMBus protocol version they will
+use. This negotiation allows newer Linux kernels to run on older
+Hyper-V versions, and vice versa.
+
+The guest then tells Hyper-V to "send offers". Hyper-V sends an
+offer message to the guest for each synthetic device that the VM
+is configured to have. Each VMBus device type has a fixed GUID
+known as the "class ID", and each VMBus device instance is also
+identified by a GUID. The offer message from Hyper-V contains
+both GUIDs to uniquely (within the VM) identify the device.
+There is one offer message for each device instance, so a VM with
+two synthetic NICs will get two offer messages with the NIC
+class ID. The ordering of offer messages can vary from boot to boot
+and must not be assumed to be consistent in Linux code. Offer
+messages may also arrive long after Linux has initially booted
+because Hyper-V supports adding devices, such as synthetic NICs,
+to running VMs. A new offer message is processed by
+vmbus_process_offer(), which indirectly invokes
+vmbus_add_channel_work().
+
+Upon receipt of an offer message, the guest identifies the device
+type based on the class ID, and invokes the correct driver to set up
+the device. Driver/device matching is performed using the standard
+Linux mechanism, as sketched in the example below.
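+
+A minimal VSC driver skeleton might look like the following. The
+GUID value, names, and empty function bodies are placeholders for
+illustration; struct hv_driver, struct hv_vmbus_device_id, and
+vmbus_driver_register()/vmbus_driver_unregister() are the real
+interfaces from include/linux/hyperv.h (prototypes as in recent
+kernels)::
+
+    #include <linux/hyperv.h>
+    #include <linux/module.h>
+
+    /* Placeholder class ID; a real driver uses its device's GUID */
+    static const struct hv_vmbus_device_id myvsc_id_table[] = {
+        { .guid = GUID_INIT(0x12345678, 0x1234, 0x1234, 0x12, 0x34,
+                            0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc) },
+        { },
+    };
+    MODULE_DEVICE_TABLE(vmbus, myvsc_id_table);
+
+    static int myvsc_probe(struct hv_device *dev,
+                           const struct hv_vmbus_device_id *id)
+    {
+        /* Open the primary channel, establish the ring buffer
+         * GPADL, and negotiate the device protocol version here.
+         */
+        return 0;
+    }
+
+    static void myvsc_remove(struct hv_device *dev)
+    {
+        /* Runs on unbind, or when Hyper-V rescinds the device */
+    }
+
+    static struct hv_driver myvsc_drv = {
+        .name = "myvsc",
+        .id_table = myvsc_id_table,
+        .probe = myvsc_probe,
+        .remove = myvsc_remove,
+    };
+
+    static int __init myvsc_init(void)
+    {
+        return vmbus_driver_register(&myvsc_drv);
+    }
+
+    static void __exit myvsc_exit(void)
+    {
+        vmbus_driver_unregister(&myvsc_drv);
+    }
+
+    module_init(myvsc_init);
+    module_exit(myvsc_exit);
+    MODULE_LICENSE("GPL");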
+
+The device driver probe function opens the primary VMBus channel to
+the corresponding VSP. It allocates guest memory for the channel
+ring buffers and shares the ring buffer with the Hyper-V host by
+giving the host a list of GPAs for the ring buffer memory. See
+vmbus_establish_gpadl().
+
+Once the ring buffer is set up, the device driver and VSP exchange
+setup messages via the primary channel. These messages may include
+negotiating the device protocol version to be used between the Linux
+VSC and the VSP on the Hyper-V host. The setup messages may also
+include creating additional VMBus channels, which are somewhat
+mis-named as "sub-channels" since they are functionally
+equivalent to the primary channel once they are created.
+
+Finally, the device driver may create entries in /dev as with
+any device driver.
+
+The Hyper-V host can send a "rescind" message to the guest to
+remove a device that was previously offered. Linux drivers must
+handle such a rescind message at any time. Rescinding a device
+invokes the device driver "remove" function to cleanly shut
+down the device and remove it. Once a synthetic device is
+rescinded, neither Hyper-V nor Linux retains any state about
+its previous existence. Such a device might be re-added later,
+in which case it is treated as an entirely new device. See
+vmbus_onoffer_rescind().
+
+For some devices, such as the KVP device, Hyper-V automatically
+sends a rescind message when the primary channel is closed,
+likely as a result of unbinding the device from its driver.
+The rescind causes Linux to remove the device. But then Hyper-V
+immediately reoffers the device to the guest, causing a new
+instance of the device to be created in Linux. For other
+devices, such as the synthetic SCSI and NIC devices, closing the
+primary channel does *not* result in Hyper-V sending a rescind
+message. The device continues to exist in Linux on the VMBus,
+but with no driver bound to it. The same driver or a new driver
+can subsequently be bound to the existing instance of the device.
diff --git a/Documentation/virt/hyperv/vpci.rst b/Documentation/virt/hyperv/vpci.rst
new file mode 100644
index 000000000000..b65b2126ede3
--- /dev/null
+++ b/Documentation/virt/hyperv/vpci.rst
@@ -0,0 +1,316 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+PCI pass-thru devices
+=====================
+In a Hyper-V guest VM, PCI pass-thru devices (also called
+virtual PCI devices, or vPCI devices) are physical PCI devices
+that are mapped directly into the VM's physical address space.
+Guest device drivers can interact directly with the hardware
+without intermediation by the host hypervisor. This approach
+provides higher bandwidth access to the device with lower
+latency, compared with devices that are virtualized by the
+hypervisor. The device should appear to the guest just as it
+would when running on bare metal, so no changes are required
+to the Linux device drivers for the device.
+
+Hyper-V terminology for vPCI devices is "Discrete Device
+Assignment" (DDA). Public documentation for Hyper-V DDA is
+available here: `DDA`_
+
+.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment
+
+DDA is typically used for storage controllers, such as NVMe,
+and for GPUs. A similar mechanism for NICs is called SR-IOV
+and produces the same benefits by allowing a guest device
+driver to interact directly with the hardware. See Hyper-V
+public documentation here: `SR-IOV`_
+
+.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-
+
+This discussion of vPCI devices includes DDA and SR-IOV
+devices.
+
+Device Presentation
+-------------------
+Hyper-V provides full PCI functionality for a vPCI device when
+it is operating, so the Linux device driver for the device can
+be used unchanged, provided it uses the correct Linux kernel
+APIs for accessing PCI config space and for other integration
+with Linux. But the initial detection of the PCI device and
+its integration with the Linux PCI subsystem must use
+Hyper-V-specific mechanisms. Consequently, vPCI devices on
+Hyper-V have a dual identity. They are initially presented to
+Linux guests as VMBus devices via the standard VMBus "offer"
+mechanism, so they have a VMBus identity and appear under
+/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
+drivers/pci/controller/pci-hyperv.c handles a newly introduced
+vPCI device by fabricating a PCI bus topology and creating all
+the normal PCI device data structures in Linux that would
+exist if the PCI device were discovered via ACPI on a
+bare-metal system. Once those data structures are set up, the
+device also has a normal PCI identity in Linux, and the normal
+Linux device driver for the vPCI device can function as if it
+were running in Linux on bare metal. Because vPCI devices are
+presented dynamically through the VMBus offer mechanism, they
+do not appear in the Linux guest's ACPI tables. vPCI devices
+may be added to a VM or removed from a VM at any time during
+the life of the VM, and not just during initial boot.
+
+With this approach, the vPCI device is a VMBus device and a
+PCI device at the same time. In response to the VMBus offer
+message, the hv_pci_probe() function runs and establishes a
+VMBus connection to the vPCI VSP on the Hyper-V host. That
+connection has a single VMBus channel. The channel is used to
+exchange messages with the vPCI VSP for the purpose of setting
+up and configuring the vPCI device in Linux. Once the device
+is fully configured in Linux as a PCI device, the VMBus
+channel is used only if Linux changes the guest vCPU that the
+device interrupts, or if the vPCI device is removed from
+the VM while the VM is running. The ongoing operation of the
+device happens directly between the Linux device driver for
+the device and the hardware, with VMBus and the VMBus channel
+playing no role.
+
+PCI Device Setup
+----------------
+PCI device setup follows a sequence that Hyper-V originally
+created for Windows guests, and that can be ill-suited for
+Linux guests due to differences in the overall structure of
+the Linux PCI subsystem compared with Windows. Nonetheless,
+with a bit of hackery in the Hyper-V virtual PCI driver for
+Linux, the virtual PCI device is set up in Linux so that
+generic Linux PCI subsystem code and the Linux driver for the
+device "just work".
+
+Each vPCI device is set up in Linux to be in its own PCI
+domain with a host bridge. The PCI domainID is derived from
+bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
+device. The Hyper-V host does not guarantee that these bytes
+are unique, so hv_pci_probe() has an algorithm to resolve
+collisions. The collision resolution is intended to be stable
+across reboots of the same VM so that the PCI domainIDs don't
+change, as the domainID appears in the user space
+configuration of some devices.
+
+hv_pci_probe() allocates a guest MMIO range to be used as PCI
+config space for the device. This MMIO range is communicated
+to the Hyper-V host over the VMBus channel as part of telling
+the host that the device is ready to enter the D0 power state.
+See hv_pci_enter_d0(). When the guest subsequently accesses
+this MMIO range, the Hyper-V host intercepts the accesses and
+maps them to the physical device PCI config space.
+
+hv_pci_probe() also gets BAR information for the device from
+the Hyper-V host, and uses this information to allocate MMIO
+space for the BARs. That MMIO space is then set up to be
+associated with the host bridge so that it works when generic
+PCI subsystem code in Linux processes the BARs.
+
+Finally, hv_pci_probe() creates the root PCI bus. At this
+point the Hyper-V virtual PCI driver hackery is done, and the
+normal Linux PCI machinery for scanning the root bus works to
+detect the device, to perform driver matching, and to
+initialize the driver and device.
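+
+The domainID derivation described earlier in this section amounts
+to building a 16-bit value from two bytes of the VMBus instance
+GUID. A sketch of the idea (the variable names are illustrative,
+not the exact kernel code)::
+
+    /* Candidate PCI domainID from bytes 4 and 5 of the instance
+     * GUID (guid_t is an array of 16 bytes). A collision-resolution
+     * step then finds a free domainID if this one is already used.
+     */
+    u16 dom_req = guid->b[5] << 8 | guid->b[4];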
+
+PCI Device Removal
+------------------
+A Hyper-V host may initiate removal of a vPCI device from a
+guest VM at any time during the life of the VM. The removal
+is instigated by an admin action taken on the Hyper-V host and
+is not under the control of the guest OS.
+
+A guest VM is notified of the removal by an unsolicited
+"Eject" message sent from the host to the guest over the VMBus
+channel associated with the vPCI device. Upon receipt of such
+a message, the Hyper-V virtual PCI driver in Linux
+asynchronously invokes Linux kernel PCI subsystem calls to
+shut down and remove the device. When those calls are
+complete, an "Ejection Complete" message is sent back to
+Hyper-V over the VMBus channel indicating that the device has
+been removed. At this point, Hyper-V sends a VMBus rescind
+message to the Linux guest, which the VMBus driver in Linux
+processes by removing the VMBus identity for the device. Once
+that processing is complete, all vestiges of the device having
+been present are gone from the Linux kernel. The rescind
+message also indicates to the guest that Hyper-V has stopped
+providing support for the vPCI device in the guest. If the
+guest were to attempt to access that device's MMIO space, it
+would be an invalid reference. Hypercalls affecting the device
+return errors, and any further messages sent in the VMBus
+channel are ignored.
+
+After sending the Eject message, Hyper-V allows the guest VM
+60 seconds to cleanly shut down the device and respond with
+Ejection Complete before sending the VMBus rescind
+message. If for any reason the Eject steps don't complete
+within the allowed 60 seconds, the Hyper-V host forcibly
+performs the rescind steps, which will likely result in
+cascading errors in the guest because the device is now no
+longer present from the guest standpoint and accessing the
+device MMIO space will fail.
+
+Because ejection is asynchronous and can happen at any point
+during the guest VM lifecycle, proper synchronization in the
+Hyper-V virtual PCI driver is very tricky. Ejection has been
+observed even before a newly offered vPCI device has been
+fully set up. The Hyper-V virtual PCI driver has been updated
+several times over the years to fix race conditions when
+ejections happen at inopportune times. Care must be taken when
+modifying this code to prevent re-introducing such problems.
+See comments in the code.
+
+Interrupt Assignment
+--------------------
+The Hyper-V virtual PCI driver supports vPCI devices using
+MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
+receive the interrupt for a particular MSI or MSI-X message is
+complex because of the way the Linux setup of IRQs maps onto
+the Hyper-V interfaces. For the single-MSI and MSI-X cases,
+Linux calls hv_compose_msi_msg() twice, with the first call
+containing a dummy vCPU and the second call containing the
+real vCPU. Finally, hv_irq_unmask() is called (on x86) or the
+GICD registers are set (on arm64) to specify the real vCPU
+again. Each of these three calls interacts with Hyper-V, which
+must decide which physical CPU should receive the interrupt
+before it is forwarded to the guest VM. Unfortunately, the
+Hyper-V decision-making process is a bit limited, and can
+result in concentrating the physical interrupts on a single
+CPU, causing a performance bottleneck. See details about how
+this is resolved in the extensive comment above the function
+hv_compose_msi_req_get_cpu().
+
+The Hyper-V virtual PCI driver implements the
+irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
+Unfortunately, on Hyper-V the implementation requires sending
+a VMBus message to the Hyper-V host and awaiting an interrupt
+indicating receipt of a reply message. Since
+irq_chip.irq_compose_msi_msg can be called with IRQ locks
+held, it doesn't work to do the normal sleep until awakened by
+the interrupt. Instead, hv_compose_msi_msg() must send the
+VMBus message, and then poll for the completion message. As
+further complexity, the vPCI device could be ejected/rescinded
+while the polling is in progress, so this scenario must be
+detected as well.
+See comments in the code regarding this
+very tricky area.
+
+Most of the code in the Hyper-V virtual PCI driver (pci-
+hyperv.c) applies to Hyper-V and Linux guests running on x86
+and on arm64 architectures. But there are differences in how
+interrupt assignments are managed. On x86, the Hyper-V
+virtual PCI driver in the guest must make a hypercall to tell
+Hyper-V which guest vCPU should be interrupted by each
+MSI/MSI-X interrupt, and the x86 interrupt vector number that
+the x86_vector IRQ domain has picked for the interrupt. This
+hypercall is made by hv_arch_irq_unmask(). On arm64, the
+Hyper-V virtual PCI driver manages the allocation of an SPI
+for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
+stores the allocated SPI in the architectural GICD registers,
+which Hyper-V emulates, so no hypercall is necessary as with
+x86. Hyper-V does not support using LPIs for vPCI devices in
+arm64 guest VMs because it does not emulate a GICv3 ITS.
+
+The Hyper-V virtual PCI driver in Linux supports vPCI devices
+whose drivers create managed or unmanaged Linux IRQs. If the
+smp_affinity for an unmanaged IRQ is updated via the /proc/irq
+interface, the Hyper-V virtual PCI driver is called to tell
+the Hyper-V host to change the interrupt targeting and
+everything works properly. However, on x86 if the x86_vector
+IRQ domain needs to reassign an interrupt vector due to
+running out of vectors on a CPU, there's no path to inform the
+Hyper-V host of the change, and things break. Fortunately,
+guest VMs operate in a constrained device environment where
+using all the vectors on a CPU doesn't happen. Since such a
+problem is only a theoretical concern rather than a practical
+concern, it has been left unaddressed.
+
+DMA
+---
+By default, Hyper-V pins all guest VM memory in the host
+when the VM is created, and programs the physical IOMMU to
+allow the VM to have DMA access to all its memory. Hence
+it is safe to assign PCI devices to the VM, and allow the
+guest operating system to program the DMA transfers. The
+physical IOMMU prevents a malicious guest from initiating
+DMA to memory belonging to the host or to other VMs on the
+host. From the Linux guest standpoint, such DMA transfers
+are in "direct" mode since Hyper-V does not provide a virtual
+IOMMU in the guest.
+
+Hyper-V assumes that physical PCI devices always perform
+cache-coherent DMA. When running on x86, this behavior is
+required by the architecture. When running on arm64, the
+architecture allows for both cache-coherent and
+non-cache-coherent devices, with the behavior of each device
+specified in the ACPI DSDT. But when a PCI device is assigned
+to a guest VM, that device does not appear in the DSDT, so the
+Hyper-V VMBus driver propagates cache-coherency information
+from the VMBus node in the ACPI DSDT to all VMBus devices,
+including vPCI devices (since they have a dual identity as a
+VMBus device and as a PCI device). See vmbus_dma_configure().
+Current Hyper-V versions always indicate that the VMBus is
+cache coherent, so vPCI devices on arm64 always get marked as
+cache coherent and the CPU does not perform any sync
+operations as part of dma_map/unmap_*() calls.
+
+vPCI protocol versions
+----------------------
+As previously described, during vPCI device setup and teardown,
+messages are passed over a VMBus channel between the Hyper-V
+host and the Hyper-V vPCI driver in the Linux guest.
+Some messages have been revised in newer versions of Hyper-V, so
+the guest and host must agree on the vPCI protocol version to
+be used. The version is negotiated when communication over
+the VMBus channel is first established. See
+hv_pci_protocol_negotiation(). Newer versions of the protocol
+extend support to VMs with more than 64 vCPUs, and provide
+additional information about the vPCI device, such as the
+guest virtual NUMA node to which it is most closely affined in
+the underlying hardware.
+
+Guest NUMA node affinity
+------------------------
+When the vPCI protocol version provides it, the guest NUMA
+node affinity of the vPCI device is stored as part of the Linux
+device information for subsequent use by the Linux driver. See
+hv_pci_assign_numa_node(). If the negotiated protocol version
+does not support the host providing NUMA affinity information,
+the Linux guest defaults the device NUMA node to 0. But even
+when the negotiated protocol version includes NUMA affinity
+information, the ability of the host to provide such
+information depends on certain host configuration options. If
+the guest receives NUMA node value "0", it could mean NUMA
+node 0, or it could mean "no information is available".
+Unfortunately, it is not possible to distinguish the two cases
+from the guest side.
+
+PCI config space access in a CoCo VM
+------------------------------------
+Linux PCI device drivers access PCI config space using a
+standard set of functions provided by the Linux PCI subsystem.
+In Hyper-V guests, these standard functions map to
+hv_pcifront_read_config() and hv_pcifront_write_config()
+in the Hyper-V virtual PCI driver. In normal VMs,
+these hv_pcifront_*() functions directly access the PCI config
+space, and the accesses trap to Hyper-V to be handled.
+But in CoCo VMs, memory encryption prevents Hyper-V
+from reading the guest instruction stream to emulate the
+access, so the hv_pcifront_*() functions must invoke
+hypercalls with explicit arguments describing the access to be
+made.
+
+Config Block back-channel
+-------------------------
+The Hyper-V host and Hyper-V virtual PCI driver in Linux
+together implement a non-standard back-channel communication
+path between the host and guest. The back-channel path uses
+messages sent over the VMBus channel associated with the vPCI
+device. The functions hyperv_read_cfg_blk() and
+hyperv_write_cfg_blk() are the primary interfaces provided to
+other parts of the Linux kernel. As of this writing, these
+interfaces are used only by the Mellanox mlx5 driver to pass
+diagnostic data to a Hyper-V host running in the Azure public
+cloud. The functions hyperv_read_cfg_blk() and
+hyperv_write_cfg_blk() are implemented in a separate module
+(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
+effectively stubs them out when running in non-Hyper-V
+environments.
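+
+The following sketch shows how a driver might call the back-channel
+read interface. The block ID and buffer size are made-up example
+values; valid block IDs are defined by the host-side agent for a
+specific device. The function prototype is the one declared in
+include/linux/hyperv.h in recent kernels::
+
+    #include <linux/hyperv.h>
+    #include <linux/pci.h>
+
+    /* "pdev" is assumed to be the struct pci_dev for a vPCI
+     * device that the caller already holds.
+     */
+    u8 buf[128];
+    unsigned int bytes_returned;
+    int ret;
+
+    /* Block ID 1 is a made-up example value */
+    ret = hyperv_read_cfg_blk(pdev, buf, sizeof(buf), 1,
+                              &bytes_returned);
+    if (ret == 0)
+        pr_info("read %u bytes from config block\n", bytes_returned);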