Diffstat (limited to 'Documentation/virt/hyperv'):

 Documentation/virt/hyperv/clocks.rst      |  82 +
 Documentation/virt/hyperv/coco.rst        | 260 +
 Documentation/virt/hyperv/hibernation.rst | 336 +
 Documentation/virt/hyperv/index.rst       |  15 +
 Documentation/virt/hyperv/overview.rst    | 207 +
 Documentation/virt/hyperv/vmbus.rst       | 346 +
 Documentation/virt/hyperv/vpci.rst        | 316 +
 7 files changed, 1562 insertions(+), 0 deletions(-)
diff --git a/Documentation/virt/hyperv/clocks.rst b/Documentation/virt/hyperv/clocks.rst new file mode 100644 index 000000000000..176043265803 --- /dev/null +++ b/Documentation/virt/hyperv/clocks.rst @@ -0,0 +1,82 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Clocks and Timers +================= + +arm64 +----- +On arm64, Hyper-V virtualizes the ARMv8 architectural system counter +and timer. Guest VMs use this virtualized hardware as the Linux +clocksource and clockevents via the standard arm_arch_timer.c +driver, just as they would on bare metal. Linux vDSO support for the +architectural system counter is functional in guest VMs on Hyper-V. +While Hyper-V also provides a synthetic system clock and four synthetic +per-CPU timers as described in the TLFS, they are not used by the +Linux kernel in a Hyper-V guest on arm64. However, older versions +of Hyper-V for arm64 only partially virtualize the ARMv8 +architectural timer, such that the timer does not generate +interrupts in the VM. Because of this limitation, running current +Linux kernel versions on these older Hyper-V versions requires an +out-of-tree patch to use the Hyper-V synthetic clocks/timers instead. + +x86/x64 +------- +On x86/x64, Hyper-V provides guest VMs with a synthetic system clock +and four synthetic per-CPU timers as described in the TLFS. Hyper-V +also provides access to the virtualized TSC via the RDTSC and +related instructions. These TSC instructions do not trap to +the hypervisor and so provide excellent performance in a VM. +Hyper-V performs TSC calibration, and provides the TSC frequency +to the guest VM via a synthetic MSR. Hyper-V initialization code +in Linux reads this MSR to get the frequency, so it skips TSC +calibration and sets tsc_reliable. Hyper-V provides virtualized +versions of the PIT (in Hyper-V Generation 1 VMs only), local +APIC timer, and RTC. Hyper-V does not provide a virtualized HPET in +guest VMs. + +The Hyper-V synthetic system clock can be read via a synthetic MSR, +but this access traps to the hypervisor. As a faster alternative, +the guest can configure a memory page to be shared between the guest +and the hypervisor. Hyper-V populates this memory page with a +64-bit scale value and offset value. To read the synthetic clock +value, the guest reads the TSC and then applies the scale and offset +as described in the Hyper-V TLFS. The resulting value advances +at a constant 10 MHz frequency. In the case of a live migration +to a host with a different TSC frequency, Hyper-V adjusts the +scale and offset values in the shared page so that the 10 MHz +frequency is maintained. + +Starting with Windows Server 2022 Hyper-V, Hyper-V uses hardware +support for TSC frequency scaling to enable live migration of VMs +across Hyper-V hosts where the TSC frequency may be different. +When a Linux guest detects that this Hyper-V functionality is +available, it prefers to use Linux's standard TSC-based clocksource. +Otherwise, it uses the clocksource for the Hyper-V synthetic system +clock implemented via the shared page (identified as +"hyperv_clocksource_tsc_page"). + +The Hyper-V synthetic system clock is available to user space via +vDSO, and gettimeofday() and related system calls can execute +entirely in user space. The vDSO is implemented by mapping the +shared page with scale and offset values into user space. User +space code performs the same algorithm of reading the TSC and +applying the scale and offset to get the constant 10 MHz clock. 
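For illustration, the read algorithm has roughly the following shape. This is a simplified sketch, not the kernel's implementation (which is in drivers/clocksource/hyperv_timer.c); the structure and field names are abbreviated from the TLFS definition::

  struct hv_ref_tsc_page {            /* simplified */
          u32 tsc_sequence;           /* 0 means the page is not valid */
          u32 reserved;
          u64 tsc_scale;
          s64 tsc_offset;
  };

  /* Returns time in 100 ns units (the 10 MHz clock), or 0 if the page
   * is invalid and the caller must fall back to the synthetic MSR. */
  static u64 read_hv_reference_time(const struct hv_ref_tsc_page *tp)
  {
          u32 seq;
          u64 time;

          do {
                  seq = READ_ONCE(tp->tsc_sequence);
                  if (seq == 0)
                          return 0;
                  /* High 64 bits of the 128-bit product, plus the offset */
                  time = mul_u64_u64_shr(rdtsc(), tp->tsc_scale, 64) +
                         tp->tsc_offset;
          } while (READ_ONCE(tp->tsc_sequence) != seq);

          return time;
  }

The sequence re-check guards against Hyper-V updating the scale and offset (for example, during a live migration) in the middle of a read.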
+ +Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0). +While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses +timer 0. In older versions of Hyper-V, an interrupt from stimer0 +results in a VMBus control message that is demultiplexed by +vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst +documentation. In newer versions of Hyper-V, stimer0 interrupts can +be mapped to an architectural interrupt, which is referred to as +"Direct Mode". Linux prefers to use Direct Mode when available. Since +x86/x64 doesn't support per-CPU interrupts, Direct Mode statically +allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs +and explicitly codes it to call the stimer0 interrupt handler. Hence +interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts +rather than being associated with a Linux IRQ. Clockevents based on the +virtualized PIT and local APIC timer also work, but Hyper-V stimer0 +is preferred. + +The driver for the Hyper-V synthetic system clock and timers is +drivers/clocksource/hyperv_timer.c. diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst new file mode 100644 index 000000000000..c15d6fe34b4e --- /dev/null +++ b/Documentation/virt/hyperv/coco.rst @@ -0,0 +1,260 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Confidential Computing VMs +========================== +Hyper-V can create and run Linux guests that are Confidential Computing +(CoCo) VMs. Such VMs cooperate with the physical processor to better protect +the confidentiality and integrity of data in the VM's memory, even in the +face of a hypervisor/VMM that has been compromised and may behave maliciously. +CoCo VMs on Hyper-V share the generic CoCo VM threat model and security +objectives described in Documentation/security/snp-tdx-threat-model.rst. Note +that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or +"isolation VMs". + +A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the +following: + +* Physical hardware with a processor that supports CoCo VMs + +* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs + +* The VM runs a version of Linux that supports being a CoCo VM + +The physical hardware requirements are as follows: + +* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME, + SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo + VM on Hyper-V. + +* Intel processor with TDX + +To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V +when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM, +or vice versa, after it is created. + +Operational Modes +----------------- +Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is +created and cannot be changed during the life of the VM. + +* Fully-enlightened mode. In this mode, the guest operating system is + enlightened to understand and manage all aspects of running as a CoCo VM. + +* Paravisor mode. In this mode, a paravisor layer between the guest and the + host provides some operations needed to run as a CoCo VM. The guest operating + system can have fewer CoCo enlightenments than is required in the + fully-enlightened case. + +Conceptually, fully-enlightened mode and paravisor mode may be treated as +points on a spectrum spanning the degree of guest enlightenment needed to run +as a CoCo VM. Fully-enlightened mode is one end of the spectrum. 
A full implementation of paravisor mode is the other end of the spectrum, where all aspects of running as a CoCo VM are handled by the paravisor, and a normal guest OS with no knowledge of memory encryption or other aspects of CoCo VMs can run successfully. However, the Hyper-V implementation of paravisor mode does not go this far, and is somewhere in the middle of the spectrum. Some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects. Unfortunately, there is no standardized enumeration of the features/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the features/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS.

Paravisor mode has similarities to the `Coconut project`_, which aims to provide a limited paravisor offering services to the guest such as a virtual TPM. However, the Hyper-V paravisor generally handles more aspects of CoCo VMs than is currently envisioned for Coconut, and so is further toward the "no guest enlightenments required" end of the spectrum.

.. _Coconut project: https://github.com/coconut-svsm/svsm

In the CoCo VM threat model, the paravisor is in the guest security domain and must be trusted by the guest OS. By implication, the hypervisor/VMM must protect itself against a potentially malicious paravisor just like it protects against a potentially malicious guest.

The hardware architectural approach to fully-enlightened vs. paravisor mode varies depending on the underlying processor.

* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in VMPL 0 and has full control of the guest context. In paravisor mode, the guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have. Certain operations require the guest to invoke the paravisor. Furthermore, in paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode as defined by the SEV-SNP architecture. This mode simplifies guest management of memory encryption when a paravisor is used.

* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the L1 VM, and the guest OS runs in a nested L2 VM.

Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This MSR indicates whether the underlying processor uses AMD SEV-SNP or Intel TDX, and whether a paravisor is being used. It is straightforward to build a single kernel image that can boot and run properly on either architecture, and in either mode.

Paravisor Effects
-----------------
Running in paravisor mode affects the following areas of generic Linux kernel CoCo VM functionality:

* Initial guest memory setup. When a new VM is created in paravisor mode, the paravisor runs first and sets up the guest physical memory as encrypted. The guest Linux does normal memory initialization, except for explicitly marking appropriate ranges as decrypted (shared). In paravisor mode, Linux does not perform the early boot memory setup steps that are particularly tricky with AMD SEV-SNP in fully-enlightened mode.

* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM, respectively, and not the guest Linux. Consequently, these exception handlers do not run in the guest Linux and are not a required enlightenment for a Linux guest in paravisor mode.
* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the guest indicating that the VM is operating with the respective hardware support. While these CPUID flags are visible in fully-enlightened CoCo VMs, the paravisor filters out these flags and the guest Linux does not see them. Throughout the Linux kernel, explicit testing of these flags has mostly been eliminated in favor of the cc_platform_has() function, with the goal of abstracting the differences between SEV-SNP and TDX (a sketch of this pattern follows the "Hyper-V Hypercalls" section below). The cc_platform_has() abstraction also allows the Hyper-V paravisor configuration to selectively enable aspects of CoCo VM functionality even when the CPUID flags are not set. The exception is early boot memory setup on SEV-SNP, which tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor mode VM achieves the desired effect of not running SEV-SNP specific early boot memory setup.

* Device emulation. In paravisor mode, the Hyper-V paravisor provides emulation of devices such as the IO-APIC and TPM. Because the emulation happens in the paravisor in the guest context (instead of the hypervisor/VMM context), MMIO accesses to these devices must be encrypted references instead of the decrypted references that would be used in a fully-enlightened CoCo VM. The __ioremap_caller() function has been enhanced to make a callback to check whether a particular address range should be treated as encrypted (private). See the "is_private_mmio" callback.

* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest memory between encrypted and decrypted requires coordinating with the hypervisor/VMM. This is done via callbacks invoked from __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V specific set of callbacks is used. These callbacks invoke the paravisor so that the paravisor can coordinate the transitions and inform the hypervisor as necessary. See hv_vtom_init() where these callbacks are set up.

* Interrupt injection. In fully-enlightened mode, a malicious hypervisor could inject interrupts into the guest OS at times that violate x86/x64 architectural rules. For full protection, the guest OS should include enlightenments that use the interrupt injection management features provided by CoCo-capable processors. In paravisor mode, the paravisor mediates interrupt injection into the guest OS, and ensures that the guest OS only sees interrupts that are "legal". The paravisor uses the interrupt injection management features provided by the CoCo-capable physical processor, thereby masking these complexities from the guest OS.

Hyper-V Hypercalls
------------------
When in fully-enlightened mode, hypercalls made by the Linux guest are routed directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode, normal hypercalls trap to the paravisor first, which may in turn invoke the hypervisor. The paravisor is idiosyncratic in this regard, and a few hypercalls made by the Linux guest must always be routed directly to the hypervisor. These hypercall sites test for a paravisor being present, and use a special invocation sequence. See hv_post_message(), for example.
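Referring back to the CPUID flags discussion above, kernel code gates CoCo-specific behavior on cc_platform_has() attribute queries rather than on the vendor CPUID flags. A minimal sketch of the pattern (the attribute chosen and the surrounding function are illustrative)::

  #include <linux/cc_platform.h>

  static void example_coco_setup(void)
  {
          /* True in a CoCo VM even when a paravisor has filtered out
           * the SEV-SNP/TDX CPUID flags; hv_vtom_init() arranges for
           * the right attributes to be reported in paravisor mode. */
          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
                  /* Guest memory is private (encrypted) by default;
                   * anything shared with the host must be explicitly
                   * transitioned to decrypted. */
          }
  }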
+ +Guest communication with Hyper-V +-------------------------------- +Separate from the generic Linux kernel handling of memory encryption in Linux +CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory +shared between the Linux guest and the host. This shared memory must be +marked decrypted to enable communication. Furthermore, since the threat model +includes a compromised and potentially malicious host, the guest must guard +against leaking any unintended data to the host through this shared memory. + +These Hyper-V and VMBus memory pages are marked as decrypted: + +* VMBus monitor pages + +* Synthetic interrupt controller (synic) related pages (unless supplied by + the paravisor) + +* Per-cpu hypercall input and output pages (unless running with a paravisor) + +* VMBus ring buffers. The direct mapping is marked decrypted in + __vmbus_establish_gpadl(). The secondary mapping created in + hv_ringbuffer_init() must also include the "decrypted" attribute. + +When the guest writes data to memory that is shared with the host, it must +ensure that only the intended data is written. Padding or unused fields must +be initialized to zeros before copying into the shared memory so that random +kernel data is not inadvertently given to the host. + +Similarly, when the guest reads memory that is shared with the host, it must +validate the data before acting on it so that a malicious host cannot induce +the guest to expose unintended data. Doing such validation can be tricky +because the host can modify the shared memory areas even while or after +validation is performed. For messages passed from the host to the guest in a +VMBus ring buffer, the length of the message is validated, and the message is +copied into a temporary (encrypted) buffer for further validation and +processing. The copying adds a small amount of overhead, but is the only way +to protect against a malicious host. See hv_pkt_iter_first(). + +Many drivers for VMBus devices have been "hardened" by adding code to fully +validate messages received over VMBus, instead of assuming that Hyper-V is +acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the +vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a +CoCo VM have not been hardened, and they are not allowed to load in a CoCo +VM. See vmbus_is_valid_offer() where such devices are excluded. + +Two VMBus devices depend on the Hyper-V host to do DMA data transfers: +storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal +Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb +memory is done implicitly. netvsc has two modes for data transfers. The first +mode goes through send and receive buffer space that is explicitly allocated +by the netvsc driver, and is used for most smaller packets. These send and +receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because +the netvsc driver explicitly copies packets to/from these buffers, the +equivalent of bounce buffering between encrypted and decrypted memory is +already part of the data path. The second mode uses the normal Linux kernel +DMA APIs, and is bounce buffered through swiotlb memory implicitly like in +storvsc. + +Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM. +Linux PCI device drivers access PCI config space using standard APIs provided +by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO +space, and the access traps to Hyper-V for emulation. 
But in CoCo VMs, memory encryption prevents Hyper-V from reading the guest instruction stream to emulate the access. So in a CoCo VM, these functions must make a hypercall with arguments explicitly describing the access. See _hv_pcifront_read_config() and _hv_pcifront_write_config() and the "use_calls" flag indicating to use hypercalls.

load_unaligned_zeropad()
------------------------
When transitioning memory between encrypted and decrypted, the caller of set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring the memory isn't in use and isn't referenced while the transition is in progress. The transition has multiple steps, and includes interaction with the Hyper-V host. The memory is in an inconsistent state until all steps are complete. A reference while the state is inconsistent could result in an exception that can't be cleanly fixed up.

However, the kernel load_unaligned_zeropad() mechanism may make stray references that can't be prevented by the caller of set_memory_encrypted() or set_memory_decrypted(), so there's specific code in the #VC or #VE exception handler to fix up this case. But a CoCo VM running on Hyper-V may be configured to run with a paravisor, with the #VC or #VE exception routed to the paravisor. There's no architectural way to forward the exceptions back to the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code in the #VC/#VE handlers doesn't run.

To avoid this problem, the Hyper-V specific functions for notifying the hypervisor of the transition mark pages as "not present" while a transition is in progress. If load_unaligned_zeropad() causes a stray reference, a normal page fault is generated instead of #VC or #VE, and the page-fault-based handlers for load_unaligned_zeropad() fix up the reference. When the encrypted/decrypted transition is complete, the pages are marked as "present" again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().

diff --git a/Documentation/virt/hyperv/hibernation.rst b/Documentation/virt/hyperv/hibernation.rst
new file mode 100644
index 000000000000..4ff27f4a317a
--- /dev/null
+++ b/Documentation/virt/hyperv/hibernation.rst
@@ -0,0 +1,336 @@
.. SPDX-License-Identifier: GPL-2.0

Hibernating Guest VMs
=====================

Background
----------
Linux supports the ability to hibernate itself in order to save power. Hibernation is sometimes called suspend-to-disk, as it writes a memory image to disk and puts the hardware into the lowest possible power state. Upon resume from hibernation, the hardware is restarted and the memory image is restored from disk so that it can resume execution where it left off. See the "Hibernation" section of Documentation/admin-guide/pm/sleep-states.rst.

Hibernation is usually done on devices with a single user, such as a personal laptop. For example, the laptop goes into hibernation when the cover is closed, and resumes when the cover is opened again. Hibernation and resume happen on the same hardware, and the Linux kernel code orchestrating the hibernation steps assumes that the hardware configuration is not changed while in the hibernated state.

Hibernation can be initiated within Linux by writing "disk" to /sys/power/state or by invoking the reboot system call with the appropriate arguments. This functionality may be wrapped by user space commands such as "systemctl hibernate" that are run directly from a command line or in response to events such as the laptop lid closing.
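As an illustration of these two entry points, hibernation could be initiated programmatically with a user-space sketch like the following (must run as root; error handling is minimal)::

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/reboot.h>

  int main(void)
  {
          /* Preferred: write "disk" to /sys/power/state. */
          int fd = open("/sys/power/state", O_WRONLY);

          if (fd >= 0 && write(fd, "disk", 4) == 4)
                  return 0;

          /* Alternative: the reboot() system call with the
           * software-suspend command (LINUX_REBOOT_CMD_SW_SUSPEND,
           * exposed by glibc as RB_SW_SUSPEND). */
          return reboot(RB_SW_SUSPEND);
  }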
Considerations for Guest VM Hibernation
---------------------------------------
Linux guests on Hyper-V can also be hibernated, in which case the hardware is the virtual hardware provided by Hyper-V to the guest VM. Only the targeted guest VM is hibernated, while other guest VMs and the underlying Hyper-V host continue to run normally. While the underlying Windows Hyper-V and the physical hardware on which it is running might also be hibernated using hibernation functionality in the Windows host, host hibernation and its impact on guest VMs is not in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with physical hardware because VMs make it very easy to change the hardware configuration between hibernation and resume. Even when the resume is done on the same VM that hibernated, the memory size might be changed, or virtual NICs or SCSI controllers might be added or removed. Virtual PCI devices assigned to the VM might be added or removed. Most such changes cause the resume steps to fail, though adding a new virtual NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM can be moved to another newly created VM that otherwise has the same virtual hardware configuration. While it is desirable for resume from hibernation to succeed after such a move, there are challenges. See details on this scenario and its limitations in the "Resuming on a Different VM" section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to another. Hyper-V tries to ensure processor model and Hyper-V version compatibility using VM Configuration Versions, and prevents moves to a host that isn't compatible. Linux adapts to host and processor differences by detecting them at boot time, but such detection is not done when resuming execution in the hibernation image. If a VM is hibernated on one host, then resumed on a host with a different processor model or Hyper-V version, settings recorded in the hibernation image may not match the new host. Because Linux does not detect such mismatches when resuming the hibernation image, undefined behavior and failures could result.

Enabling Guest VM Hibernation
-----------------------------
Hibernation of a Hyper-V guest VM is disabled by default because hibernation is incompatible with memory hot-add, as provided by the Hyper-V balloon driver. If hot-add is used and the VM hibernates, it hibernates with more memory than it started with. But when the VM resumes from hibernation, Hyper-V gives the VM only the originally assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must enable the ACPI virtual S4 sleep state in the ACPI configuration that Hyper-V provides to the guest VM. Such enablement is accomplished by modifying a WMI property of the VM, the steps for which are outside the scope of this documentation but are available on the web. Enablement is treated as the indicator that the administrator prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V balloon driver in Linux disables hot-add. Enablement is indicated if the contents of /sys/power/disk contain "platform" as an option. The enablement is also visible in /sys/bus/vmbus/hibernation. See function hv_is_hibernation_supported().

Linux supports ACPI sleep states on x86, but not on arm64.
So Linux +guest VM hibernation is not available on Hyper-V for arm64. + +Initiating Guest VM Hibernation +------------------------------- +Guest VMs can self-initiate hibernation using the standard Linux +methods of writing "disk" to /sys/power/state or the reboot system +call. As an additional layer, Linux guests on Hyper-V support the +"Shutdown" integration service, via which a Hyper-V administrator can +tell a Linux VM to hibernate using a command outside the VM. The +command generates a request to the Hyper-V shutdown driver in Linux, +which sends the uevent "EVENT=hibernate". See kernel functions +shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule +must be provided in the VM that handles this event and initiates +hibernation. + +Handling VMBus Devices During Hibernation & Resume +-------------------------------------------------- +The VMBus bus driver, and the individual VMBus device drivers, +implement suspend and resume functions that are called as part of the +Linux orchestration of hibernation and of resuming from hibernation. +The overall approach is to leave in place the data structures for the +primary VMBus channels and their associated Linux devices, such as +SCSI controllers and others, so that they are captured in the +hibernation image. This approach allows any state associated with the +device to be persisted across the hibernation/resume. When the VM +resumes, the devices are re-offered by Hyper-V and are connected to +the data structures that already exist in the resumed hibernation +image. + +VMBus devices are identified by class and instance GUID. (See section +"VMBus device creation/deletion" in +Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation, +the resume functions expect that the devices offered by Hyper-V have +the same class/instance GUIDs as the devices present at the time of +hibernation. Having the same class/instance GUIDs allows the offered +devices to be matched to the primary VMBus channel data structures in +the memory of the now resumed hibernation image. If any devices are +offered that don't match primary VMBus channel data structures that +already exist, they are processed normally as newly added devices. If +primary VMBus channels that exist in the resumed hibernation image are +not matched with a device offered in the resumed VM, the resume +sequence waits for 10 seconds, then proceeds. But the unmatched device +is likely to cause errors in the resumed VM. + +When resuming existing primary VMBus channels, the newly offered +relids might be different because relids can change on each VM boot, +even if the VM configuration hasn't changed. The VMBus bus driver +resume function matches the class/instance GUIDs, and updates the +relids in case they have changed. + +VMBus sub-channels are not persisted in the hibernation image. Each +VMBus device driver's suspend function must close any sub-channels +prior to hibernation. Closing a sub-channel causes Hyper-V to send a +RESCIND_CHANNELOFFER message, which Linux processes by freeing the +channel data structures so that all vestiges of the sub-channel are +removed. By contrast, primary channels are marked closed and their +ring buffers are freed, but Hyper-V does not send a rescind message, +so the channel data structure continues to exist. Upon resume, the +device driver's resume function re-allocates the ring buffer and +re-opens the existing channel. It then communicates with Hyper-V to +re-open sub-channels from scratch. 
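The overall shape of a VMBus device driver's hibernation hooks, wired up through the .suspend and .resume members of the driver's struct hv_driver, is sketched below. All "example" names are hypothetical, the ring buffer size is arbitrary, and the real storvsc/netvsc implementations differ in detail::

  static void examplevsc_on_chan_callback(void *context)
  {
          /* Read and process packets from the "in" ring buffer. */
  }

  static int examplevsc_suspend(struct hv_device *dev)
  {
          /* First, close and free all sub-channels (a driver-specific
           * walk of its sub-channel list). Hyper-V rescinds each one,
           * so no trace of them lands in the hibernation image. */

          /* Then close the primary channel. Its channel data structure
           * is kept, and is captured in the image; only the ring
           * buffer is freed. */
          vmbus_close(dev->channel);
          return 0;
  }

  static int examplevsc_resume(struct hv_device *dev)
  {
          /* Re-allocate the ring buffer and re-open the primary
           * channel that VMBus resume matched by class/instance GUID.
           * The driver then renegotiates with the VSP and re-creates
           * any sub-channels from scratch. */
          return vmbus_open(dev->channel, 4 * PAGE_SIZE, 4 * PAGE_SIZE,
                            NULL, 0, examplevsc_on_chan_callback, dev);
  }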
The Linux ends of Hyper-V sockets are forced closed at the time of hibernation. The guest can't force closing the host end of the socket, but any host-side actions on the host end will produce an error.

VMBus devices use the same suspend function for the "freeze" and the "poweroff" phases, and the same resume function for the "thaw" and "restore" phases. See the "Entering Hibernation" section of Documentation/driver-api/pm/devices.rst for the sequencing of the phases.

Detailed Hibernation Sequence
-----------------------------
1. The Linux power management (PM) subsystem prepares for hibernation by freezing user space processes and allocating memory to hold the hibernation image.
2. As part of the "freeze" phase, Linux PM calls the "suspend" function for each VMBus device in turn. As described above, this function removes sub-channels, and leaves the primary channel in a closed state.
3. Linux PM calls the "suspend" function for the VMBus bus, which closes any Hyper-V socket channels and unloads the top-level VMBus connection with the Hyper-V host.
4. Linux PM disables non-boot CPUs, creates the hibernation image in the previously allocated memory, then re-enables non-boot CPUs. The hibernation image contains the memory data structures for the closed primary channels, but no sub-channels.
5. As part of the "thaw" phase, Linux PM calls the "resume" function for the VMBus bus, which re-establishes the top-level VMBus connection and requests that Hyper-V re-offer the VMBus devices. As offers are received for the primary channels, the relids are updated as previously described.
6. Linux PM calls the "resume" function for each VMBus device. Each device re-opens its primary channel, and communicates with Hyper-V to re-establish sub-channels if appropriate. The sub-channels are re-created as new channels since they were previously removed entirely in Step 2.
7. With VMBus devices now working again, Linux PM writes the hibernation image from memory to disk.
8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff" phase. VMBus channels are closed and the top-level VMBus connection is unloaded.
9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state S4. Hibernation is now complete.

Detailed Resume Sequence
------------------------
1. The guest VM boots into a fresh Linux OS instance. During boot, the top-level VMBus connection is established, and synthetic devices are enabled. This happens via the normal paths that don't involve hibernation.
2. Linux PM hibernation code reads swap space to find and read the hibernation image into memory. If there is no hibernation image, then this boot becomes a normal boot.
3. If this is a resume from hibernation, the "freeze" phase is used to shut down VMBus devices and unload the top-level VMBus connection in the running fresh OS instance, just like Steps 2 and 3 in the hibernation sequence.
4. Linux PM disables non-boot CPUs, and transfers control to the read-in hibernation image. In the now-running hibernation image, non-boot CPUs are restarted.
5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6 from the hibernation sequence. The top-level VMBus connection is re-established, and offers are received and matched to primary channels in the image. Relids are updated. VMBus device resume functions re-open primary channels and re-create sub-channels.
6.
Linux PM exits the hibernation resume sequence and the VM is now + running normally from the hibernation image. + +Key-Value Pair (KVP) Pseudo-Device Anomalies +-------------------------------------------- +The VMBus KVP device behaves differently from other pseudo-devices +offered by Hyper-V. When the KVP primary channel is closed, Hyper-V +sends a rescind message, which causes all vestiges of the device to be +removed. But Hyper-V then re-offers the device, causing it to be newly +re-created. The removal and re-creation occurs during the "freeze" +phase of hibernation, so the hibernation image contains the re-created +KVP device. Similar behavior occurs during the "freeze" phase of the +resume sequence while still in the fresh OS instance. But in both +cases, the top-level VMBus connection is subsequently unloaded, which +causes the device to be discarded on the Hyper-V side. So no harm is +done and everything still works. + +Virtual PCI devices +------------------- +Virtual PCI devices are physical PCI devices that are mapped directly +into the VM's physical address space so the VM can interact directly +with the hardware. vPCI devices include those accessed via what Hyper-V +calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC +Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst. + +Hyper-V DDA devices are offered to guest VMs after the top-level VMBus +connection is established, just like VMBus synthetic devices. They are +statically assigned to the VM, and their instance GUIDs don't change +unless the Hyper-V administrator makes changes to the configuration. +DDA devices are represented in Linux as virtual PCI devices that have +a VMBus identity as well as a PCI identity. Consequently, Linux guest +hibernation first handles DDA devices as VMBus devices in order to +manage the VMBus channel. But then they are also handled as PCI +devices using the hibernation functions implemented by their native +PCI driver. + +SR-IOV NIC VFs also have a VMBus identity as well as a PCI +identity, and overall are processed similarly to DDA devices. A +difference is that VFs are not offered to the VM during initial boot +of the VM. Instead, the VMBus synthetic NIC driver first starts +operating and communicates to Hyper-V that it is prepared to accept a +VF, and then the VF offer is made. However, the VMBus connection +might later be unloaded and then re-established without the VM being +rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation +Sequence above and in the Detailed Resume Sequence. In such a case, +the VFs likely became part of the VM during initial boot, so when the +VMBus connection is re-established, the VFs are offered on the +re-established connection without intervention by the synthetic NIC driver. + +UIO Devices +----------- +A VMBus device can be exposed to user space using the Hyper-V UIO +driver (uio_hv_generic.c) so that a user space driver can control and +operate the device. However, the VMBus UIO driver does not support the +suspend and resume operations needed for hibernation. If a VMBus +device is configured to use the UIO driver, hibernating the VM fails +and Linux continues to run normally. The most common use of the Hyper-V +UIO driver is for DPDK networking, but there are other uses as well. + +Resuming on a Different VM +-------------------------- +This scenario occurs in the Azure public cloud in that a hibernated +customer VM only exists as saved configuration and disks -- the VM no +longer exists on any Hyper-V host. 
When the customer VM is resumed, a new Hyper-V VM with identical configuration is created, likely on a different Hyper-V host. That new Hyper-V VM becomes the resumed customer VM, and the steps the Linux kernel takes to resume from the hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM, the Hyper-V-provided VMBus instance GUIDs of the disk controllers and other synthetic devices would typically be different. The difference would cause the resume from hibernation to fail, so several things are done to solve this problem:

* For VMBus synthetic devices that support only a single instance, Hyper-V always assigns the same instance GUIDs. For example, the Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo-device, etc., always have the same instance GUID, both for local Hyper-V installs as well as in the Azure cloud.

* VMBus synthetic SCSI controllers may have multiple instances in a VM, and in the general case instance GUIDs vary from VM to VM. However, Azure VMs always have exactly two synthetic SCSI controllers, and Azure code overrides the normal Hyper-V behavior so these controllers are always assigned the same two instance GUIDs. Consequently, when a customer VM is resumed on a newly created VM, the instance GUIDs match. But this guarantee does not hold for local Hyper-V installs.

* Similarly, VMBus synthetic NICs may have multiple instances in a VM, and the instance GUIDs vary from VM to VM. Again, Azure code overrides the normal Hyper-V behavior so that the instance GUID of a synthetic NIC in a customer VM does not change, even if the customer VM is deallocated or hibernated, and then re-constituted on a newly created VM. As with SCSI controllers, this behavior does not hold for local Hyper-V installs.

* vPCI devices do not have the same instance GUIDs when resuming from hibernation on a newly created VM. Consequently, Azure does not support hibernation for VMs that have DDA devices such as NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the VF from the VM before it hibernates so that the hibernation image does not contain a VF device. When the VM is resumed it instantiates a new VF, rather than trying to match against a VF that is present in the hibernation image. Because Azure must remove any VFs before initiating hibernation, Azure VM hibernation must be initiated externally from the Azure Portal or Azure CLI, which in turn uses the Shutdown integration service to tell Linux to do the hibernation. If hibernation is self-initiated within the Azure VM, VFs remain in the hibernation image, and are not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure that VMBus device instance GUIDs match on a new/different VM, allowing hibernation to work for most general-purpose Azure VM sizes. While similar special actions could be taken when resuming on a different VM on a local Hyper-V install, orchestrating such actions is not provided out-of-the-box by local Hyper-V and so requires custom scripting.

diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
new file mode 100644
index 000000000000..c84c40fd61c9
--- /dev/null
+++ b/Documentation/virt/hyperv/index.rst
@@ -0,0 +1,15 @@
.. SPDX-License-Identifier: GPL-2.0

======================
Hyper-V Enlightenments
======================

.. toctree::
   :maxdepth: 1

   overview
   vmbus
   clocks
   vpci
   hibernation
   coco

diff --git a/Documentation/virt/hyperv/overview.rst b/Documentation/virt/hyperv/overview.rst
new file mode 100644
index 000000000000..77408a89d1a4
--- /dev/null
+++ b/Documentation/virt/hyperv/overview.rst
@@ -0,0 +1,207 @@
.. SPDX-License-Identifier: GPL-2.0

Overview
========
The Linux kernel contains a variety of code for running as a fully enlightened guest on Microsoft's Hyper-V hypervisor. Hyper-V consists primarily of a bare-metal hypervisor plus a virtual machine management service running in the parent partition (roughly equivalent to KVM and QEMU, for example). Guest VMs run in child partitions. In this documentation, references to Hyper-V usually encompass both the hypervisor and the VMM service without making a distinction about which functionality is provided by which component.

Hyper-V runs on x86/x64 and arm64 architectures, and Linux guests are supported on both. The functionality and behavior of Hyper-V is generally the same on both architectures unless noted otherwise.

Linux Guest Communication with Hyper-V
--------------------------------------
Linux guests communicate with Hyper-V in four different ways:

* Implicit traps: As defined by the x86/x64 or arm64 architecture, some guest actions trap to Hyper-V. Hyper-V emulates the action and returns control to the guest. This behavior is generally invisible to the Linux kernel.

* Explicit hypercalls: Linux makes an explicit function call to Hyper-V, passing parameters. Hyper-V performs the requested action and returns control to the caller. Parameters are passed in processor registers or in memory shared between the Linux guest and Hyper-V. On x86/x64, hypercalls use a Hyper-V specific calling sequence. On arm64, hypercalls use the ARM standard SMCCC calling sequence.

* Synthetic register access: Hyper-V implements a variety of synthetic registers. On x86/x64 these registers appear as MSRs in the guest, and the Linux kernel can read or write these MSRs using the normal mechanisms defined by the x86/x64 architecture. On arm64, these synthetic registers must be accessed using explicit hypercalls.

* VMBus: VMBus is a higher-level software construct that is built on the other three mechanisms. It is a message passing interface between the Hyper-V host and the Linux guest. It uses memory that is shared between Hyper-V and the guest, along with various signaling mechanisms.

The first three communication mechanisms are documented in the `Hyper-V Top Level Functional Spec (TLFS)`_. The TLFS describes general Hyper-V functionality and provides details on the hypercalls and synthetic registers. The TLFS is currently written for the x86/x64 architecture only.

.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs

VMBus is not documented. This documentation provides a high-level overview of VMBus and how it works, but the details can be discerned only from the code.

Sharing Memory
--------------
Many aspects of communication between Hyper-V and Linux are based on sharing memory. Such sharing is generally accomplished as follows:

* Linux allocates memory from its physical address space using standard Linux mechanisms.

* Linux tells Hyper-V the guest physical address (GPA) of the allocated memory. Many shared areas are kept to 1 page so that a single GPA is sufficient.
  Larger shared areas require a list of GPAs, which usually do not need to be contiguous in the guest physical address space. How Hyper-V is told about the GPA or list of GPAs varies. In some cases, a single GPA is written to a synthetic register. In other cases, a GPA or list of GPAs is sent in a VMBus message.

* Hyper-V translates the GPAs into "real" physical memory addresses, and creates a virtual mapping that it can use to access the memory.

* Linux can later revoke sharing it has previously established by telling Hyper-V to set the shared GPA to zero.

Hyper-V operates with a page size of 4 Kbytes. GPAs communicated to Hyper-V may be in the form of page numbers, and always describe a range of 4 Kbytes. Since the Linux guest page size on x86/x64 is also 4 Kbytes, the mapping from guest page to Hyper-V page is 1-to-1. On arm64, Hyper-V supports guests with 4/16/64 Kbyte pages as defined by the arm64 architecture. If Linux is using 16 or 64 Kbyte pages, Linux code must be careful to communicate with Hyper-V only in terms of 4 Kbyte pages. HV_HYP_PAGE_SIZE and related macros are used in code that communicates with Hyper-V so that it works correctly in all configurations.

As described in the TLFS, a few memory pages shared between Hyper-V and the Linux guest are "overlay" pages. With overlay pages, Linux uses the usual approach of allocating guest memory and telling Hyper-V the GPA of the allocated memory. But Hyper-V then replaces that physical memory page with a page it has allocated, and the original physical memory page is no longer accessible in the guest VM. Linux may access the memory normally as if it were the memory that it originally allocated. The "overlay" behavior is visible only because the contents of the page (as seen by Linux) change at the time that Linux originally establishes the sharing and the overlay page is inserted. Similarly, the contents change if Linux revokes the sharing, in which case Hyper-V removes the overlay page, and the guest page originally allocated by Linux becomes visible again.

Before Linux does a kexec to a kdump kernel or any other kernel, memory shared with Hyper-V should be revoked. Hyper-V could modify a shared page or remove an overlay page after the new kernel is using the page for a different purpose, corrupting the new kernel. Hyper-V does not provide a single "set everything" operation to guest VMs, so Linux code must individually revoke all sharing before doing kexec. See hv_kexec_handler() and hv_crash_handler(). But the crash/panic path still has holes in cleanup because some shared pages are set using per-CPU synthetic registers and there's no mechanism to revoke the shared pages for CPUs other than the CPU running the panic path.

CPU Management
--------------
Hyper-V does not have the ability to hot-add or hot-remove a CPU from a running VM. However, Windows Server 2019 Hyper-V and earlier versions may provide guests with ACPI tables that indicate more CPUs than are actually present in the VM. As is normal, Linux treats these additional CPUs as potential hot-add CPUs, and reports them as such even though Hyper-V will never actually hot-add them. Starting in Windows Server 2022 Hyper-V, the ACPI tables reflect only the CPUs actually present in the VM, so Linux does not report any hot-add CPUs.

A Linux guest CPU may be taken offline using the normal Linux mechanisms, provided no VMBus channel interrupts are assigned to the CPU.
See the section on VMBus Interrupts for more details +on how VMBus channel interrupts can be re-assigned to permit +taking a CPU offline. + +32-bit and 64-bit +----------------- +On x86/x64, Hyper-V supports 32-bit and 64-bit guests, and Linux +will build and run in either version. While the 32-bit version is +expected to work, it is used rarely and may suffer from undetected +regressions. + +On arm64, Hyper-V supports only 64-bit guests. + +Endian-ness +----------- +All communication between Hyper-V and guest VMs uses Little-Endian +format on both x86/x64 and arm64. Big-endian format on arm64 is not +supported by Hyper-V, and Linux code does not use endian-ness macros +when accessing data shared with Hyper-V. + +Versioning +---------- +Current Linux kernels operate correctly with older versions of +Hyper-V back to Windows Server 2012 Hyper-V. Support for running +on the original Hyper-V release in Windows Server 2008/2008 R2 +has been removed. + +A Linux guest on Hyper-V outputs in dmesg the version of Hyper-V +it is running on. This version is in the form of a Windows build +number and is for display purposes only. Linux code does not +test this version number at runtime to determine available features +and functionality. Hyper-V indicates feature/function availability +via flags in synthetic MSRs that Hyper-V provides to the guest, +and the guest code tests these flags. + +VMBus has its own protocol version that is negotiated during the +initial VMBus connection from the guest to Hyper-V. This version +number is also output to dmesg during boot. This version number +is checked in a few places in the code to determine if specific +functionality is present. + +Furthermore, each synthetic device on VMBus also has a protocol +version that is separate from the VMBus protocol version. Device +drivers for these synthetic devices typically negotiate the device +protocol version, and may test that protocol version to determine +if specific device functionality is present. + +Code Packaging +-------------- +Hyper-V related code appears in the Linux kernel code tree in three +main areas: + +1. drivers/hv + +2. arch/x86/hyperv and arch/arm64/hyperv + +3. individual device driver areas such as drivers/scsi, drivers/net, + drivers/clocksource, etc. + +A few miscellaneous files appear elsewhere. See the full list under +"Hyper-V/Azure CORE AND DRIVERS" and "DRM DRIVER FOR HYPERV +SYNTHETIC VIDEO DEVICE" in the MAINTAINERS file. + +The code in #1 and #2 is built only when CONFIG_HYPERV is set. +Similarly, the code for most Hyper-V related drivers is built only +when CONFIG_HYPERV is set. + +Most Hyper-V related code in #1 and #3 can be built as a module. +The architecture specific code in #2 must be built-in. Also, +drivers/hv/hv_common.c is low-level code that is common across +architectures and must be built-in. diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst new file mode 100644 index 000000000000..654bb4849972 --- /dev/null +++ b/Documentation/virt/hyperv/vmbus.rst @@ -0,0 +1,346 @@ +.. SPDX-License-Identifier: GPL-2.0 + +VMBus +===== +VMBus is a software construct provided by Hyper-V to guest VMs. It +consists of a control path and common facilities used by synthetic +devices that Hyper-V presents to guest VMs. The control path is +used to offer synthetic devices to the guest VM and, in some cases, +to rescind those devices. 
The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other.

VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c) establishes the VMBus control path with the Hyper-V host, then registers itself as a Linux bus driver. It implements the standard bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux device driver. These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI controller, synthetic NIC, and PCI pass-thru devices. Other synthetic devices are limited to a single instance per VM. Not listed above are a small number of synthetic devices offered by Hyper-V that are used only by Windows guests and for which Linux does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic devices. "VSP" refers to the Hyper-V code that implements a particular synthetic device, while "VSC" refers to the driver for the device in the guest VM. For example, the Linux driver for the synthetic NIC is referred to as "netvsc" and the Linux driver for the synthetic SCSI controller is "storvsc". These drivers contain functions with names like "storvsc_connect_to_vsp".

VMBus channels
--------------
An instance of a synthetic device uses VMBus channels to communicate between the VSP and the VSC. Channels are bi-directional and used for passing messages. Most synthetic devices use a single channel, but the synthetic SCSI controller and synthetic NIC may use multiple channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring buffers from a university data structures textbook. If the read and write pointers are equal, the ring buffer is considered to be empty, so a full ring buffer always has at least one byte unused. The "in" ring buffer is for messages from the Hyper-V host to the guest, and the "out" ring buffer is for messages from the guest to the Hyper-V host. In Linux, the "in" and "out" designations are as viewed by the guest side. The ring buffers are memory that is shared between the guest and the host, and they follow the standard paradigm where the memory is allocated by the guest, with the list of GPAs that make up the ring buffer communicated to the host. Each ring buffer consists of a header page (4 Kbytes) with the read and write indices and some control flags, followed by the memory for the actual ring. The size of the ring is determined by the VSC in the guest and is specific to each synthetic device. The list of GPAs making up the ring is communicated to the Hyper-V host over the VMBus control path as a GPA Descriptor List (GPADL). See function vmbus_establish_gpadl().

Each ring buffer is mapped into contiguous Linux kernel virtual space in three parts: 1) the 4 Kbyte header page, 2) the memory that makes up the ring itself, and 3) a second mapping of the memory that makes up the ring itself. Because (2) and (3) are contiguous in kernel virtual space, the code that copies data to and from the ring buffer need not be concerned with ring buffer wrap-around. Once a copy operation has completed, the read or write index may need to be reset to point back into the first mapping, but the actual data copy does not need to be broken into two parts. This approach also allows complex data structures to be easily accessed directly in the ring without handling wrap-around.
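A sketch of why the second mapping helps: a copy that would cross the end of the ring can be done in one step, because the bytes at ring[ring_size + i] alias ring[i]. The helper below is illustrative, not a kernel API::

  /* Copy "len" bytes out of the ring starting at "read_index", where
   * the copy may extend past the end of the first mapping. */
  static void ring_copy_out(const u8 *ring, u32 ring_size,
                            u32 read_index, void *dst, u32 len)
  {
          memcpy(dst, ring + read_index, len);  /* no wrap-around split */
          /* The caller then advances the index back into the first
           * mapping: read_index = (read_index + len) % ring_size; */
  }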
On arm64 with page sizes > 4 Kbytes, the header page must still be passed to Hyper-V as a 4 Kbyte area. But the memory for the actual ring must be aligned to PAGE_SIZE and have a size that is a multiple of PAGE_SIZE so that the duplicate mapping trick can be done. Hence a portion of the header page is unused and not communicated to Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory that can be shared with the host via GPADLs. This limit ensures that a rogue guest can't force the consumption of excessive host resources. For Windows Server 2019 and later, this limit is approximately 1280 Mbytes. For versions prior to Windows Server 2019, the limit is approximately 384 Mbytes.

VMBus channel messages
----------------------
All messages sent in a VMBus channel have a standard header that includes the message length, the offset of the message payload, some flags, and a transactionID. The portion of the message after the header is unique to each VSP/VSC pair.

Messages follow one of two patterns:

* Unidirectional: Either side sends a message and does not expect a response message
* Request/response: One side (usually the guest) sends a message and expects a response

The transactionID (a.k.a. "requestID") is for matching requests and responses. Some synthetic devices allow multiple requests to be in-flight simultaneously, so the guest specifies a transactionID when sending a request. Hyper-V sends back the same transactionID in the matching response.

Messages passed between the VSP and VSC are control messages. For example, a message sent from the storvsc driver might be "execute this SCSI command". If a message also implies some data transfer between the guest and the Hyper-V host, the actual data to be transferred may be embedded with the control message, or it may be specified as a separate data buffer that the Hyper-V host will access as a DMA operation. The former case is used when the size of the data is small and the cost of copying the data to and from the ring buffer is minimal. For example, time sync messages from the Hyper-V host to the guest contain the actual time value. When the data is larger, a separate data buffer is used. In this case, the control message contains a list of GPAs that describe the data buffer. For example, the storvsc driver uses this approach to specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMBus channel messages:

1. vmbus_sendpacket(): Control-only messages and messages with embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with a list of GPAs identifying data to transfer. An offset and length are associated with each GPA so that multiple discontinuous areas of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with a list of GPAs identifying data to transfer. A single offset and length are associated with the list of GPAs. The GPAs must describe a single logical area of guest memory to be targeted.
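As an illustration of the first pattern, a request with a small payload embedded in the ring buffer might be sent as follows. The message layout and type value are hypothetical; the packet type and flag are from include/linux/hyperv.h::

  #include <linux/hyperv.h>

  /* Hypothetical control message; layouts are specific to each
   * VSP/VSC pair. Unused fields must be zeroed so no stray kernel
   * data leaks to the host through the shared ring buffer. */
  struct example_msg {
          u32 msg_type;
          u32 reserved;       /* zeroed by the designated initializer */
          u64 value;
  };

  static int example_send(struct vmbus_channel *chan, u64 value, u64 reqid)
  {
          struct example_msg msg = {
                  .msg_type = 1,          /* hypothetical type code */
                  .value    = value,
          };

          /* VM_PKT_DATA_INBAND embeds the payload in the "out" ring;
           * the completion flag asks the host for a response carrying
           * back the same requestid so it can be matched. */
          return vmbus_sendpacket(chan, &msg, sizeof(msg), reqid,
                                  VM_PKT_DATA_INBAND,
                                  VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
  }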
Historically, Linux guests have trusted Hyper-V to send well-formed and valid messages, and Linux drivers for synthetic devices did not fully validate messages. With the introduction of processor technologies that fully encrypt guest memory and that allow the guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting the Hyper-V host is no longer a valid assumption. The drivers for VMBus synthetic devices are being updated to fully validate any values read from memory that is shared with Hyper-V, which includes messages from VMBus devices. To facilitate such validation, messages read by the guest from the "in" ring buffer are copied to a temporary buffer that is not shared with Hyper-V. Validation is performed in this temporary buffer without the risk of Hyper-V maliciously modifying the message after it is validated but before it is used.

Synthetic Interrupt Controller (synic)
--------------------------------------
Hyper-V provides each guest CPU with a synthetic interrupt controller that is used by VMBus for host-guest communication. While each synic defines 16 synthetic interrupts (SINT), Linux uses only one of the 16 (VMBUS_MESSAGE_SINT). All interrupts related to communication between the Hyper-V host and a guest CPU use that SINT.

The SINT is mapped to a single per-CPU architectural interrupt (i.e., an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because each CPU in the guest has a synic and may receive VMBus interrupts, they are best modeled in Linux as per-CPU interrupts. This model works well on arm64 where a single per-CPU Linux IRQ is allocated for VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86 interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to call vmbus_isr(). In this case, there's no Linux IRQ, and the interrupts are visible in aggregate in /proc/interrupts on the "HYP" line.

The synic provides the means to demultiplex the architectural interrupt into one or more logical interrupts and route the logical interrupt to the proper VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and related functions that access synic data structures.

The synic is not modeled in Linux as an irq chip or irq domain, and the demultiplexed logical interrupts are not Linux IRQs. As such, they don't appear in /proc/interrupts or /proc/irq. The CPU affinity for one of these logical interrupts is controlled via an entry under /sys/bus/vmbus as described below.

VMBus interrupts
----------------
VMBus provides a mechanism for the guest to interrupt the host when the guest has queued new messages in a ring buffer. The host expects that the guest will send an interrupt only when an "out" ring buffer transitions from empty to non-empty. If the guest sends interrupts at other times, the host deems such interrupts to be unnecessary. If a guest sends an excessive number of unnecessary interrupts, the host may throttle that guest by suspending its execution for a few seconds to prevent a denial-of-service attack.

Similarly, the host will interrupt the guest via the synic when it sends a new message on the VMBus control path, or when a VMBus channel "in" ring buffer transitions from empty to non-empty due to the host inserting a new VMBus channel message.
+
+Synthetic Interrupt Controller (synic)
+--------------------------------------
+Hyper-V provides each guest CPU with a synthetic interrupt controller
+that is used by VMBus for host-guest communication. While each synic
+defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
+(VMBUS_MESSAGE_SINT). All interrupts related to communication between
+the Hyper-V host and a guest CPU use that SINT.
+
+The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
+an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
+each CPU in the guest has a synic and may receive VMBus interrupts,
+they are best modeled in Linux as per-CPU interrupts. This model works
+well on arm64 where a single per-CPU Linux IRQ is allocated for
+VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts with the
+label "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs,
+an x86 interrupt vector is statically allocated
+(HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
+call vmbus_isr(). In this case, there's no Linux IRQ, and the
+interrupts are visible in aggregate in /proc/interrupts on the "HYP"
+line.
+
+The synic provides the means to demultiplex the architectural
+interrupt into one or more logical interrupts and route each logical
+interrupt to the proper VMBus handler in Linux. This demultiplexing
+is done by vmbus_isr() and related functions that access synic data
+structures.
+
+The synic is not modeled in Linux as an irq chip or irq domain,
+and the demultiplexed logical interrupts are not Linux IRQs. As such,
+they don't appear in /proc/interrupts or /proc/irq. The CPU
+affinity for one of these logical interrupts is controlled via an
+entry under /sys/bus/vmbus as described below.
+
+VMBus interrupts
+----------------
+VMBus provides a mechanism for the guest to interrupt the host when
+the guest has queued new messages in a ring buffer. The host
+expects that the guest will send an interrupt only when an "out"
+ring buffer transitions from empty to non-empty. If the guest sends
+interrupts at other times, the host deems such interrupts to be
+unnecessary. If a guest sends an excessive number of unnecessary
+interrupts, the host may throttle that guest by suspending its
+execution for a few seconds to prevent a denial-of-service attack.
+
+Similarly, the host will interrupt the guest via the synic when
+it sends a new message on the VMBus control path, or when a VMBus
+channel "in" ring buffer transitions from empty to non-empty due to
+the host inserting a new VMBus channel message. The control message
+stream and each VMBus channel "in" ring buffer are separate logical
+interrupts that are demultiplexed by vmbus_isr(). vmbus_isr() first
+checks for channel interrupts by calling vmbus_chan_sched(), which
+looks at a synic bitmap to determine which channels have pending
+interrupts on this CPU. If multiple channels have pending interrupts
+for this CPU, they are processed sequentially. When all channel
+interrupts have been processed, vmbus_isr() checks for and processes
+any messages received on the VMBus control path.
+
+The guest CPU that a VMBus channel will interrupt is selected by the
+guest when the channel is created, and the host is informed of that
+selection. VMBus devices are broadly grouped into two categories:
+
+1. "Slow" devices that need only one VMBus channel. These devices
+   (such as keyboard, mouse, heartbeat, and timesync) generate
+   relatively few interrupts. Their VMBus channels are all
+   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
+   CPU 0.
+
+2. "High speed" devices that may use multiple VMBus channels for
+   higher parallelism and performance. These devices include the
+   synthetic SCSI controller and synthetic NIC. Their VMBus
+   channel interrupts are assigned to CPUs that are spread out
+   among the available CPUs in the VM so that interrupts on
+   multiple channels can be processed in parallel.
+
+The assignment of VMBus channel interrupts to CPUs is done in the
+function init_vp_index(). This assignment is done outside of the
+normal Linux interrupt affinity mechanism, so the interrupts are
+neither "unmanaged" nor "managed" interrupts.
+
+The CPU that a VMBus channel will interrupt can be seen in
+/sys/bus/vmbus/devices/<deviceGUID>/
+channels/<channelRelID>/cpu.
+When running on later versions of Hyper-V, the CPU can be changed
+by writing a new value to this sysfs entry. Because VMBus channel
+interrupts are not Linux IRQs, there are no entries in
+/proc/interrupts or /proc/irq corresponding to individual VMBus
+channel interrupts.
+
+Prior to kernel v6.15, an online CPU in a Linux guest could not be
+taken offline if it had VMBus channel interrupts assigned to it.
+Starting in kernel v6.15, any such interrupts are automatically
+reassigned to some other CPU at the time of offlining. The "other"
+CPU is chosen by the implementation and is not load balanced or
+otherwise intelligently determined. If the CPU is onlined again,
+channel interrupts previously assigned to it are not moved back. As
+a result, after multiple CPUs have been offlined, and perhaps
+onlined again, the interrupt-to-CPU mapping may be scrambled and
+non-optimal. In such a case, optimal assignments must be
+re-established manually. For kernels v6.14 and earlier, any
+conflicting channel interrupts must first be manually reassigned to
+another CPU as described above. Then when no channel interrupts are
+assigned to the CPU, it can be taken offline.
+
+The VMBus channel interrupt handling code is designed to work
+correctly even if an interrupt is received on a CPU other than the
+CPU assigned to the channel. Specifically, the code does not use
+CPU-based exclusion for correctness. In normal operation, Hyper-V
+will interrupt the assigned CPU. But when the CPU assigned to a
+channel is being changed via sysfs, the guest doesn't know exactly
+when Hyper-V will make the transition. The code must work correctly
+even if there is a time lag before Hyper-V starts interrupting the
+new CPU. See comments in target_cpu_store().
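+
+In simplified form, the demultiplexing done by vmbus_chan_sched(),
+described earlier in this section, amounts to scanning a per-CPU
+bitmap in the synic event page in which Hyper-V sets one bit per
+channel ("relid") that has a pending interrupt. The sketch below is
+illustrative C pseudocode, not the actual kernel code; the two
+helper functions are invented for the example::
+
+    /* Bitmap with one bit per channel relid; bits are set by
+     * Hyper-V and cleared by the guest as interrupts are claimed.
+     */
+    unsigned long *recv_int_page = this_cpu_event_flags(); /* invented */
+    u32 relid;
+
+    for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) {
+        /* Atomically claim this channel's pending interrupt */
+        if (!sync_test_and_clear_bit(relid, recv_int_page))
+            continue;
+        /* Map the relid to its channel and run the callback */
+        run_channel_callback(relid); /* invented */
+    }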
+
+VMBus device creation/deletion
+------------------------------
+Hyper-V and the Linux guest have a separate message-passing path
+that is used for synthetic device creation and deletion. This
+path does not use a VMBus channel. See vmbus_post_msg() and
+vmbus_on_msg_dpc().
+
+The first step is for the guest to connect to the generic
+Hyper-V VMBus mechanism. As part of establishing this connection,
+the guest and Hyper-V agree on a VMBus protocol version they will
+use. This negotiation allows newer Linux kernels to run on older
+Hyper-V versions, and vice versa.
+
+The guest then tells Hyper-V to "send offers". Hyper-V sends an
+offer message to the guest for each synthetic device that the VM
+is configured to have. Each VMBus device type has a fixed GUID
+known as the "class ID", and each VMBus device instance is also
+identified by a GUID. The offer message from Hyper-V contains
+both GUIDs to uniquely (within the VM) identify the device.
+There is one offer message for each device instance, so a VM with
+two synthetic NICs will get two offer messages with the NIC
+class ID. The ordering of offer messages can vary from boot to boot
+and must not be assumed to be consistent in Linux code. Offer
+messages may also arrive long after Linux has initially booted
+because Hyper-V supports adding devices, such as synthetic NICs,
+to running VMs. A new offer message is processed by
+vmbus_process_offer(), which indirectly invokes
+vmbus_add_channel_work().
+
+Upon receipt of an offer message, the guest identifies the device
+type based on the class ID, and invokes the correct driver to set up
+the device. Driver/device matching is performed using the standard
+Linux mechanism, as sketched in the example below.
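+
+A minimal VSC driver skeleton might look like the following. The
+GUID value, names, and empty function bodies are placeholders for
+illustration; struct hv_driver, struct hv_vmbus_device_id, and
+vmbus_driver_register()/vmbus_driver_unregister() are the real
+interfaces from include/linux/hyperv.h (prototypes as in recent
+kernels)::
+
+    #include <linux/hyperv.h>
+    #include <linux/module.h>
+
+    /* Placeholder class ID; a real driver uses its device's GUID */
+    static const struct hv_vmbus_device_id myvsc_id_table[] = {
+        { .guid = GUID_INIT(0x12345678, 0x1234, 0x1234, 0x12, 0x34,
+                            0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc) },
+        { },
+    };
+    MODULE_DEVICE_TABLE(vmbus, myvsc_id_table);
+
+    static int myvsc_probe(struct hv_device *dev,
+                           const struct hv_vmbus_device_id *id)
+    {
+        /* Open the primary channel, establish the ring buffer
+         * GPADL, and negotiate the device protocol version here.
+         */
+        return 0;
+    }
+
+    static void myvsc_remove(struct hv_device *dev)
+    {
+        /* Runs on unbind, or when Hyper-V rescinds the device */
+    }
+
+    static struct hv_driver myvsc_drv = {
+        .name = "myvsc",
+        .id_table = myvsc_id_table,
+        .probe = myvsc_probe,
+        .remove = myvsc_remove,
+    };
+
+    static int __init myvsc_init(void)
+    {
+        return vmbus_driver_register(&myvsc_drv);
+    }
+
+    static void __exit myvsc_exit(void)
+    {
+        vmbus_driver_unregister(&myvsc_drv);
+    }
+
+    module_init(myvsc_init);
+    module_exit(myvsc_exit);
+    MODULE_LICENSE("GPL");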
+
+The device driver probe function opens the primary VMBus channel to
+the corresponding VSP. It allocates guest memory for the channel
+ring buffers and shares the ring buffer with the Hyper-V host by
+giving the host a list of GPAs for the ring buffer memory. See
+vmbus_establish_gpadl().
+
+Once the ring buffer is set up, the device driver and VSP exchange
+setup messages via the primary channel. These messages may include
+negotiating the device protocol version to be used between the Linux
+VSC and the VSP on the Hyper-V host. The setup messages may also
+include creating additional VMBus channels, which are somewhat
+mis-named as "sub-channels" since they are functionally
+equivalent to the primary channel once they are created.
+
+Finally, the device driver may create entries in /dev as with
+any device driver.
+
+The Hyper-V host can send a "rescind" message to the guest to
+remove a device that was previously offered. Linux drivers must
+handle such a rescind message at any time. Rescinding a device
+invokes the device driver "remove" function to cleanly shut
+down the device and remove it. Once a synthetic device is
+rescinded, neither Hyper-V nor Linux retains any state about
+its previous existence. Such a device might be re-added later,
+in which case it is treated as an entirely new device. See
+vmbus_onoffer_rescind().
+
+For some devices, such as the KVP device, Hyper-V automatically
+sends a rescind message when the primary channel is closed,
+likely as a result of unbinding the device from its driver.
+The rescind causes Linux to remove the device. But then Hyper-V
+immediately reoffers the device to the guest, causing a new
+instance of the device to be created in Linux. For other
+devices, such as the synthetic SCSI and NIC devices, closing the
+primary channel does *not* result in Hyper-V sending a rescind
+message. The device continues to exist in Linux on the VMBus,
+but with no driver bound to it. The same driver or a new driver
+can subsequently be bound to the existing instance of the device.
diff --git a/Documentation/virt/hyperv/vpci.rst b/Documentation/virt/hyperv/vpci.rst
new file mode 100644
index 000000000000..b65b2126ede3
--- /dev/null
+++ b/Documentation/virt/hyperv/vpci.rst
@@ -0,0 +1,316 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+PCI pass-thru devices
+=====================
+In a Hyper-V guest VM, PCI pass-thru devices (also called
+virtual PCI devices, or vPCI devices) are physical PCI devices
+that are mapped directly into the VM's physical address space.
+Guest device drivers can interact directly with the hardware
+without intermediation by the host hypervisor. This approach
+provides higher bandwidth access to the device with lower
+latency, compared with devices that are virtualized by the
+hypervisor. The device should appear to the guest just as it
+would when running on bare metal, so no changes are required
+to the Linux device drivers for the device.
+
+Hyper-V terminology for vPCI devices is "Discrete Device
+Assignment" (DDA). Public documentation for Hyper-V DDA is
+available here: `DDA`_
+
+.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment
+
+DDA is typically used for storage controllers, such as NVMe,
+and for GPUs. A similar mechanism for NICs is called SR-IOV
+and produces the same benefits by allowing a guest device
+driver to interact directly with the hardware. See Hyper-V
+public documentation here: `SR-IOV`_
+
+.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-
+
+This discussion of vPCI devices includes DDA and SR-IOV
+devices.
+
+Device Presentation
+-------------------
+Hyper-V provides full PCI functionality for a vPCI device when
+it is operating, so the Linux device driver for the device can
+be used unchanged, provided it uses the correct Linux kernel
+APIs for accessing PCI config space and for other integration
+with Linux. But the initial detection of the PCI device and
+its integration with the Linux PCI subsystem must use
+Hyper-V-specific mechanisms. Consequently, vPCI devices on
+Hyper-V have a dual identity. They are initially presented to
+Linux guests as VMBus devices via the standard VMBus "offer"
+mechanism, so they have a VMBus identity and appear under
+/sys/bus/vmbus/devices. The VMBus vPCI driver in Linux at
+drivers/pci/controller/pci-hyperv.c handles a newly introduced
+vPCI device by fabricating a PCI bus topology and creating all
+the normal PCI device data structures in Linux that would
+exist if the PCI device were discovered via ACPI on a
+bare-metal system. Once those data structures are set up, the
+device also has a normal PCI identity in Linux, and the normal
+Linux device driver for the vPCI device can function as if it
+were running in Linux on bare metal. Because vPCI devices are
+presented dynamically through the VMBus offer mechanism, they
+do not appear in the Linux guest's ACPI tables. vPCI devices
+may be added to a VM or removed from a VM at any time during
+the life of the VM, and not just during initial boot.
+
+With this approach, the vPCI device is a VMBus device and a
+PCI device at the same time. In response to the VMBus offer
+message, the hv_pci_probe() function runs and establishes a
+VMBus connection to the vPCI VSP on the Hyper-V host. That
+connection has a single VMBus channel. The channel is used to
+exchange messages with the vPCI VSP for the purpose of setting
+up and configuring the vPCI device in Linux. Once the device
+is fully configured in Linux as a PCI device, the VMBus
+channel is used only if Linux changes the guest vCPU that the
+device interrupts, or if the vPCI device is removed from
+the VM while the VM is running. The ongoing operation of the
+device happens directly between the Linux device driver for
+the device and the hardware, with VMBus and the VMBus channel
+playing no role.
+
+PCI Device Setup
+----------------
+PCI device setup follows a sequence that Hyper-V originally
+created for Windows guests, and that can be ill-suited for
+Linux guests due to differences in the overall structure of
+the Linux PCI subsystem compared with Windows. Nonetheless,
+with a bit of hackery in the Hyper-V virtual PCI driver for
+Linux, the virtual PCI device is set up in Linux so that
+generic Linux PCI subsystem code and the Linux driver for the
+device "just work".
+
+Each vPCI device is set up in Linux to be in its own PCI
+domain with a host bridge. The PCI domainID is derived from
+bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
+device. The Hyper-V host does not guarantee that these bytes
+are unique, so hv_pci_probe() has an algorithm to resolve
+collisions. The collision resolution is intended to be stable
+across reboots of the same VM so that the PCI domainIDs don't
+change, as the domainID appears in the user space
+configuration of some devices.
+
+hv_pci_probe() allocates a guest MMIO range to be used as PCI
+config space for the device. This MMIO range is communicated
+to the Hyper-V host over the VMBus channel as part of telling
+the host that the device is ready to enter the D0 power state.
+See hv_pci_enter_d0(). When the guest subsequently accesses
+this MMIO range, the Hyper-V host intercepts the accesses and
+maps them to the physical device PCI config space.
+
+hv_pci_probe() also gets BAR information for the device from
+the Hyper-V host, and uses this information to allocate MMIO
+space for the BARs. That MMIO space is then set up to be
+associated with the host bridge so that it works when generic
+PCI subsystem code in Linux processes the BARs.
+
+Finally, hv_pci_probe() creates the root PCI bus. At this
+point the Hyper-V virtual PCI driver hackery is done, and the
+normal Linux PCI machinery for scanning the root bus works to
+detect the device, to perform driver matching, and to
+initialize the driver and device.
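+
+The domainID derivation described earlier in this section amounts
+to building a 16-bit value from two bytes of the VMBus instance
+GUID. A sketch of the idea (the variable names are illustrative,
+not the exact kernel code)::
+
+    /* Candidate PCI domainID from bytes 4 and 5 of the instance
+     * GUID (guid_t is an array of 16 bytes). A collision-resolution
+     * step then finds a free domainID if this one is already used.
+     */
+    u16 dom_req = guid->b[5] << 8 | guid->b[4];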
+
+PCI Device Removal
+------------------
+A Hyper-V host may initiate removal of a vPCI device from a
+guest VM at any time during the life of the VM. The removal
+is instigated by an admin action taken on the Hyper-V host and
+is not under the control of the guest OS.
+
+A guest VM is notified of the removal by an unsolicited
+"Eject" message sent from the host to the guest over the VMBus
+channel associated with the vPCI device. Upon receipt of such
+a message, the Hyper-V virtual PCI driver in Linux
+asynchronously invokes Linux kernel PCI subsystem calls to
+shut down and remove the device. When those calls are
+complete, an "Ejection Complete" message is sent back to
+Hyper-V over the VMBus channel indicating that the device has
+been removed. At this point, Hyper-V sends a VMBus rescind
+message to the Linux guest, which the VMBus driver in Linux
+processes by removing the VMBus identity for the device. Once
+that processing is complete, all vestiges of the device having
+been present are gone from the Linux kernel. The rescind
+message also indicates to the guest that Hyper-V has stopped
+providing support for the vPCI device in the guest. If the
+guest were to attempt to access that device's MMIO space, it
+would be an invalid reference. Hypercalls affecting the device
+return errors, and any further messages sent in the VMBus
+channel are ignored.
+
+After sending the Eject message, Hyper-V allows the guest VM
+60 seconds to cleanly shut down the device and respond with
+Ejection Complete before sending the VMBus rescind
+message. If for any reason the Eject steps don't complete
+within the allowed 60 seconds, the Hyper-V host forcibly
+performs the rescind steps, which will likely result in
+cascading errors in the guest because the device is now no
+longer present from the guest standpoint and accessing the
+device MMIO space will fail.
+
+Because ejection is asynchronous and can happen at any point
+during the guest VM lifecycle, proper synchronization in the
+Hyper-V virtual PCI driver is very tricky. Ejection has been
+observed even before a newly offered vPCI device has been
+fully set up. The Hyper-V virtual PCI driver has been updated
+several times over the years to fix race conditions when
+ejections happen at inopportune times. Care must be taken when
+modifying this code to prevent re-introducing such problems.
+See comments in the code.
+
+Interrupt Assignment
+--------------------
+The Hyper-V virtual PCI driver supports vPCI devices using
+MSI, multi-MSI, or MSI-X. Assigning the guest vCPU that will
+receive the interrupt for a particular MSI or MSI-X message is
+complex because of the way the Linux setup of IRQs maps onto
+the Hyper-V interfaces. For the single-MSI and MSI-X cases,
+Linux calls hv_compose_msi_msg() twice, with the first call
+containing a dummy vCPU and the second call containing the
+real vCPU. Finally, hv_irq_unmask() is called (on x86) or the
+GICD registers are set (on arm64) to specify the real vCPU
+again. Each of these three calls interacts with Hyper-V, which
+must decide which physical CPU should receive the interrupt
+before it is forwarded to the guest VM. Unfortunately, the
+Hyper-V decision-making process is a bit limited, and can
+result in concentrating the physical interrupts on a single
+CPU, causing a performance bottleneck. See details about how
+this is resolved in the extensive comment above the function
+hv_compose_msi_req_get_cpu().
+
+The Hyper-V virtual PCI driver implements the
+irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
+Unfortunately, on Hyper-V the implementation requires sending
+a VMBus message to the Hyper-V host and awaiting an interrupt
+indicating receipt of a reply message. Since
+irq_chip.irq_compose_msi_msg can be called with IRQ locks
+held, it doesn't work to do the normal sleep until awakened by
+the interrupt. Instead, hv_compose_msi_msg() must send the
+VMBus message, and then poll for the completion message. As
+further complexity, the vPCI device could be ejected/rescinded
+while the polling is in progress, so this scenario must be
+detected as well.
+See comments in the code regarding this
+very tricky area.
+
+Most of the code in the Hyper-V virtual PCI driver (pci-
+hyperv.c) applies to Hyper-V and Linux guests running on x86
+and on arm64 architectures. But there are differences in how
+interrupt assignments are managed. On x86, the Hyper-V
+virtual PCI driver in the guest must make a hypercall to tell
+Hyper-V which guest vCPU should be interrupted by each
+MSI/MSI-X interrupt, and the x86 interrupt vector number that
+the x86_vector IRQ domain has picked for the interrupt. This
+hypercall is made by hv_arch_irq_unmask(). On arm64, the
+Hyper-V virtual PCI driver manages the allocation of an SPI
+for each MSI/MSI-X interrupt. The Hyper-V virtual PCI driver
+stores the allocated SPI in the architectural GICD registers,
+which Hyper-V emulates, so no hypercall is necessary as with
+x86. Hyper-V does not support using LPIs for vPCI devices in
+arm64 guest VMs because it does not emulate a GICv3 ITS.
+
+The Hyper-V virtual PCI driver in Linux supports vPCI devices
+whose drivers create managed or unmanaged Linux IRQs. If the
+smp_affinity for an unmanaged IRQ is updated via the /proc/irq
+interface, the Hyper-V virtual PCI driver is called to tell
+the Hyper-V host to change the interrupt targeting and
+everything works properly. However, on x86 if the x86_vector
+IRQ domain needs to reassign an interrupt vector due to
+running out of vectors on a CPU, there's no path to inform the
+Hyper-V host of the change, and things break. Fortunately,
+guest VMs operate in a constrained device environment where
+using all the vectors on a CPU doesn't happen. Since such a
+problem is only a theoretical concern rather than a practical
+concern, it has been left unaddressed.
+
+DMA
+---
+By default, Hyper-V pins all guest VM memory in the host
+when the VM is created, and programs the physical IOMMU to
+allow the VM to have DMA access to all its memory. Hence
+it is safe to assign PCI devices to the VM, and allow the
+guest operating system to program the DMA transfers. The
+physical IOMMU prevents a malicious guest from initiating
+DMA to memory belonging to the host or to other VMs on the
+host. From the Linux guest standpoint, such DMA transfers
+are in "direct" mode since Hyper-V does not provide a virtual
+IOMMU in the guest.
+
+Hyper-V assumes that physical PCI devices always perform
+cache-coherent DMA. When running on x86, this behavior is
+required by the architecture. When running on arm64, the
+architecture allows for both cache-coherent and
+non-cache-coherent devices, with the behavior of each device
+specified in the ACPI DSDT. But when a PCI device is assigned
+to a guest VM, that device does not appear in the DSDT, so the
+Hyper-V VMBus driver propagates cache-coherency information
+from the VMBus node in the ACPI DSDT to all VMBus devices,
+including vPCI devices (since they have a dual identity as a
+VMBus device and as a PCI device). See vmbus_dma_configure().
+Current Hyper-V versions always indicate that the VMBus is
+cache coherent, so vPCI devices on arm64 always get marked as
+cache coherent and the CPU does not perform any sync
+operations as part of dma_map/unmap_*() calls.
+
+vPCI protocol versions
+----------------------
+As previously described, during vPCI device setup and teardown,
+messages are passed over a VMBus channel between the Hyper-V
+host and the Hyper-V vPCI driver in the Linux guest.
+Some messages have been revised in newer versions of Hyper-V, so
+the guest and host must agree on the vPCI protocol version to
+be used. The version is negotiated when communication over
+the VMBus channel is first established. See
+hv_pci_protocol_negotiation(). Newer versions of the protocol
+extend support to VMs with more than 64 vCPUs, and provide
+additional information about the vPCI device, such as the
+guest virtual NUMA node to which it is most closely affined in
+the underlying hardware.
+
+Guest NUMA node affinity
+------------------------
+When the vPCI protocol version provides it, the guest NUMA
+node affinity of the vPCI device is stored as part of the Linux
+device information for subsequent use by the Linux driver. See
+hv_pci_assign_numa_node(). If the negotiated protocol version
+does not support the host providing NUMA affinity information,
+the Linux guest defaults the device NUMA node to 0. But even
+when the negotiated protocol version includes NUMA affinity
+information, the ability of the host to provide such
+information depends on certain host configuration options. If
+the guest receives NUMA node value "0", it could mean NUMA
+node 0, or it could mean "no information is available".
+Unfortunately, it is not possible to distinguish the two cases
+from the guest side.
+
+PCI config space access in a CoCo VM
+------------------------------------
+Linux PCI device drivers access PCI config space using a
+standard set of functions provided by the Linux PCI subsystem.
+In Hyper-V guests, these standard functions map to
+hv_pcifront_read_config() and hv_pcifront_write_config()
+in the Hyper-V virtual PCI driver. In normal VMs,
+these hv_pcifront_*() functions directly access the PCI config
+space, and the accesses trap to Hyper-V to be handled.
+But in CoCo VMs, memory encryption prevents Hyper-V
+from reading the guest instruction stream to emulate the
+access, so the hv_pcifront_*() functions must invoke
+hypercalls with explicit arguments describing the access to be
+made.
+
+Config Block back-channel
+-------------------------
+The Hyper-V host and Hyper-V virtual PCI driver in Linux
+together implement a non-standard back-channel communication
+path between the host and guest. The back-channel path uses
+messages sent over the VMBus channel associated with the vPCI
+device. The functions hyperv_read_cfg_blk() and
+hyperv_write_cfg_blk() are the primary interfaces provided to
+other parts of the Linux kernel. As of this writing, these
+interfaces are used only by the Mellanox mlx5 driver to pass
+diagnostic data to a Hyper-V host running in the Azure public
+cloud. The functions hyperv_read_cfg_blk() and
+hyperv_write_cfg_blk() are implemented in a separate module
+(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
+effectively stubs them out when running in non-Hyper-V
+environments.
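+
+The following sketch shows how a driver might call the back-channel
+read interface. The block ID and buffer size are made-up example
+values; valid block IDs are defined by the host-side agent for a
+specific device. The function prototype is the one declared in
+include/linux/hyperv.h in recent kernels::
+
+    #include <linux/hyperv.h>
+    #include <linux/pci.h>
+
+    /* "pdev" is assumed to be the struct pci_dev for a vPCI
+     * device that the caller already holds.
+     */
+    u8 buf[128];
+    unsigned int bytes_returned;
+    int ret;
+
+    /* Block ID 1 is a made-up example value */
+    ret = hyperv_read_cfg_blk(pdev, buf, sizeof(buf), 1,
+                              &bytes_returned);
+    if (ret == 0)
+        pr_info("read %u bytes from config block\n", bytes_returned);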