aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/include/uapi/linux/vfio.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-03-24vfio: Introduce VFIO_DEVICE_FEATURE ioctl and first userAlex Williamson1-0/+37
The VFIO_DEVICE_FEATURE ioctl is meant to be a general purpose, device agnostic ioctl for setting, retrieving, and probing device features. This implementation provides a 16-bit field for specifying a feature index, where the data porition of the ioctl is determined by the semantics for the given feature. Additional flag bits indicate the direction and nature of the operation; SET indicates user data is provided into the device feature, GET indicates the device feature is written out into user data. The PROBE flag augments determining whether the given feature is supported, and if provided, whether the given operation on the feature is supported. The first user of this ioctl is for setting the vfio-pci VF token, where the user provides a shared secret key (UUID) on a SR-IOV PF device, which users must provide when opening associated VF devices. Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-08-23Merge branches 'v5.4/vfio/alexey-tce-memory-free-v1', 'v5.4/vfio/connie-re-arrange-v2', 'v5.4/vfio/hexin-pci-reset-v3', 'v5.4/vfio/parav-mtty-uuid-v2' and 'v5.4/vfio/shameer-iova-list-v8' into v5.4/vfio/nextAlex Williamson1-20/+51
2019-08-19vfio/type1: Add IOVA range capability supportShameer Kolothum1-1/+25
This allows the user-space to retrieve the supported IOVA range(s), excluding any non-relaxable reserved regions. The implementation is based on capability chains, added to VFIO_IOMMU_GET_INFO ioctl. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-08-19vfio: re-arrange vfio region definitionsCornelia Huck1-19/+26
It is easy to miss already defined region types. Let's re-arrange the definitions a bit and add more comments to make it hopefully a bit clearer. No functional change. Signed-off-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-04-24vfio-ccw: add handling for async channel instructionsCornelia Huck1-0/+2
Add a region to the vfio-ccw device that can be used to submit asynchronous I/O instructions. ssch continues to be handled by the existing I/O region; the new region handles hsch and csch. Interrupt status continues to be reported through the same channels as for ssch. Acked-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-04-24vfio-ccw: add capabilities chainCornelia Huck1-0/+2
Allow to extend the regions used by vfio-ccw. The first user will be handling of halt and clear subchannel. Reviewed-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2018-12-21vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriverAlexey Kardashevskiy1-0/+42
POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not pluggable PCIe devices but still have PCIe links which are used for config space and MMIO. In addition to that the GPUs have 6 NVLinks which are connected to other GPUs and the POWER9 CPU. POWER9 chips have a special unit on a die called an NPU which is an NVLink2 host bus adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each. These systems also support ATS (address translation services) which is a part of the NVLink2 protocol. Such GPUs also share on-board RAM (16GB or 32GB) to the system via the same NVLink2 so a CPU has cache-coherent access to a GPU RAM. This exports GPU RAM to the userspace as a new VFIO device region. This preregisters the new memory as device memory as it might be used for DMA. This inserts pfns from the fault handler as the GPU memory is not onlined until the vendor driver is loaded and trained the NVLinks so doing this earlier causes low level errors which we fence in the firmware so it does not hurt the host system but still better be avoided; for the same reason this does not map GPU RAM into the host kernel (usual thing for emulated access otherwise). This exports an ATSD (Address Translation Shootdown) register of NPU which allows TLB invalidations inside GPU for an operating system. The register conveniently occupies a single 64k page. It is also presented to the userspace as a new VFIO device region. One NPU has 8 ATSD registers, each of them can be used for TLB invalidation in a GPU linked to this NPU. This allocates one ATSD register per an NVLink bridge allowing passing up to 6 registers. Due to the host firmware bug (just recently fixed), only 1 ATSD register per NPU was actually advertised to the host system so this passes that alone register via the first NVLink bridge device in the group which is still enough as QEMU collects them all back and presents to the guest via vPHB to mimic the emulated NPU PHB on the host. In order to provide the userspace with the information about GPU-to-NVLink connections, this exports an additional capability called "tgt" (which is an abbreviated host system bus address). The "tgt" property tells the GPU its own system address and allows the guest driver to conglomerate the routing information so each GPU knows how to get directly to the other GPUs. For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to know LPID (a logical partition ID or a KVM guest hardware ID in other words) and PID (a memory context ID of a userspace process, not to be confused with a linux pid). This assigns a GPU to LPID in the NPU and this is why this adds a listener for KVM on an IOMMU group. A PID comes via NVLink from a GPU and NPU uses a PID wildcard to pass it through. This requires coherent memory and ATSD to be available on the host as the GPU vendor only supports configurations with both features enabled and other configurations are known not to work. Because of this and because of the ways the features are advertised to the host system (which is a device tree with very platform specific properties), this requires enabled POWERNV platform. The V100 GPUs do not advertise any of these capabilities via the config space and there are more than just one device ID so this relies on the platform to tell whether these GPUs have special abilities such as NVLinks. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-31Merge tag 'vfio-v4.20-rc1.v2' of git://github.com/awilliam/linux-vfioLinus Torvalds1-0/+50
Pull VFIO updates from Alex Williamson: - EDID interfaces for vfio devices supporting display extensions (Gerd Hoffmann) - Generically select Type-1 IOMMU model support on ARM/ARM64 (Geert Uytterhoeven) - Quirk for VFs reporting INTx pin (Alex Williamson) - Fix error path memory leak in MSI support (Li Qiang) * tag 'vfio-v4.20-rc1.v2' of git://github.com/awilliam/linux-vfio: vfio: add edid support to mbochs sample driver vfio: add edid api for display (vgpu) devices. drivers/vfio: Allow type-1 IOMMU instantiation with all ARM/ARM64 IOMMUs vfio/pci: Mask buggy SR-IOV VF INTx support vfio/pci: Fix potential memory leak in vfio_msi_cap_len
2018-10-11vfio: add edid api for display (vgpu) devices.Gerd Hoffmann1-0/+50
This allows to set EDID monitor information for the vgpu display, for a more flexible display configuration, using a special vfio region. Check the comment describing struct vfio_region_gfx_edid for more details. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-09-28s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctlTony Krowiak1-0/+1
Adds support for the VFIO_DEVICE_GET_INFO ioctl to the VFIO AP Matrix device driver. This is a minimal implementation, as vfio-ap does not use I/O regions. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-13-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: register matrix device with VFIO mdev frameworkTony Krowiak1-0/+1
Registers the matrix device created by the VFIO AP device driver with the VFIO mediated device framework. Registering the matrix device will create the sysfs structures needed to create mediated matrix devices each of which will be used to configure the AP matrix for a guest and connect it to the VFIO AP device driver. Registering the matrix device with the VFIO mediated device framework will create the following sysfs structures: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ create To create a mediated device for the AP matrix device, write a UUID to the create file: uuidgen > create A symbolic link to the mediated device's directory will be created in the devices subdirectory named after the generated $uuid: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ............... [$uuid] A symbolic link to the mediated device will also be created in the vfio_ap matrix's directory: /sys/devices/vfio_ap/matrix/[$uuid] Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Message-Id: <20180925231641.4954-6-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-26vfio/pci: Add ioeventfd supportAlex Williamson1-0/+27
The ioeventfd here is actually irqfd handling of an ioeventfd such as supported in KVM. A user is able to pre-program a device write to occur when the eventfd triggers. This is yet another instance of eventfd-irqfd triggering between KVM and vfio. The impetus for this is high frequency writes to pages which are virtualized in QEMU. Enabling this near-direct write path for selected registers within the virtualized page can improve performance and reduce overhead. Specifically this is initially targeted at NVIDIA graphics cards where the driver issues a write to an MMIO register within a virtualized region in order to allow the MSI interrupt to re-trigger. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-01Merge tag 'drm-for-v4.16' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds1-0/+62
Pull drm updates from Dave Airlie: "This seems to have been a comparatively quieter merge window, I assume due to holidays etc. The "biggest" change is AMD header cleanups, which merge/remove a bunch of them. The AMD gpu scheduler is now being made generic with the etnaviv driver wanting to reuse the code, hopefully other drivers can go in the same direction. Otherwise it's the usual lots of stuff in i915/amdgpu, not so much stuff elsewhere. Core: - Add .last_close and .output_poll_changed helpers to reduce driver footprints - Fix plane clipping - Improved debug printing support - Add panel orientation property - Update edid derived properties at edid setting - Reduction in fbdev driver footprint - Move amdgpu scheduler into core for other drivers to use. i915: - Selftest and IGT improvements - Fast boot prep work on IPS, pipe config - HW workarounds for Cannonlake, Geminilake - Cannonlake clock and HDMI2.0 fixes - GPU cache invalidation and context switch improvements - Display planes cleanup - New PMU interface for perf queries - New firmware support for KBL/SKL - Geminilake HW workaround for perforamce - Coffeelake stolen memory improvements - GPU reset robustness work - Cannonlake horizontal plane flipping - GVT work amdgpu/radeon: - RV and Vega header file cleanups (lots of lines gone!) - TTM operation context support - 48-bit GPUVM support for Vega/RV - ECC support for Vega - Resizeable BAR support - Multi-display sync support - Enable swapout for reserved BOs during allocation - S3 fixes on Raven - GPU reset cleanup and fixes - 2+1 level GPU page table amdkfd: - GFX7/8 SDMA user queues support - Hardware scheduling for multiple processes - dGPU prep work rcar: - Added R8A7743/5 support - System suspend/resume support sun4i: - Multi-plane support for YUV formats - A83T and LVDS support msm: - Devfreq support for GPU tegra: - Prep work for adding Tegra186 support - Tegra186 HDMI support - HDMI2.0 and zpos support by using generic helpers tilcdc: - Misc fixes omapdrm: - Support memory bandwidth limits - DSI command mode panel cleanups - DMM error handling exynos: - drop the old IPP subdriver. etnaviv: - Occlusion query fixes - Job handling fixes - Prep work for hooking in gpu scheduler armada: - Move closer to atomic modesetting - Allow disabling primary plane if overlay is full screen imx: - Format modifier support - Add tile prefetch to PRE - Runtime PM support for PRG ast: - fix LUT loading" * tag 'drm-for-v4.16' of git://people.freedesktop.org/~airlied/linux: (1471 commits) drm/ast: Load lut in crtc_commit drm: Check for lessee in DROP_MASTER ioctl drm: fix gpu scheduler link order drm/amd/display: Demote error print to debug print when ATOM impl missing dma-buf: fix reservation_object_wait_timeout_rcu once more v2 drm/amdgpu: Avoid leaking PM domain on driver unbind (v2) drm/amd/amdgpu: Add Polaris version check drm/amdgpu: Reenable manual GPU reset from sysfs drm/amdgpu: disable MMHUB power gating on raven drm/ttm: Don't unreserve swapped BOs that were previously reserved drm/ttm: Don't add swapped BOs to swap-LRU list drm/amdgpu: only check for ECC on Vega10 drm/amd/powerplay: Fix smu_table_entry.handle type drm/ttm: add VADDR_FLAG_UPDATED_COUNT to correctly update dma_page global count drm: Fix PANEL_ORIENTATION_QUIRKS breaking the Kconfig DRM menuconfig drm/radeon: fill in rb backend map on evergreen/ni. drm/amdgpu/gfx9: fix ngg enablement to clear gds reserved memory (v2) drm/ttm: only free pages rather than update global memory count together drm/amdgpu: fix CPU based VM updates drm/amdgpu: fix typo in amdgpu_vce_validate_bo ...
2017-12-20vfio-pci: Allow mapping MSIX BARAlexey Kardashevskiy1-0/+10
By default VFIO disables mapping of MSIX BAR to the userspace as the userspace may program it in a way allowing spurious interrupts; instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl. In order to eliminate guessing from the userspace about what is mmapable, VFIO also advertises a sparse list of regions allowed to mmap. This works fine as long as the system page size equals to the MSIX alignment requirement which is 4KB. However with a bigger page size the existing code prohibits mapping non-MSIX parts of a page with MSIX structures so these parts have to be emulated via slow reads/writes on a VFIO device fd. If these emulated bits are accessed often, this has serious impact on performance. This allows mmap of the entire BAR containing MSIX vector table. This removes the sparse capability for PCI devices as it becomes useless. As the userspace needs to know for sure whether mmapping of the MSIX vector containing data can succeed, this adds a new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace that the entire BAR can be mmapped. This does not touch the MSIX mangling in the BAR read/write handlers as we are doing this just to enable direct access to non MSIX registers. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw - fixup whitespace, trim function name] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-08Merge airlied/drm-next into drm-intel-next-queuedRodrigo Vivi1-0/+1
Chris requested this backmerge for a reconciliation on drm_print.h between drm-misc-next and drm-intel-next-queued Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2017-12-04vfio: ABI for mdev display dma-buf operationTina Zhang1-0/+62
Add VFIO_DEVICE_QUERY_GFX_PLANE ioctl command to let user query and get a plane and its information. So far, two types of buffers are supported: buffers based on dma-buf and buffers based on region. This ioctl can be invoked with: 1) Either DMABUF or REGION flag. Vendor driver returns a plane_info successfully only when the specific kind of buffer is supported. 2) Flag PROBE. And at the same time either DMABUF or REGION must be set, so that vendor driver returns success only when the specific kind of buffer is supported. Add VFIO_DEVICE_GET_GFX_DMABUF ioctl command to let user get a specific dma-buf fd of an exposed MDEV buffer provided by dmabuf_id which was returned in VFIO_DEVICE_QUERY_GFX_PLANE ioctl command. The life cycle of an exposed MDEV buffer is handled by userspace and tracked by kernel space. The returned dmabuf_id in struct vfio_device_ query_gfx_plane can be a new id of a new exposed buffer or an old id of a re-exported buffer. Host user can check the value of dmabuf_id to see if it needs to create new resources according to the new exposed buffer or just re-use the existing resource related to the old buffer. v18: - update comments for VFIO_DEVICE_GET_GFX_DMABUF. (Alex) v17: - modify VFIO_DEVICE_GET_GFX_DMABUF interface. (Alex) v16: - add x_hot and y_hot fields. (Gerd) - add comments for VFIO_DEVICE_GET_GFX_DMABUF. (Alex) - rebase to 4.14.0-rc6. v15: - add a ioctl to get a dmabuf for a given dmabuf id. (Gerd) v14: - add PROBE, DMABUF and REGION flags. (Alex) v12: - add drm_format_mod back. (Gerd and Zhenyu) - add region_index. (Gerd) v11: - rename plane_type to drm_plane_type. (Gerd) - move fields of vfio_device_query_gfx_plane to vfio_device_gfx_plane_info. (Gerd) - remove drm_format_mod, start fields. (Daniel) - remove plane_id. v10: - refine the ABI API VFIO_DEVICE_QUERY_GFX_PLANE. (Alex) (Gerd) v3: - add a field gvt_plane_info in the drm_i915_gem_obj structure to save the decoded plane information to avoid look up while need the plane info. (Gerd) Signed-off-by: Tina Zhang <tina.zhang@intel.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2017-11-02License cleanup: add SPDX license identifier to uapi header files with a licenseGreg Kroah-Hartman1-0/+1
Many user space API headers have licensing information, which is either incomplete, badly formatted or just a shorthand for referring to the license under which the file is supposed to be. This makes it hard for compliance tools to determine the correct license. Update these files with an SPDX license identifier. The identifier was chosen based on the license information in the file. GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license identifier with the added 'WITH Linux-syscall-note' exception, which is the officially assigned exception identifier for the kernel syscall exception: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". This exception makes it possible to include GPL headers into non GPL code, without confusing license compliance tools. Headers which have either explicit dual licensing or are just licensed under a non GPL license are updated with the corresponding SPDX identifier and the GPLv2 with syscall exception identifier. The format is: ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE) SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. The update does not remove existing license information as this has to be done on a case by case basis and the copyright holders might have to be consulted. This will happen in a separate step. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31vfio: ccw: realize VFIO_DEVICE_G(S)ET_IRQ_INFO ioctlsDong Jia Shi1-2/+8
Realize VFIO_DEVICE_GET_IRQ_INFO ioctl to retrieve VFIO_CCW_IO_IRQ information. Realize VFIO_DEVICE_SET_IRQS ioctl to set an eventfd fd for VFIO_CCW_IO_IRQ. Once a write operation to the ccw_io_region was performed, trigger a signal on this fd. Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20170317031743.40128-12-bjsdjshi@linux.vnet.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2017-03-31vfio: ccw: realize VFIO_DEVICE_GET_REGION_INFO ioctlDong Jia Shi1-0/+11
Introduce device information about vfio-ccw: VFIO_DEVICE_FLAGS_CCW. Realize VFIO_DEVICE_GET_REGION_INFO ioctl for vfio-ccw. Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20170317031743.40128-10-bjsdjshi@linux.vnet.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2017-03-31vfio: ccw: define device_api stringsDong Jia Shi1-0/+1
Define vfio-ccw device API strings. CCW vendor driver using mediated device framework should use this string for device_api attribute. Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20170317031743.40128-4-bjsdjshi@linux.vnet.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2016-11-17vfio: Define device_api stringsKirti Wankhede1-0/+10
Defined device API strings. Vendor driver using mediated device framework should use corresponding string for device_api attribute. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Intel IGD host and LCP bridge config space accessAlex Williamson1-0/+3
Provide read-only access to PCI config space of the PCI host bridge and LPC bridge through device specific regions. This may be used to configure a VM with matching register contents to satisfy driver requirements. Providing this through the vfio file descriptor removes an additional userspace requirement for access through pci-sysfs and removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to the specific devices we're accessing. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Intel IGD OpRegion supportAlex Williamson1-0/+5
This is the first consumer of vfio device specific resource support, providing read-only access to the OpRegion for Intel graphics devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio: Define device specific region type capabilityAlex Williamson1-1/+30
To this point vfio has only provided an interface to the user that allows them to determine the number of regions and specifics about each region. What the region represents is left to the vfio bus driver. vfio-pci chooses to use fixed indexes for fixed resources, index 0 is BAR0, 1 is BAR1,... 7 is config space, etc. This works pretty well since all PCI devices have these regions, even if they don't necessarily populate all of them. Then we start to add things like VGA, which only certain device even support. We added this the same way, but now we've wasted a region index, and due to our offset implementation the corresponding address space, for all devices. Rather than continuing that process, let's try to make regions self describing by including a capability that defines their type. For vfio-pci we'll make the current VFIO_PCI_NUM_REGIONS fixed, defining the end of the static indexes and the beginning of self describing regions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio: Define sparse mmap capability for regionsAlex Williamson1-1/+25
We can't always support mmap across an entire device region, for example we deny mmaps covering the MSI-X table of PCI devices, but we don't really have a way to report it. We expect the user to implicitly know this restriction. We also can't split the region because vfio-pci defines an API with fixed region index to BAR number mapping. We therefore define a new capability which lists areas within the region that may be mmap'd. In addition to the MSI-X case, this potentially enables in-kernel emulation and extensions to devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio: Define capability chainsAlex Williamson1-0/+27
We have a few cases where we need to extend the data returned from the INFO ioctls in VFIO. For instance we already have devices exposed through vfio-pci where VFIO_DEVICE_GET_REGION_INFO reports the region as mmap-capable, but really only supports sparse mmaps, avoiding the MSI-X table. If we wanted to provide in-kernel emulation or extended functionality for devices, we'd also want the ability to tell the user not to mmap various regions, rather than forcing them to figure it out on their own. Another example is VFIO_IOMMU_GET_INFO. We'd really like to expose the actual IOVA capabilities of the IOMMU rather than letting the user assume the address space they have available to them. We could add IOVA base and size fields to struct vfio_iommu_type1_info, but what if we have multiple IOVA ranges. For instance x86 uses a range of addresses at 0xfee00000 for MSI vectors. These typically are not available for standard DMA IOVA mappings and splits our available IOVA space into two ranges. POWER systems have both an IOVA window below 4G as well as dynamic data window which they can use to remap all of guest memory. Representing variable sized arrays within a fixed structure makes it very difficult to parse, we'd therefore like to put this data beyond fixed fields within the data structures. One way to do this is to emulate capabilities in PCI configuration space. A new flag indciates whether capabilties are supported and a new fixed field reports the offset of the first entry. Users can then walk the chain to find capabilities, adding capabilities does not require additional fields in the fixed structure, and parsing variable sized data becomes trivial. This patch outlines the theory and base header structure, which should be shared by all future users. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-21vfio: Include No-IOMMU modeAlex Williamson1-0/+7
There is really no way to safely give a user full access to a DMA capable device without an IOMMU to protect the host system. There is also no way to provide DMA translation, for use cases such as device assignment to virtual machines. However, there are still those users that want userspace drivers even under those conditions. The UIO driver exists for this use case, but does not provide the degree of device access and programming that VFIO has. In an effort to avoid code duplication, this introduces a No-IOMMU mode for VFIO. This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling the "enable_unsafe_noiommu_mode" option on the vfio driver. This should make it very clear that this mode is not safe. Additionally, CAP_SYS_RAWIO privileges are necessary to work with groups and containers using this mode. Groups making use of this support are named /dev/vfio/noiommu-$GROUP and can only make use of the special VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically binding a device without a native IOMMU group to a VFIO bus driver will taint the kernel and should therefore not be considered supported. This patch includes no-iommu support for the vfio-pci bus driver only. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-12-21vfio: Add explicit alignments in vfio_iommu_spapr_tce_createAlexey Kardashevskiy1-0/+2
The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields which should have resulted in sizeof(fio_iommu_spapr_tce_create) equal to 32 bytes. However due to the gcc's default alignment, the actual size of this struct is 40 bytes. This fills gaps with __resv1/2 fields. This should not cause any change in behavior. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-04Revert: "vfio: Include No-IOMMU mode"Alex Williamson1-7/+0
Revert commit 033291eccbdb ("vfio: Include No-IOMMU mode") due to lack of a user. This was originally intended to fill a need for the DPDK driver, but uptake has been slow so rather than support an unproven kernel interface revert it and revisit when userspace catches up. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-04vfio: Include No-IOMMU modeAlex Williamson1-0/+7
There is really no way to safely give a user full access to a DMA capable device without an IOMMU to protect the host system. There is also no way to provide DMA translation, for use cases such as device assignment to virtual machines. However, there are still those users that want userspace drivers even under those conditions. The UIO driver exists for this use case, but does not provide the degree of device access and programming that VFIO has. In an effort to avoid code duplication, this introduces a No-IOMMU mode for VFIO. This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling the "enable_unsafe_noiommu_mode" option on the vfio driver. This should make it very clear that this mode is not safe. Additionally, CAP_SYS_RAWIO privileges are necessary to work with groups and containers using this mode. Groups making use of this support are named /dev/vfio/noiommu-$GROUP and can only make use of the special VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically binding a device without a native IOMMU group to a VFIO bus driver will taint the kernel and should therefore not be considered supported. This patch includes no-iommu support for the vfio-pci bus driver only. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-06-11vfio: powerpc/spapr: Support Dynamic DMA windowsAlexey Kardashevskiy1-3/+58
This adds create/remove window ioctls to create and remove DMA windows. sPAPR defines a Dynamic DMA windows capability which allows para-virtualized guests to create additional DMA windows on a PCI bus. The existing linux kernels use this new window to map the entire guest memory and switch to the direct DMA operations saving time on map/unmap requests which would normally happen in a big amounts. This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows. Up to 2 windows are supported now by the hardware and by this driver. This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional information such as a number of supported windows and maximum number levels of TCE tables. DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature as we still want to support v2 on platforms which cannot do DDW for the sake of TCE acceleration in KVM (coming soon). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: Register memory and define IOMMU v2Alexey Kardashevskiy1-0/+27
The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would requite additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires: 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support DDW API), this creates a default DMA window for IODA2 for consistency. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-12drivers/vfio: Support EEH error injectionGavin Shan1-1/+13
The patch adds one more EEH sub-command (VFIO_EEH_PE_INJECT_ERR) to inject the specified EEH error, which is represented by (struct vfio_eeh_pe_err), to the indicated PE for testing purpose. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-16vfio: amba: VFIO support for AMBA devicesAntonios Motakis1-0/+1
Add support for discovering AMBA devices with VFIO and handle them similarly to Linux platform devices. Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com> Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16vfio: platform: probe to devices on the platform busAntonios Motakis1-0/+1
Driver to bind to Linux platform devices, and callbacks to discover their resources to be used by the main VFIO PLATFORM code. Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com> Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10vfio-pci: Add device request interfaceAlex Williamson1-0/+1
Userspace can opt to receive a device request notification, indicating that the device should be released. This is setup the same way as the error IRQ and also supports eventfd signaling. Future support may forcefully remove the device from the user if the request is ignored. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU typeWill Deacon1-0/+3
VFIO allows devices to be safely handed off to userspace by putting them behind an IOMMU configured to ensure DMA and interrupt isolation. This enables userspace KVM clients, such as kvmtool and qemu, to further map the device into a virtual machine. With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU translation services to the guest operating system, which are nested with the existing translation installed by VFIO. However, enabling this feature means that the IOMMU driver must be informed that the VFIO domain is being created for the purposes of nested translation. This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO type-1 driver. The new IOMMU type acts identically to the VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING attribute on its IOMMU domains. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-05drivers/vfio: EEH support for VFIO PCI deviceGavin Shan1-0/+34
The patch adds new IOCTL commands for sPAPR VFIO container device to support EEH functionality for PCI devices, which have been passed through from host to somebody else via VFIO. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Alexander Graf <agraf@suse.de> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-26vfio/type1: Add extension to test DMA cache coherence of IOMMUAlex Williamson1-0/+5
Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to be able to test whether coherency is currently enforced. Add an extension for this. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26vfio/iommu_type1: Multi-IOMMU domain supportAlex Williamson1-0/+1
We currently have a problem that we cannot support advanced features of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that those features will be supported by all of the hardware units involved with the domain over its lifetime. For instance, the Intel VT-d architecture does not require that all DRHDs support snoop control. If we create a domain based on a device behind a DRHD that does support snoop control and enable SNP support via the IOMMU_CACHE mapping option, we cannot then add a device behind a DRHD which does not support snoop control or we'll get reserved bit faults from the SNP bit in the pagetables. To add to the complexity, we can't know the properties of a domain until a device is attached. We could pass this problem off to userspace and require that a separate vfio container be used, but we don't know how to handle page accounting in that case. How do we know that a page pinned in one container is the same page as a different container and avoid double billing the user for the page. The solution is therefore to support multiple IOMMU domains per container. In the majority of cases, only one domain will be required since hardware is typically consistent within a system. However, this provides us the ability to validate compatibility of domains and support mixed environments where page table flags can be different between domains. To do this, our DMA tracking needs to change. We currently try to coalesce user mappings into as few tracking entries as possible. The problem then becomes that we lose granularity of user mappings. We've never guaranteed that a user is able to unmap at a finer granularity than the original mapping, but we must honor the granularity of the original mapping. This coalescing code is therefore removed, allowing only unmaps covering complete maps. The change in accounting is fairly small here, a typical QEMU VM will start out with roughly a dozen entries, so it's arguable if this coalescing was ever needed. We also move IOMMU domain creation to the point where a group is attached to the container. An interesting side-effect of this is that we now have access to the device at the time of domain creation and can probe the devices within the group to determine the bus_type. This finally makes vfio_iommu_type1 completely device/bus agnostic. In fact, each IOMMU domain can host devices on different buses managed by different physical IOMMUs, and present a single DMA mapping interface to the user. When a new domain is created, mappings are replayed to bring the IOMMU pagetables up to the state of the current container. And of course, DMA mapping and unmapping automatically traverse all of the configured IOMMU domains. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: Varun Sethi <Varun.Sethi@freescale.com>
2013-09-04vfio-pci: PCI hot reset interfaceAlex Williamson1-0/+38
The current VFIO_DEVICE_RESET interface only maps to PCI use cases where we can isolate the reset to the individual PCI function. This means the device must support FLR (PCIe or AF), PM reset on D3hot->D0 transition, device specific reset, or be a singleton device on a bus for a secondary bus reset. FLR does not have widespread support, PM reset is not very reliable, and bus topology is dictated by the system and device design. We need to provide a means for a user to induce a bus reset in cases where the existing mechanisms are not available or not reliable. This device specific extension to VFIO provides the user with this ability. Two new ioctls are introduced: - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO - VFIO_DEVICE_PCI_HOT_RESET The first provides the user with information about the extent of devices affected by a hot reset. This is essentially a list of devices and the IOMMU groups they belong to. The user may then initiate a hot reset by calling the second ioctl. We must be careful that the user has ownership of all the affected devices found via the first ioctl, so the second ioctl takes a list of file descriptors for the VFIO groups affected by the reset. Each group must have IOMMU protection established for the ioctl to succeed. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-10Merge tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfioLinus Torvalds1-2/+6
Pull vfio updates from Alex Williamson: "Largely hugepage support for vfio/type1 iommu and surrounding cleanups and fixes" * tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio: vfio/type1: Fix leak on error path vfio: Limit group opens vfio/type1: Fix missed frees and zero sized removes vfio: fix documentation vfio: Provide module option to disable vfio_iommu_type1 hugepage support vfio: hugepage support for vfio_iommu_type1 vfio: Convert type1 iommu to use rbtree
2013-06-21vfio: hugepage support for vfio_iommu_type1Alex Williamson1-2/+6
We currently send all mappings to the iommu in PAGE_SIZE chunks, which prevents the iommu from enabling support for larger page sizes. We still need to pin pages, which means we step through them in PAGE_SIZE chunks, but we can batch up contiguous physical memory chunks to allow the iommu the opportunity to use larger pages. The approach here is a bit different that the one currently used for legacy KVM device assignment. Rather than looking at the vma page size and using that as the maximum size to pass to the iommu, we instead simply look at whether the next page is physically contiguous. This means we might ask the iommu to map a 4MB region, while legacy KVM might limit itself to a maximum of 2MB. Splitting our mapping path also allows us to be smarter about locked memory because we can more easily unwind if the user attempts to exceed the limit. Therefore, rather than assuming that a mapping will result in locked memory, we test each page as it is pinned to determine whether it locks RAM vs an mmap'd MMIO region. This should result in better locking granularity and less locked page fudge factors in userspace. The unmap path uses the same algorithm as legacy KVM. We don't want to track the pfn for each mapping ourselves, but we need the pfn in order to unpin pages. We therefore ask the iommu for the iova to physical address translation, ask it to unpin a page, and see how many pages were actually unpinned. iommus supporting large pages will often return something bigger than a page here, which we know will be physically contiguous and we can unpin a batch of pfns. iommus not supporting large mappings won't see an improvement in batching here as they only unmap a page at a time. With this change, we also make a clarification to the API for mapping and unmapping DMA. We can only guarantee unmaps at the same granularity as used for the original mapping. In other words, unmapping a subregion of a previous mapping is not guaranteed and may result in a larger or smaller unmapping than requested. The size field in the unmapping structure is updated to reflect this. Previously this was unmodified on mapping, always returning the the requested unmap size. This is now updated to return the actual unmap size on success, allowing userspace to appropriately track mappings. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-20powerpc/vfio: Implement IOMMU driver for VFIOAlexey Kardashevskiy1-0/+34
VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This implements an IOMMU driver for VFIO which does mapping/unmapping pages for the guest IO and provides information about DMA window (required by a POWER guest). Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-03-11VFIO-AER: Vfio-pci driver changes for supporting AERVijay Mohan Pandarathil1-0/+1
- New VFIO_SET_IRQ ioctl option to pass the eventfd that is signaled when an error occurs in the vfio_pci_device - Register pci_error_handler for the vfio_pci driver - When the device encounters an error, the error handler registered by the vfio_pci driver gets invoked by the AER infrastructure - In the error handler, signal the eventfd registered for the device. - This results in the qemu eventfd handler getting invoked and appropriate action taken for the guest. Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-02-18vfio-pci: Add support for VGA region accessAlex Williamson1-0/+9
PCI defines display class VGA regions at I/O port address 0x3b0, 0x3c0 and MMIO address 0xa0000. As these are non-overlapping, we can ignore the I/O port vs MMIO difference and expose them both in a single region. We make use of the VGA arbiter around each access to configure chipset access as necessary. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2012-10-13UAPI: (Scripted) Disintegrate include/linuxDavid Howells1-0/+368
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>