aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/vfio/vfio_iommu_spapr_tce.c
diff options
context:
space:
mode:
authorAlexey Kardashevskiy <aik@ozlabs.ru>2018-10-15 21:08:41 +1100
committerPaul Mackerras <paulus@ozlabs.org>2018-10-20 20:47:02 +1100
commit6e301a8e56e429d6b01d83d427a9e54fdbb0fa60 (patch)
tree921d4d8c82882d10ea6a415f42fc950c49606bfe /drivers/vfio/vfio_iommu_spapr_tce.c
parentKVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips (diff)
downloadlinux-dev-6e301a8e56e429d6b01d83d427a9e54fdbb0fa60.tar.xz
linux-dev-6e301a8e56e429d6b01d83d427a9e54fdbb0fa60.zip
KVM: PPC: Optimize clearing TCEs for sparse tables
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE table and a table with userspace addresses. These tables are radix trees, we allocate indirect levels when they are written to. Since the memory allocation is problematic in real mode, we have 2 accessors to the entries: - for virtual mode: it allocates the memory and it is always expected to return non-NULL; - fr real mode: it does not allocate and can return NULL. Also, DMA windows can span to up to 55 bits of the address space and since we never have this much RAM, such windows are sparse. However currently the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory. Since we maintain a userspace addresses table for VFIO which is a mirror of the hardware table, we can use it to know which parts of the DMA window have not been mapped and skip these so does this patch. The bare metal systems do not have this problem as they use a bypass mode of a PHB which maps RAM directly. This helps a lot with sparse DMA windows, reducing the shutdown time from about 3 minutes per 1 billion TCEs to a few seconds for 32GB sparse guest. Just skipping the last level seems to be good enough. As non-allocating accessor is used now in virtual mode as well, rename it from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Diffstat (limited to 'drivers/vfio/vfio_iommu_spapr_tce.c')
-rw-r--r--drivers/vfio/vfio_iommu_spapr_tce.c23
1 files changed, 21 insertions, 2 deletions
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 96721b154454..b30926e11d87 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -444,7 +444,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
struct mm_iommu_table_group_mem_t *mem = NULL;
int ret;
unsigned long hpa = 0;
- __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
if (!pua)
return;
@@ -467,8 +467,27 @@ static int tce_iommu_clear(struct tce_container *container,
unsigned long oldhpa;
long ret;
enum dma_data_direction direction;
+ unsigned long lastentry = entry + pages;
+
+ for ( ; entry < lastentry; ++entry) {
+ if (tbl->it_indirect_levels && tbl->it_userspace) {
+ /*
+ * For multilevel tables, we can take a shortcut here
+ * and skip some TCEs as we know that the userspace
+ * addresses cache is a mirror of the real TCE table
+ * and if it is missing some indirect levels, then
+ * the hardware table does not have them allocated
+ * either and therefore does not require updating.
+ */
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl,
+ entry);
+ if (!pua) {
+ /* align to level_size which is power of two */
+ entry |= tbl->it_level_size - 1;
+ continue;
+ }
+ }
- for ( ; pages; --pages, ++entry) {
cond_resched();
direction = DMA_NONE;