is that IOVA allocations always have a gap in between, which produces a fault
on access. If a transfer to a given allocation runs further than expected,
we should be able to see it. We pre-allocate IOVA on bus DMA map creation,
and as long as we don't allocate a PTE descriptor, this comes at no cost.
We have plenty of address space anyway, so adding a page-sized gap does not
hurt at all and can only have positive effects.
Idea from kettenis@
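A minimal sketch of the idea in plain C, with made-up names rather than the
actual smmu(4) code: every IOVA allocation is padded with one page that
never receives a PTE, so an overrunning transfer faults.

    #include <stdint.h>
    #include <stddef.h>

    #define IOVA_PAGE_SIZE  4096UL
    #define IOVA_GUARD_SIZE IOVA_PAGE_SIZE  /* unmapped gap per allocation */

    struct iova_space {
            uint64_t next;  /* next free IOVA; a bump allocator for brevity */
            uint64_t end;
    };

    static uint64_t
    iova_alloc_guarded(struct iova_space *is, size_t len)
    {
            uint64_t iova = is->next;
            uint64_t size = (len + IOVA_PAGE_SIZE - 1) & ~(IOVA_PAGE_SIZE - 1);

            if (iova + size + IOVA_GUARD_SIZE > is->end)
                    return 0;       /* out of space */
            /* the guard page is skipped and never mapped: access faults */
            is->next = iova + size + IOVA_GUARD_SIZE;
            return iova;
    }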
|
|
|
|
|
|
|
|
pointer address. Not allowing this one to be allocated might help find
driver bugs where the device is programmed with a NULL pointer. We have
plenty of address space anyway, so excluding this single page does not
hurt at all and can only have positive effects.
Idea from kettenis@
|
|
|
|
|
|
|
Adjust the region managed by the extent accordingly, but avoid the first
and last page. The last page collides with the MSI address used by the
PCIe controller, and not using the first page helps find bugs.
ok patrick@
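The arithmetic is roughly this (variable names made up, not the committed
diff):

    /* Page 0 catches NULL pointers; the last page collides with the
     * MSI doorbell used by the PCIe controller, so keep both out of
     * the extent. */
    u_long start = dva_base + PAGE_SIZE;
    u_long end = dva_base + dva_len - PAGE_SIZE - 1;  /* inclusive end */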
|
|
|
|
|
|
|
|
The EBADF error is always overwritten for the standby, suspend and
hibernate ioctls, only the mode ioctl has it right.
Merge the now-identical cases while here.
OK patrick
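The shape of the fix, sketched from the description above (simplified
signature, and request_sleep() is a hypothetical helper):

    int
    apmioctl(u_long cmd, int flag)
    {
            int error = 0;

            switch (cmd) {
            /* the now-identical cases, merged */
            case APM_IOC_STANDBY:
            case APM_IOC_SUSPEND:
            case APM_IOC_HIBERNATE:
                    if ((flag & FWRITE) == 0) {
                            error = EBADF;
                            break;  /* don't run code that overwrites it */
                    }
                    error = request_sleep(cmd);
                    break;
            default:
                    error = ENOTTY;
                    break;
            }
            return error;
    }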
|
|
|
|
|
can be removed. The only thing left to implement for smmu(4) to work
out of the box with PCIe devices is to reserve the PCIe MMIO windows.
Let's see how we can do this properly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
typically pass the physical address, however retrieved, to our PCIe
controller code. This physical address can in practice be given directly
to the PCIe controller, but it is not a given that the CPU and the PCIe
controller are able to use the same physical addresses.
This is even more obvious with an smmu(4) in between, which can change
the world view by introducing I/O virtual addresses. Hence it is
necessary to map those pages, which, thanks to the integration with
bus_dma(9), works easily.
For this we remember the PCI device's DMA tag in the interrupt handle
during the MSI map, so that we can use the smmu(4)-hooked DMA tag to
load the physical address.
While some systems might prefer to implement "trapping" pages for MSIs,
to make sure devices cannot trigger other devices' interrupts, we only
make sure the whole page is mapped.
Having the IOMMU create a mapping for each MSI is a bit wasteful, but
for now it's the simplest way to implement it.
Discussed with and ok kettenis@
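Conceptually, with hypothetical names (iommu_map_msi_page() is not a real
function), the handle keeps the device's smmu(4)-hooked tag and the whole
doorbell page is mapped in that device's domain:

    struct msi_handle {
            bus_dma_tag_t   mh_dmat;        /* remembered during MSI map */
            bus_addr_t      mh_doorbell;    /* physical doorbell address */
    };

    static int
    msi_map_doorbell(struct msi_handle *mh)
    {
            /* map the entire page; no per-MSI "trapping" pages */
            paddr_t pa = trunc_page(mh->mh_doorbell);

            return iommu_map_msi_page(mh->mh_dmat, pa);
    }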
|
|
|
|
|
|
map them. This makes ACPI's call to acpi_iommu_device_map() do work
through acpiiort(4).
ok kettenis@
|
|
|
|
|
|
|
|
|
PCI attach args and replacing the DMA tag inside. Our other IOMMU API,
though, takes a DMA tag and returns the old one or a new one. To have
acpiiort(4) integrate better with non-PCI ACPI devices, change the API
so that it is more similar to the other API. This also makes the code
easier to understand.
ok kettenis@
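The rough shape of the change (signatures illustrative, not verbatim from
the tree):

    /* before: mutates the PCI attach args in place */
    void            acpiiort_smmu_hook(struct pci_attach_args *pa);

    /* after: like the other IOMMU API, hand in a DMA tag and get back
     * either the old one or a freshly hooked one */
    bus_dma_tag_t   acpiiort_device_map(struct aml_node *node,
                        bus_dma_tag_t dmat);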
|
|
|
|
ok patrick@
|
|
|
|
|
|
|
|
|
domain per pagetable, there's no need for a backpointer to the domain
in the pagetable entry descriptor. There can't be any other domain.
Also since there's no list, no list entry member is needed either.
This reduces early allocation to half of the previous size. I think
it's possible to reduce it even further and not need a pagetable entry
descriptor at all, but I need to think about that a bit more.
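Roughly, with a made-up layout for illustration: two members simply
disappear from the descriptor once a pagetable can only belong to one
domain.

    struct pted {
            struct pted             *pd_child[8];   /* the entries */
    /*      struct smmu_domain      *pd_domain;     backpointer: gone */
    /*      LIST_ENTRY(pted)         pd_list;       list entry: gone  */
    };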
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
of the IOVA allocation. As far as I can see the current "best solution"
is to cache IOVA ranges in percpu magazines. I don't think we have this
issue at all thanks to bus_dmamap_create(9). The map is created ahead
of time, and we know the maximum size of the DMA transfer. Since with
smmu(4) we have IOVA per domain, allocating IOVA 'early' is essentially
free. But pagetable mapping also incurs a performance penalty, since we
allocate pagetable entry descriptors through pools. Since we have the
IOVA early, we can allocate those early as well. This allocation is a
bit more expensive, but it can be optimized further.
All this means that there is no allocation overhead in hot code paths.
The "only" thing remaining is assigning IOVA to the segments, adjusting
the pagetable mappings, and flushing the IOTLB on unload. Maybe there's
a way to do a combined flush for NICs, because we give a list of mbufs
to the network stack and we could do the IOTLB invalidation only once
right before we hand over the mbuf list to the upper layers.
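As a sketch with hypothetical helpers, everything expensive moves to map
creation, leaving load/unload allocation-free:

    int
    domain_dmamap_create(struct domain *dom, bus_size_t maxsize,
        struct dmamap *map)
    {
            /* one IOVA space per domain: allocating here is essentially free */
            map->dm_iova = iova_alloc(dom, maxsize);
            if (map->dm_iova == 0)
                    return ENOMEM;
            /* pre-allocate PTE descriptors for the worst case, so the
             * hot path never hits the pool */
            return pted_prealloc(dom, map->dm_iova, maxsize);
    }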
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
obviously reduces the overhead of IOVA allocation, but instead you have the
problem of doubly mapped pages, and making sure a page is only unmapped once
the last user is gone. My initial attempt, modeled after apldart(4), calls
the allocator for each segment. Unfortunately this introduces a heavy
penalty, reducing throughput from around 700 Mbit/s to about 20 Mbit/s,
or even less, in a simple single-stream tcpbench scenario. Most mbufs from
userland seem to have at least 3 segments. Calculating the needed IOVA space
upfront reduces this penalty. IOVA allocation overhead could be reduced once
and for all if it is possible to reserve IOVA during bus_dmamap_create(9), as
it is only called at creation time, not for each DMA cycle. This
needs some more thought.
With this we now put the pressure on the PTED pools instead. Additionally, but
not part of this diff, percpu pools for the PTEDs seem to reduce the overhead
for that single-stream tcpbench scenario to 0.3%. Right now this means we're
hitting a different bottleneck, not related to the IOMMU. The next bottleneck
will be discovered once forwarding is unlocked. Though it should be possible
to benchmark the current implementation, and different designs, using a cycles
counter.
With IOVA allocation it's not easily possible to correlate memory passed to
bus_dmamem_map(9) with memory passed to bus_dmamap_load(9). So far my code
tries to use the same cacheability attributes as the kernel uses for its
userland mappings. For the devices we support, there seems to be no need so
far. If this ever gives us any trouble in the future, I'll have a look and
fix it.
While drivers should call bus_dmamap_unload(9) before bus_dmamap_destroy(9),
the API explicitly states that bus_dmamap_destroy(9) should unload the map
if it is still loaded. Hence we need to do exactly that. I actually have
found one network driver which behaves that way, and the developer intends
to change the network driver's behaviour.
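The upfront calculation looks roughly like this (iova_alloc() and
domain_map() are hypothetical); the point is one allocator call per load
instead of one per segment:

    bus_addr_t iova;
    bus_size_t off, len, resid = 0;
    int seg;

    /* size the whole transfer first */
    for (seg = 0; seg < map->dm_nsegs; seg++) {
            off = map->dm_segs[seg].ds_addr & PAGE_MASK;
            resid += round_page(off + map->dm_segs[seg].ds_len);
    }
    iova = iova_alloc(dom, resid);          /* single allocator call */

    /* then hand out IOVA to the segments and map them */
    for (seg = 0; seg < map->dm_nsegs; seg++) {
            off = map->dm_segs[seg].ds_addr & PAGE_MASK;
            len = round_page(off + map->dm_segs[seg].ds_len);
            domain_map(dom, iova, map->dm_segs[seg].ds_addr - off, len);
            map->dm_segs[seg].ds_addr = iova + off;
            iova += len;
    }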
|
|
|
|
|
|
or which regions need to be reserved. As it turns out, a region we should
not map is the PCIe address space. DMA from a PCIe device to an address
in the PCIe address space will obviously never make its way to the SMMU
and host memory. We'll probably have to add an API for that.
|
|
|
|
|
|
delay is awful in a hot path, and the SMMU is actually quite quick on
invalidation, so simply removing the delay is worth a thousand roses.
Found with mental support from dlg@ (and btrace)
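A sketch of the resulting invalidation path (register names as in the
SMMUv2 spec, code illustrative): issue the global TLB sync and spin on the
status bit instead of sleeping for a fixed time.

    smmu_gr0_write(sc, SMMU_STLBGSYNC, 0);
    for (i = 1000; i > 0; i--) {
            if ((smmu_gr0_read(sc, SMMU_STLBGSTATUS) &
                SMMU_STLBGSTATUS_GSACTIVE) == 0)
                    break;          /* sync complete */
            CPU_BUSY_CYCLE();       /* no delay(); the SMMU is quick */
    }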
|
|
|
there until we have a proper way of making the MSI pages available.
|
|
|
|
|
|
|
|
|
which is based on the IOMMU's. If you think about it, using the IOMMU's
DMA tag makes more sense because it is the IOMMU that does the actual DMA.
Noticed while debugging, since the SMMU's map function was called twice:
once for the PCI device, and once for its ppb(4). As the transaction has
the PCI device's Stream ID, not the ppb(4)'s, this would be useless work.
Suggested by kettenis@
|
|
|
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
regular ARM CPU MMU re-used for I/O devices. Implementations can have a
mix of stage-2 only and stage-1/stage-2 context banks (domains). The
IOMMU allows different ways of grouping devices into a single domain.
This implementation only supports SMMUv2, since there is basically
no relevant SMMUv1 hardware. It also only supports AArch64
pagetables, the same as our pmap. Hence lots of code was taken from
there. There is no support for 32-bit pagetables, which would have
also been needed for SMMUv1 support. I have not yet seen any
machines with SMMUv3, which will probably need a new driver.
There is some work to be done, but the code works and it's about
time it hits the tree.
ok kettenis@
|
|
|
|
|
contains information about which IOMMUs we have and how the devices are routed.
ok kettenis@
|
|
|
ok kettenis@
|
|
|
ok kettenis@
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
|
|
so that we can provide IOMMU-hooked bus DMA tags for each PCI device.
ok kettenis@
|
|
|
|
|
Apple M1 SoCs.
ok patrick@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The timecounter struct is large and I think it may change in the
future. Changing it later will be easier if we use C99-style
initialization for all timecounter structs. It also makes reading the
code a bit easier.
For reasons I cannot explain, switching to C99-style initialization
sometimes changes the hash of the resulting object file, even though
the resulting struct should be the same. So there is a binary change
here, but only sometimes. No behavior should change in either case.
I can't compile-test this everywhere but I have been staring at the
diff for days now and I'm relatively confident this will not break
compilation. Fingers crossed.
ok gnezdo@
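For a hypothetical driver the conversion looks like this (the field list
is illustrative of that era's struct timecounter):

    /* before: positional, fragile against struct changes */
    static struct timecounter mytimer_tc =
        { mytimer_get_timecount, NULL, 0xffffffff, 0, "mytimer", 0 };

    /* after: designated initializers; omitted members become zero and
     * the struct can grow without touching every driver */
    static struct timecounter mytimer_tc = {
            .tc_get_timecount = mytimer_get_timecount,
            .tc_counter_mask = 0xffffffff,
            .tc_frequency = 0,
            .tc_name = "mytimer",
            .tc_quality = 0,
    };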
|
|
|
|
|
This allows us to reboot the machine.
ok patrick@
|
|
|
|
|
|
|
since its interrupts seem to be hardwired to trigger an FIQ instead of an
IRQ. This means we need to manipulate both the F and the I bit in the
DAIF register when enabling and disabling interrupts.
ok patrick@
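In the daifset/daifclr immediate, bit 1 is I and bit 0 is F, so masking
and unmasking both bits looks like this sketch:

    static inline void
    intr_disable_all(void)
    {
            __asm volatile ("msr daifset, #3");     /* set I and F */
    }

    static inline void
    intr_enable_all(void)
    {
            __asm volatile ("msr daifclr, #3");     /* clear I and F */
    }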
|
|
|
|
|
|
|
posted and non-posted device memory mappings and set the right memory
attributes for them. Needed because on the Apple M1 using the wrong
mapping will fault.
ok patrick@, dlg@
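A sketch of the distinction (MAIR attribute encodings per the ARMv8 spec;
macro names made up): non-posted means no early write acknowledgement.

    #define MAIR_DEVICE_nGnRnE      0x00    /* non-posted writes */
    #define MAIR_DEVICE_nGnRE       0x04    /* posted writes allowed */

    /* pick the memory attribute based on the mapping flags */
    attr = posted ? MAIR_DEVICE_nGnRE : MAIR_DEVICE_nGnRnE;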
|
|
|
ok tb@, deraadt@
|
|
|
|
|
the generic IORT node information but also the Root Complex's attributes.
ok kettenis@
|
|
|
|
|
so that it can be used by more drivers.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
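The wrapper pattern, sketched (locking shown as the kernel lock, per the
note above; the in-tree code may differ in detail):

    void
    klist_insert(struct klist *klist, struct knote *kn)
    {
            KERNEL_LOCK();
            klist_insert_locked(klist, kn);
            KERNEL_UNLOCK();
    }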
|
|
|
ok patrick@
|
|
|
|
|
|
|
|
EOImode == 1, which we don't do. Hence there's no need to touch the
register at all. This allows OpenBSD to progress on ESXi-Arm. This
bug in ESXi-Arm will be fixed there as well.
Noticed by Jared McNeill
ok kettenis@
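In GICv2 terms (a sketch, accessor names hypothetical): with
GICC_CTLR.EOImode == 0 a single write to GICC_EOIR both drops priority and
deactivates the interrupt, so GICC_DIR never needs to be written.

    /* EOImode == 0: one write completes the interrupt */
    gicc_write(sc, GICC_EOIR, iar);
    /* no write to GICC_DIR; that register is only valid with EOImode == 1 */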
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
|
|
|
|
based on the ampintc(4) version, which already has support for it. For
now only do this for SPIs. SGIs are always edge-triggered and cannot be
modified, and PPIs can be both, but let's keep it safe for now.
ok kettenis@
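The trigger lives in the distributor's GICD_ICFGRn registers, two bits per
interrupt, with the odd bit of each pair selecting edge. A sketch, with
hypothetical accessors:

    /* 16 interrupts per GICD_ICFGR register, 2 config bits each */
    uint32_t reg = GICD_ICFGR(irq / 16);
    uint32_t bit = 1U << ((irq % 16) * 2 + 1);
    uint32_t val = gicd_read(sc, reg);

    if (edge)
            val |= bit;     /* edge-triggered */
    else
            val &= ~bit;    /* level-sensitive */
    gicd_write(sc, reg, val);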
|
|
|
|
from Matt Baulch
discussed with kettenis and drahn
|
|
|
OK deraadt@, mpi@
|
|
|
|
|
|
|
|
|
|
have stored the struct cpu_info * in the wrapper around the interrupt
handler cookie, but since we can have a few layers in between, this does
not seem very nice. Instead have each and every interrupt controller
provide a barrier function. This means that intr_barrier(9) will in the
end be executed by the interrupt controller that actually wired the pin
to a core. And that's the only place where the information is stored.
ok kettenis@
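With every controller providing a barrier method, intr_barrier(9) becomes
a forward through the wrapped cookie (member names modeled on the FDT
glue, illustrative):

    void
    intr_barrier(void *cookie)
    {
            struct machine_intr_handle *ih = cookie;

            /* the controller that wired the pin knows which core runs it */
            ih->ih_ic->ic_barrier(ih->ih_ih);
    }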
|
|
|
|
|
|
|
|
already assume every cookie is wrapped and simply retrieve the pointer
from it. It's a bit of a layer violation though, since only the intc
should actually store that kind of information. This is good enough for
now, but I'm already cooking up a diff to resolve this.
ok dlg@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
somehow gain access to the struct cpu_info * used to establish the
interrupt. One possibility is to store the pointer in the cookie
returned by the establish methods. A better way would be to ask
the interrupt controller directly to do the barrier.
This means that all external facing interrupt establish functions
need to wrap the cookie in a common way. We already do this for
FDT-based interrupts. Also most PCI controllers already return
the cookie from the FDT API, which is already wrapped. So arm64's
acpi_intr_establish() and acpipci(4) now need to explicitly wrap
it, since they call ic->ic_establish directly, which does not wrap.
ok dlg@
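The common wrapping, sketched (struct layout illustrative):

    struct machine_intr_handle {
            struct interrupt_controller     *ih_ic;
            void                            *ih_ih; /* the ic's cookie */
    };

    void *
    acpi_intr_wrap(struct interrupt_controller *ic, void *cookie)
    {
            struct machine_intr_handle *ih;

            ih = malloc(sizeof(*ih), M_DEVBUF, M_WAITOK);
            ih->ih_ic = ic;
            ih->ih_ih = cookie;     /* what ic->ic_establish() returned */
            return ih;
    }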
|
|
|
ok naddy@
|
|
|
|
|
|
|
|
the driver expected that it could find all CPUs referenced by the interrupt
controller. Since on non-MP kernels we only spin up one core, the driver
won't ever be able to find them. Relax the requirement for non-MP, since the
info extracted there is only needed for MP.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
supporting code was already there. The driver supports establishing multiple
handlers on the same pin. Make sure that a single pin can only be established
on a specific core by recording the struct cpu_info * of the first establish,
and returning NULL if someone tries to share the pin with a different core.
For LPIs, typically used for MSIs, the routing is done by targeting an LPI
to a specific "collection". We create a collection per core, indexing it by
cpu_number().
For this we need to know a CPU's "processor number", unless GITS_TYPER_PTA is
set. Since we now attach CPUs early, and the redistributors are not banked,
we can retrieve that information early on. It's important to move this as far
up as possible, as it's not as easy as on ampintc(4) to re-route LPIs.
To establish an LPI on a different core, we now only have to pass the CPU's
number as part of the map command which is sent to the hardware.
Prompted by dlg@
ok kettenis@
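The core of the scheme, sketched with hypothetical helpers: one ITS
collection per core, so routing an LPI is just naming the target core's
collection in the map command.

    /* at attach: collection n is mapped to core n's redistributor */
    for (n = 0; n < ncpus; n++)
            its_send_mapc(sc, n, sc->sc_rdbase[n]);

    /* at establish: route the LPI via the target core's collection */
    its_send_mapti(sc, devid, eventid, lpi, ci->ci_cpuid);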
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
supporting code was already there. The driver supports establishing multiple
handlers on the same pin. Make sure that a single pin can only be established
on a specific core by recording the struct cpu_info * of the first establish,
and returning NULL if someone tries to share the pin with a different core.
Since the array of CPU masks, used for enabling/disabling interrupt routing to
specific cores, is only populated during cpu_boot_secondary_processors(), each
core will re-route the interrupts once it has read its mask. Until then, the
core will not receive interrupts for that pin.
While there, remove a call to ampintc_setipl(), which seems to be a no-op. It
tries to set the same value that's already set. Since the function that calls
it is supposed to calculate a pin's mask and do the routing, this doesn't seem
to be the correct place for such a call. agintc(4) doesn't have it either.
Prompted by dlg@
ok kettenis@
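Both drivers now enforce the same rule in their establish path; sketched
(field names made up):

    /* the first establish pins the pin to a core; later handlers must
     * agree or get refused */
    if (pin->ip_ci == NULL)
            pin->ip_ci = ci;
    else if (pin->ip_ci != ci)
            return NULL;            /* no cross-core sharing */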