is that IOVA allocations always have a gap in between, which produces a fault
on access. If a transfer to a given allocation runs further than expected,
we should be able to see it. We pre-allocate IOVA on bus DMA map creation,
and as long as we don't allocate a PTE descriptor, this comes at no cost.
We have plenty of address space anyway, so adding a page-sized gap does not
hurt at all and can only have positive effects.
Idea from kettenis@
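A minimal sketch of the idea in plain C, with made-up names rather than the
actual smmu(4) code: every IOVA allocation is padded with one page that
never receives a PTE, so an overrunning transfer faults.

    #include <stdint.h>
    #include <stddef.h>

    #define IOVA_PAGE_SIZE  4096UL
    #define IOVA_GUARD_SIZE IOVA_PAGE_SIZE  /* unmapped gap per allocation */

    struct iova_space {
            uint64_t next;  /* next free IOVA; a bump allocator for brevity */
            uint64_t end;
    };

    static uint64_t
    iova_alloc_guarded(struct iova_space *is, size_t len)
    {
            uint64_t iova = is->next;
            uint64_t size = (len + IOVA_PAGE_SIZE - 1) & ~(IOVA_PAGE_SIZE - 1);

            if (iova + size + IOVA_GUARD_SIZE > is->end)
                    return 0;       /* out of space */
            /* the guard page is skipped and never mapped: access faults */
            is->next = iova + size + IOVA_GUARD_SIZE;
            return iova;
    }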
|
|
|
|
|
|
|
|
pointer address. Not allowing this one to be allocated might help find
driver bugs where the device is programmed with a NULL pointer. We have
plenty of address space anyway, so excluding this single page does not
hurt at all and can only have positive effects.
Idea from kettenis@
|
|
|
|
|
|
|
Adjust the region managed by the extent accordingly, but avoid the first
and last page. The last page collides with the MSI address used by the
PCIe controller, and not using the first page helps find bugs.
ok patrick@
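The arithmetic is roughly this (variable names made up, not the committed
diff):

    /* Page 0 catches NULL pointers; the last page collides with the
     * MSI doorbell used by the PCIe controller, so keep both out of
     * the extent. */
    u_long start = dva_base + PAGE_SIZE;
    u_long end = dva_base + dva_len - PAGE_SIZE - 1;  /* inclusive end */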
|
|
|
|
|
|
|
|
The EBADF error is always overwritten for the standby, suspend and
hibernate ioctls, only the mode ioctl has it right.
Merge the now-identical cases while here.
OK patrick
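The shape of the fix, sketched from the description above (simplified
signature, and request_sleep() is a hypothetical helper):

    int
    apmioctl(u_long cmd, int flag)
    {
            int error = 0;

            switch (cmd) {
            /* the now-identical cases, merged */
            case APM_IOC_STANDBY:
            case APM_IOC_SUSPEND:
            case APM_IOC_HIBERNATE:
                    if ((flag & FWRITE) == 0) {
                            error = EBADF;
                            break;  /* don't run code that overwrites it */
                    }
                    error = request_sleep(cmd);
                    break;
            default:
                    error = ENOTTY;
                    break;
            }
            return error;
    }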
|
|
|
|
|
can be removed. The only thing left to implement for smmu(4) to work
out of the box with PCIe devices is to reserve the PCIe MMIO windows.
Let's see how we can do this properly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
typically pass the physical address, however retrieved, to our PCIe
controller code. This physical address can in practice be given directly
to the PCIe controller, but it is not a given that the CPU and the PCIe
controller are able to use the same physical addresses.
This is even more obvious with an smmu(4) in between, which can change
the world view by introducing I/O virtual addresses. Hence it is
necessary to map those pages, which, thanks to the integration with
bus_dma(9), works easily.
For this we remember the PCI device's DMA tag in the interrupt handle
during the MSI map, so that we can use the smmu(4)-hooked DMA tag to
load the physical address.
While some systems might prefer to implement "trapping" pages for MSIs,
to make sure devices cannot trigger other devices' interrupts, we only
make sure the whole page is mapped.
Having the IOMMU create a mapping for each MSI is a bit wasteful, but
for now it's the simplest way to implement it.
Discussed with and ok kettenis@
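Conceptually, with hypothetical names (iommu_map_msi_page() is not a real
function), the handle keeps the device's smmu(4)-hooked tag and the whole
doorbell page is mapped in that device's domain:

    struct msi_handle {
            bus_dma_tag_t   mh_dmat;        /* remembered during MSI map */
            bus_addr_t      mh_doorbell;    /* physical doorbell address */
    };

    static int
    msi_map_doorbell(struct msi_handle *mh)
    {
            /* map the entire page; no per-MSI "trapping" pages */
            paddr_t pa = trunc_page(mh->mh_doorbell);

            return iommu_map_msi_page(mh->mh_dmat, pa);
    }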
|
|
|
|
|
|
map them. This makes ACPI's call to acpi_iommu_device_map() do work
through acpiiort(4).
ok kettenis@
|
|
|
|
|
|
|
|
|
PCI attach args and replacing the DMA tag inside. Our other IOMMU API,
though, takes a DMA tag and returns the old one or a new one. To have
acpiiort(4) integrate better with non-PCI ACPI devices, change the API
so that it is more similar to the other API. This also makes the code
easier to understand.
ok kettenis@
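The rough shape of the change (signatures illustrative, not verbatim from
the tree):

    /* before: mutates the PCI attach args in place */
    void            acpiiort_smmu_hook(struct pci_attach_args *pa);

    /* after: like the other IOMMU API, hand in a DMA tag and get back
     * either the old one or a freshly hooked one */
    bus_dma_tag_t   acpiiort_device_map(struct aml_node *node,
                        bus_dma_tag_t dmat);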
|
|
|
|
ok patrick@
|
|
|
|
|
|
|
|
|
domain per pagetable, there's no need for a backpointer to the domain
in the pagetable entry descriptor. There can't be any other domain.
Also since there's no list, no list entry member is needed either.
This reduces early allocation to half of the previous size. I think
it's possible to reduce it even further and not need a pagetable entry
descriptor at all, but I need to think about that a bit more.
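Roughly, with a made-up layout for illustration: two members simply
disappear from the descriptor once a pagetable can only belong to one
domain.

    struct pted {
            struct pted             *pd_child[8];   /* the entries */
    /*      struct smmu_domain      *pd_domain;     backpointer: gone */
    /*      LIST_ENTRY(pted)         pd_list;       list entry: gone  */
    };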
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
of the IOVA allocation. As far as I can see the current "best solution"
is to cache IOVA ranges in percpu magazines. I don't think we have this
issue at all thanks to bus_dmamap_create(9). The map is created ahead
of time, and we know the maximum size of the DMA transfer. Since with
smmu(4) we have IOVA per domain, allocating IOVA 'early' is essentially
free. But pagetable mapping also incurs a performance penalty, since we
allocate pagetable entry descriptors through pools. Since we have the
IOVA early, we can allocate those early as well. This allocation is a
bit more expensive, but it can be optimized further.
All this means that there is no allocation overhead in hot code paths.
The "only" thing remaining is assigning IOVA to the segments, adjusting
the pagetable mappings, and flushing the IOTLB on unload. Maybe there's
a way to do a combined flush for NICs, because we give a list of mbufs
to the network stack and we could do the IOTLB invalidation only once
right before we hand over the mbuf list to the upper layers.
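As a sketch with hypothetical helpers, everything expensive moves to map
creation, leaving load/unload allocation-free:

    int
    domain_dmamap_create(struct domain *dom, bus_size_t maxsize,
        struct dmamap *map)
    {
            /* one IOVA space per domain: allocating here is essentially free */
            map->dm_iova = iova_alloc(dom, maxsize);
            if (map->dm_iova == 0)
                    return ENOMEM;
            /* pre-allocate PTE descriptors for the worst case, so the
             * hot path never hits the pool */
            return pted_prealloc(dom, map->dm_iova, maxsize);
    }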
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
obviously reduces the overhead of IOVA allocation, but instead you have the
problem of doubly mapped pages, and making sure a page is only unmapped once
the last user is gone. My initial attempt, modeled after apldart(4), calls
the allocator for each segment. Unfortunately this introduces a heavy
penalty, reducing throughput from around 700 Mbit/s to about 20 Mbit/s,
or even less, in a simple single-stream tcpbench scenario. Most mbufs from
userland seem to have at least 3 segments. Calculating the needed IOVA space
upfront reduces this penalty. IOVA allocation overhead could be reduced once
and for all if it is possible to reserve IOVA during bus_dmamap_create(9), as
it is only called at creation time, not for each DMA cycle. This
needs some more thought.
With this we now put the pressure on the PTED pools instead. Additionally, but
not part of this diff, percpu pools for the PTEDs seem to reduce the overhead
for that single-stream tcpbench scenario to 0.3%. Right now this means we're
hitting a different bottleneck, not related to the IOMMU. The next bottleneck
will be discovered once forwarding is unlocked. Though it should be possible
to benchmark the current implementation, and different designs, using a cycles
counter.
With IOVA allocation it's not easily possible to correlate memory passed to
bus_dmamem_map(9) with memory passed to bus_dmamap_load(9). So far my code
tries to use the same cacheability attributes as the kernel uses for its
userland mappings. For the devices we support, there seems to be no need so
far. If this ever gives us any trouble in the future, I'll have a look and
fix it.
While drivers should call bus_dmamap_unload(9) before bus_dmamap_destroy(9),
the API explicitly states that bus_dmamap_destroy(9) should unload the map
if it is still loaded. Hence we need to do exactly that. I actually have
found one network driver which behaves that way, and the developer intends
to change the network driver's behaviour.
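The upfront calculation looks roughly like this (iova_alloc() and
domain_map() are hypothetical); the point is one allocator call per load
instead of one per segment:

    bus_addr_t iova;
    bus_size_t off, len, resid = 0;
    int seg;

    /* size the whole transfer first */
    for (seg = 0; seg < map->dm_nsegs; seg++) {
            off = map->dm_segs[seg].ds_addr & PAGE_MASK;
            resid += round_page(off + map->dm_segs[seg].ds_len);
    }
    iova = iova_alloc(dom, resid);          /* single allocator call */

    /* then hand out IOVA to the segments and map them */
    for (seg = 0; seg < map->dm_nsegs; seg++) {
            off = map->dm_segs[seg].ds_addr & PAGE_MASK;
            len = round_page(off + map->dm_segs[seg].ds_len);
            domain_map(dom, iova, map->dm_segs[seg].ds_addr - off, len);
            map->dm_segs[seg].ds_addr = iova + off;
            iova += len;
    }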
|
|
|
|
|
|
or which regions need to be reserved. As it turns out, a region we should
not map is the PCIe address space. DMA from a PCIe device to an address
in the PCIe address space will obviously never make its way to the SMMU
and host memory. We'll probably have to add an API for that.
|
|
|
|
|
|
delay is awful in a hot path, and the SMMU is actually quite quick on
invalidation, so simply removing the delay is worth a thousand roses.
Found with mental support from dlg@ (and btrace)
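A sketch of the resulting invalidation path (register names as in the
SMMUv2 spec, code illustrative): issue the global TLB sync and spin on the
status bit instead of sleeping for a fixed time.

    smmu_gr0_write(sc, SMMU_STLBGSYNC, 0);
    for (i = 1000; i > 0; i--) {
            if ((smmu_gr0_read(sc, SMMU_STLBGSTATUS) &
                SMMU_STLBGSTATUS_GSACTIVE) == 0)
                    break;          /* sync complete */
            CPU_BUSY_CYCLE();       /* no delay(); the SMMU is quick */
    }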
|
|
|
there until we have a proper way of making the MSI pages available.
|
|
|
|
|
|
|
|
|
which is based on the IOMMU's. If you think about it, using the IOMMU's
DMA tag makes more sense because it is the IOMMU that does the actual DMA.
Noticed while debugging, since the SMMU's map function was called twice:
once for the PCI device, and once for its ppb(4). As the transaction has
the PCI device's Stream ID, not the ppb(4)'s, this would be useless work.
Suggested by kettenis@
|
|
|
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
regular ARM CPU MMU re-used for I/O devices. Implementations can have a
mix of stage-2 only and stage-1/stage-2 context banks (domains). The
IOMMU allows different ways of grouping devices into a single domain.
This implementation only supports SMMUv2, since there is basically
no relevant SMMUv1 hardware. It also only supports AArch64
pagetables, the same as our pmap. Hence lots of code was taken from
there. There is no support for 32-bit pagetables, which would have
also been needed for SMMUv1 support. I have not yet seen any
machines with SMMUv3, which will probably need a new driver.
There is some work to be done, but the code works and it's about
time it hits the tree.
ok kettenis@
|
|
|
|
|
contains information about which IOMMUs we have and how the devices are routed.
ok kettenis@
|
|
|
ok kettenis@
|
|
|
ok kettenis@
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
|
|
so that we can provide IOMMU-hooked bus DMA tags for each PCI device.
ok kettenis@
|
|
|
|
|
Apple M1 SoCs.
ok patrick@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The timecounter struct is large and I think it may change in the
future. Changing it later will be easier if we use C99-style
initialization for all timecounter structs. It also makes reading the
code a bit easier.
For reasons I cannot explain, switching to C99-style initialization
sometimes changes the hash of the resulting object file, even though
the resulting struct should be the same. So there is a binary change
here, but only sometimes. No behavior should change in either case.
I can't compile-test this everywhere but I have been staring at the
diff for days now and I'm relatively confident this will not break
compilation. Fingers crossed.
ok gnezdo@
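For a hypothetical driver the conversion looks like this (the field list
is illustrative of that era's struct timecounter):

    /* before: positional, fragile against struct changes */
    static struct timecounter mytimer_tc =
        { mytimer_get_timecount, NULL, 0xffffffff, 0, "mytimer", 0 };

    /* after: designated initializers; omitted members become zero and
     * the struct can grow without touching every driver */
    static struct timecounter mytimer_tc = {
            .tc_get_timecount = mytimer_get_timecount,
            .tc_counter_mask = 0xffffffff,
            .tc_frequency = 0,
            .tc_name = "mytimer",
            .tc_quality = 0,
    };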
|
|
|
|
|
This allows us to reboot the machine.
ok patrick@
|
|
|
|
|
|
|
since its interrupts seem to be hardwired to trigger an FIQ instead of an
IRQ. This means we need to manipulate both the F and the I bit in the
DAIF register when enabling and disabling interrupts.
ok patrick@
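In the daifset/daifclr immediate, bit 1 is I and bit 0 is F, so masking
and unmasking both bits looks like this sketch:

    static inline void
    intr_disable_all(void)
    {
            __asm volatile ("msr daifset, #3");     /* set I and F */
    }

    static inline void
    intr_enable_all(void)
    {
            __asm volatile ("msr daifclr, #3");     /* clear I and F */
    }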
|
|
|
|
|
|
|
posted and non-posted device memory mappings and set the right memory
attributes for them. Needed because on the Apple M1 using the wrong
mapping will fault.
ok patrick@, dlg@
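A sketch of the distinction (MAIR attribute encodings per the ARMv8 spec;
macro names made up): non-posted means no early write acknowledgement.

    #define MAIR_DEVICE_nGnRnE      0x00    /* non-posted writes */
    #define MAIR_DEVICE_nGnRE       0x04    /* posted writes allowed */

    /* pick the memory attribute based on the mapping flags */
    attr = posted ? MAIR_DEVICE_nGnRE : MAIR_DEVICE_nGnRnE;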
|
|
|
ok tb@, deraadt@
|
|
|
|
|
the generic IORT node information but also the Root Complex's attributes.
ok kettenis@
|
|
|
|
|
so that it can be used by more drivers.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
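The wrapper pattern, sketched (locking shown as the kernel lock, per the
note above; the in-tree code may differ in detail):

    void
    klist_insert(struct klist *klist, struct knote *kn)
    {
            KERNEL_LOCK();
            klist_insert_locked(klist, kn);
            KERNEL_UNLOCK();
    }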
|
|
|
ok patrick@
|
|
|
|
|
|
|
|
EOImode == 1, which we don't do. Hence there's no need to touch the
register at all. This allows OpenBSD to progress on ESXi-Arm. This
bug in ESXi-Arm will be fixed there as well.
Noticed by Jared McNeill
ok kettenis@
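In GICv2 terms (a sketch, accessor names hypothetical): with
GICC_CTLR.EOImode == 0 a single write to GICC_EOIR both drops priority and
deactivates the interrupt, so GICC_DIR never needs to be written.

    /* EOImode == 0: one write completes the interrupt */
    gicc_write(sc, GICC_EOIR, iar);
    /* no write to GICC_DIR; that register is only valid with EOImode == 1 */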
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
ok patrick@
|
|
|
|
|
|
|
based on the ampintc(4) version, which already has support for it. For
now only do this for SPIs. SGIs are always edge-triggered and cannot be
modified, and PPIs can be both, but let's keep it safe for now.
ok kettenis@
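The trigger lives in the distributor's GICD_ICFGRn registers, two bits per
interrupt, with the odd bit of each pair selecting edge. A sketch, with
hypothetical accessors:

    /* 16 interrupts per GICD_ICFGR register, 2 config bits each */
    uint32_t reg = GICD_ICFGR(irq / 16);
    uint32_t bit = 1U << ((irq % 16) * 2 + 1);
    uint32_t val = gicd_read(sc, reg);

    if (edge)
            val |= bit;     /* edge-triggered */
    else
            val &= ~bit;    /* level-sensitive */
    gicd_write(sc, reg, val);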
|
|
|
|
from Matt Baulch
discussed with kettenis and drahn
|
|
|
OK deraadt@, mpi@
|
|
|
|
|
|
|
|
|
|
have stored the struct cpu_info * in the wrapper around the interrupt
handler cookie, but since we can have a few layers in between, this does
not seem very nice. Instead have each and every interrupt controller
provide a barrier function. This means that intr_barrier(9) will in the
end be executed by the interrupt controller that actually wired the pin
to a core. And that's the only place where the information is stored.
ok kettenis@
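With every controller providing a barrier method, intr_barrier(9) becomes
a forward through the wrapped cookie (member names modeled on the FDT
glue, illustrative):

    void
    intr_barrier(void *cookie)
    {
            struct machine_intr_handle *ih = cookie;

            /* the controller that wired the pin knows which core runs it */
            ih->ih_ic->ic_barrier(ih->ih_ih);
    }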
|
|
|
|
|
|
|
|
already assume every cookie is wrapped and simply retrieve the pointer
from it. It's a bit of a layer violation though, since only the intc
should actually store that kind of information. This is good enough for
now, but I'm already cooking up a diff to resolve this.
ok dlg@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
somehow gain access to the struct cpu_info * used to establish the
interrupt. One possibility is to store the pointer in the cookie
returned by the establish methods. A better way would be to ask
the interrupt controller directly to do the barrier.
This means that all external facing interrupt establish functions
need to wrap the cookie in a common way. We already do this for
FDT-based interrupts. Also most PCI controllers already return
the cookie from the FDT API, which is already wrapped. So arm64's
acpi_intr_establish() and acpipci(4) now need to explicitly wrap
it, since they call ic->ic_establish directly, which does not wrap.
ok dlg@
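The common wrapping, sketched (struct layout illustrative):

    struct machine_intr_handle {
            struct interrupt_controller     *ih_ic;
            void                            *ih_ih; /* the ic's cookie */
    };

    void *
    acpi_intr_wrap(struct interrupt_controller *ic, void *cookie)
    {
            struct machine_intr_handle *ih;

            ih = malloc(sizeof(*ih), M_DEVBUF, M_WAITOK);
            ih->ih_ic = ic;
            ih->ih_ih = cookie;     /* what ic->ic_establish() returned */
            return ih;
    }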
|
|
|
ok naddy@
|
|
|
|
|
|
|
|
the driver expected that it could find all CPUs referenced by the interrupt
controller. Since on non-MP kernels we only spin up one core, the driver
won't ever be able to find them. Relax the requirement for non-MP, since the
info extracted there is only needed for MP.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
supporting code was already there. The driver supports establishing multiple
handlers on the same pin. Make sure that a single pin can only be established
on a specific core by recording the struct cpu_info * of the first establish,
and returning NULL if someone tries to share the pin with a different core.
For LPIs, typically used for MSIs, the routing is done by targeting an LPI
to a specific "collection". We create a collection per core, indexing it by
cpu_number().
For this we need to know a CPU's "processor number", unless GITS_TYPER_PTA is
set. Since we now attach CPUs early, and the redistributors are not banked,
we can retrieve that information early on. It's important to move this as far
up as possible, as it's not as easy as on ampintc(4) to re-route LPIs.
To establish an LPI on a different core, we now only have to pass the CPU's
number as part of the map command which is sent to the hardware.
Prompted by dlg@
ok kettenis@
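The core of the scheme, sketched with hypothetical helpers: one ITS
collection per core, so routing an LPI is just naming the target core's
collection in the map command.

    /* at attach: collection n is mapped to core n's redistributor */
    for (n = 0; n < ncpus; n++)
            its_send_mapc(sc, n, sc->sc_rdbase[n]);

    /* at establish: route the LPI via the target core's collection */
    its_send_mapti(sc, devid, eventid, lpi, ci->ci_cpuid);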
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
supporting code was already there. The driver supports establishing multiple
handlers on the same pin. Make sure that a single pin can only be established
on a specific core by recording the struct cpu_info * of the first establish,
and returning NULL if someone tries to share the pin with a different core.
Since the array of CPU masks, used for enabling/disabling interrupt routing to
specific cores, is only populated during cpu_boot_secondary_processors(), each
core will re-route the interrupts once it has read its mask. Until then, the
core will not receive interrupts for that pin.
While there, remove a call to ampintc_setipl(), which seems to be a no-op. It
tries to set the same value that's already set. Since the function that calls
it is supposed to calculate a pin's mask and do the routing, this doesn't seem
to be the correct place for such a call. agintc(4) doesn't have it either.
Prompted by dlg@
ok kettenis@
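Both drivers now enforce the same rule in their establish path; sketched
(field names made up):

    /* the first establish pins the pin to a core; later handlers must
     * agree or get refused */
    if (pin->ip_ci == NULL)
            pin->ip_ci = ci;
    else if (pin->ip_ci != ci)
            return NULL;            /* no cross-core sharing */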