| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
is that IOVA allocations always have a gap in-between which produces a fault
on access. If a transfer to a given allocation runs further than expected
we should be able to see it. We pre-allocate IOVA on bus DMA map creation,
and as long as we don't allocate a PTE descriptor, this comes with no cost.
We have plenty of address space anyway, so adding a page-sized gap does not
hurt at all and can only have positive effects.
Idea from kettenis@
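A minimal sketch of the guard-gap idea, assuming an extent(9)-backed per-domain
IOVA map; the names (dom->sd_iovamap, maxsize) are illustrative, not the actual
smmu(4) code:

    /*
     * Reserve one extra page behind every IOVA allocation.  The guard
     * page never gets a PTE, so a transfer that runs past its buffer
     * faults instead of silently hitting the next mapping.
     */
    u_long iova;
    int error;

    error = extent_alloc(dom->sd_iovamap,
        round_page(maxsize) + PAGE_SIZE,        /* size + guard page */
        PAGE_SIZE, 0, 0, EX_NOWAIT, &iova);
    if (error)
        return (error);
    /* only [iova, iova + round_page(maxsize)) ever gets mapped */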
|
|
|
|
|
|
|
|
|
| |
pointer address. Not allowing this one to be allocated might help find
driver bugs, where the device is programmed with a NULL pointer. We have
plenty of address space anyway, so excluding this single page does not
hurt at all and can only have positive effects.
Idea from kettenis@
|
|
|
|
| |
ok drahn
|
|
|
|
|
|
|
|
| |
Adjust the region managed by the extent accordingly but avoid the first
and last page. The last page collides with the MSI address used by the
PCIe controller and not using the first page helps find bugs.
ok patrick@
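A hedged sketch of what skipping the first and last page can look like when
the extent is created; the variable names and the address-space bounds are
illustrative:

    /*
     * Manage the DVA window with extent(9), but keep page 0 out of it
     * (catches devices programmed with a NULL pointer) and drop the
     * last page, which collides with the MSI address used by the PCIe
     * controller.
     */
    sc->sc_dvamap = extent_create("dva", start + PAGE_SIZE,
        end - PAGE_SIZE, M_DEVBUF, NULL, 0, EX_WAITOK);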
|
|
|
|
|
|
| |
Per Intel SDM (Vol 3D, App. A.10) bit 0 should be read as a 1 if enabled.
From Adam Steen. ok mlarkin@
|
|
|
|
|
|
|
|
|
|
| |
Do this by clearing all the bits marked RES0 and setting all the bits
marked RES1 for ARMv8.0.
Any optional features introduced in later revisions of the architecture
(such as PAN) will be enabled after SCTLR_EL1 is initialized.
ok patrick@
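A hedged C-level model of the RES0/RES1 handling; the real change sits in the
early startup code, and the mask names below are invented placeholders for the
ARMv8.0 bit definitions:

    uint64_t sctlr = READ_SPECIALREG(sctlr_el1);

    sctlr &= ~SCTLR_RES0;    /* placeholder: all bits RES0 in ARMv8.0 */
    sctlr |= SCTLR_RES1;     /* placeholder: all bits RES1 in ARMv8.0 */
    WRITE_SPECIALREG(sctlr_el1, sctlr);
    __asm volatile("isb");
    /* optional features from later revisions (e.g. PAN) are enabled
     * separately, after this baseline value is in place */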
|
|
|
|
| |
ok patrick@
|
|
|
|
|
|
|
|
|
| |
Match what apm(4/macppc) says and make apmd(8) log an appropriate warning when
unsupported power actions are requested.
Merge identical cases while here.
This syncs with the apm ioctl handlers on loongson and arm64.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bootloader command functions must return zero in case of failure,
returning 1 tells the bootloader to boot the currently set kernel image.
"machine dtb" is the wrong way around, so using it triggers a boot.
Fix this and print a brief usage (like other commands such as "hexdump" do)
while here.
Feedback OK patrick
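A hedged sketch of the command-handler convention described above; the handler
name, the usage string and the cmd.argc check are illustrative, not the literal
"machine dtb" code:

    int
    Xdtb(void)
    {
        if (cmd.argc != 2) {
            printf("dtb file\n");
            return (0);    /* failure: stay at the boot> prompt */
        }
        /* ... load and register the device tree blob ... */
        return (0);        /* done; 1 would boot the current kernel */
    }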
|
|
|
|
|
|
|
|
|
| |
The EBADF error is always overwritten for the standby, suspend and
hibernate ioctls, only the mode ioctl has it right.
Merge the now identical cases while here.
OK patrick
|
|
|
|
| |
based on include-what-you-use suggestions
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
map was the wrong way around. The && prevented an EFAULT error and
could pass userland addresses as kernel source to copyout(9). The
kernel could crash with a protection fault due to an invalid offset
when reading /dev/kmem.
Also make the range checks stricter. Not only the start address
must be valid, but also the end address must be within the region
to be copied.
Note that sysctl kern.allowkmem=0 makes the bug unreachable by
default.
OK deraadt@
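An illustrative helper showing the stricter check (generic names, not the
literal /dev/kmem code): both the start and the end of the requested range
have to fall inside the permitted region, and the length must not wrap.

    static int
    range_ok(vaddr_t addr, size_t len, vaddr_t start, vaddr_t end)
    {
        /* reject wrap-around and anything outside [start, end) */
        return (addr >= start && addr + len >= addr && addr + len <= end);
    }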
|
|
|
|
|
|
|
| |
or padded, and hence e.g. the access to the PCI vendor/device id would be
broken. The structs for the other tables all seem to be packed as well.
ok kettenis@
|
|
|
|
|
|
| |
can be removed. The only thing left to implement for smmu(4) to work
out of the box with PCIe devices is to reserve the PCIe MMIO windows.
Let's see how we can do this properly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
typically pass the physical address, however retrieved, to our PCIe
controller code. This physical address can in practice be given directly
to the PCIe controller, but it is not a given that the CPU and the PCIe
controller are able to use the same physical addresses.
This is even more obvious with an smmu(4) in between, which can change
the world view by introducing I/O virtual addresses. Hence for this
it is indeed necessary to map those pages, which thanks to integration
with bus_dma(9) works easily.
For this we remember the PCI devices' DMA tag in the interrupt handle
during the MSI map, so that we can use the smmu(4)-hooked DMA tag to
load the physical address.
While some systems might prefer to implement "trapping" pages for MSIs,
to make sure devices cannot trigger other devices' interrupts, we only
make sure the whole page is mapped.
Having the IOMMU create a mapping for each MSI is a bit wasteful, but
for now it's the simplest way to implement it.
Discussed with and ok kettenis@
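A hedged sketch of loading the MSI doorbell page through the device's
(smmu(4)-hooked) DMA tag, so the address written into the device is valid in
its I/O address space; the variable names are illustrative and the actual
controller code may differ:

    bus_dma_segment_t seg;
    bus_dmamap_t map;
    bus_addr_t addr;

    seg.ds_addr = trunc_page(msi_doorbell_pa);  /* physical doorbell page */
    seg.ds_len = PAGE_SIZE;

    if (bus_dmamap_create(dmat, PAGE_SIZE, 1, PAGE_SIZE, 0,
        BUS_DMA_WAITOK, &map) == 0 &&
        bus_dmamap_load_raw(dmat, map, &seg, 1, PAGE_SIZE,
        BUS_DMA_WAITOK) == 0)
            /* what the device gets to see: IOVA (or plain PA without smmu) */
            addr = map->dm_segs[0].ds_addr + (msi_doorbell_pa & PAGE_MASK);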
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
efid_io() simpler. Also fixes the problem on some machines when booting
from CD-ROM. It happened because the previous version passed
unaligned pointers to the functions even though alignment is required by the
IoAlign property of the media. idea from kettenis, work with asou
ok kettenis
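A hedged sketch of bouncing a read through a buffer that honours the media's
IoAlign requirement (illustrative only; the real efid_io() is organised
differently):

    static EFI_STATUS
    read_aligned(EFI_BLOCK_IO *blkio, EFI_LBA lba, UINTN size, void *dst)
    {
        UINT32 align = blkio->Media->IoAlign ? blkio->Media->IoAlign : 1;
        UINT8 *buf, *p;
        EFI_STATUS status;

        if ((buf = alloc(size + align)) == NULL)
            return (EFI_OUT_OF_RESOURCES);
        /* round the bounce buffer up to the required alignment */
        p = (UINT8 *)(((unsigned long)buf + align - 1) &
            ~(unsigned long)(align - 1));
        status = blkio->ReadBlocks(blkio, blkio->Media->MediaId, lba,
            size, p);
        if (status == EFI_SUCCESS)
            memcpy(dst, p, size);
        free(buf, size + align);
        return (status);
    }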
|
|
|
|
| |
Same change made to arm64 a week ago.
|
|
|
|
|
|
| |
for framebuffer nodes under / and /chosen.
Same change made to arm64 last month.
|
|
|
|
| |
ok kettenis@
|
|
|
|
| |
okay deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Make sure we install a dummy page table in TTBR0_EL1 before we change
the size of the VA space in TCR_EL1.
- Flush the TLB after updating TCR_EL1.
- Flush TLB after installing the real kernel page table in TTBR1_EL1.
- Add some barriers around TLB flushes to make it consistent with
other places where we do TLB flushes.
ok drahn@, patrick@
|
|
|
|
|
|
|
| |
map them. This makes ACPI's call to acpi_iommu_device_map() do work
through acpiiort(4).
ok kettenis@
|
|
|
|
|
|
|
|
|
|
| |
PCI attach args and replacing the DMA tag inside. Our other IOMMU API
though takes a DMA tag and returns the old one or a new one. To have
acpiiort(4) integrate better with non-PCI ACPI devices, change the API
so that it is more similar to the other API. This also makes the code
easier to understand.
ok kettenis@
|
|
|
|
|
|
|
|
|
| |
is blessed with IOMMU magic, if available. This is mainly for arm64,
since on amd64 and i386 the IOMMU only captures PCIe devices, as far
as I know, which uses the pci_probe_device_hook(). This though is for
non-PCI devices attached through ACPI.
ok kettenis@
|
|
|
|
| |
of this file are only doing cpp #define
|
|
|
|
|
|
| |
as well.
ok drahn@, kn@
|
| |
|
|
|
|
| |
ok patrick@
|
| |
|
| |
|
|
|
|
|
| |
of code which thinks it could be done elsewhere.
ok kurt
|
|
|
|
| |
ok patrick@
|
| |
|
|
|
|
| |
From Thaison Nguyen
|
|
|
|
|
|
|
|
|
|
| |
already calculates _usable_ memory and updates physmem (if it is 0),
whereas ofw_read_mem_regions() was counting usable+unusable memory,
i.e. 4G or more on some machines. powerpc's 32-bit pagetable cannot use memory
beyond 4G phys addr.
(On a 4G machine, physmem64 was calculated as 0, which caused the installer's
auto-disklabel code to place /tmp on the b partition).
ok gkoehler, works for kurt also
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ASIDs. This should only happen on systems with 8-bit ASIDs, which are
currently unsupported in OpenBSD.
The new scheme uses "generations". Whenever we run out of ASIDs we bump
the generation and flush the complete TLB. The pmaps of processes that
are currently on the CPU are carried over into the new generation. This
implementation relies on the scheduler lock to make sure this happens
without any (known) races.
ok patrick@, mpi@
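A hedged, simplified model of the generation scheme (all names are invented
and flush_entire_tlb() is a placeholder; the real code lives in arm64's pmap
and relies on the scheduler lock for carrying over active pmaps):

    #include <stdint.h>

    #define NUM_ASID        (1U << 16)      /* 16-bit ASIDs */

    struct toy_pmap { uint32_t pm_asid, pm_gen; };

    extern void flush_entire_tlb(void);     /* stand-in for tlbi vmalle1is */

    static uint32_t cur_gen = 1, next_asid = 1;     /* ASID 0 stays reserved */

    static void
    asid_alloc(struct toy_pmap *pm)
    {
        if (next_asid == NUM_ASID) {
            cur_gen++;              /* out of ASIDs: start a new generation */
            next_asid = 1;
            flush_entire_tlb();
            /* pmaps currently running on a CPU are given a fresh ASID
             * in the new generation before they execute again */
        }
        pm->pm_asid = next_asid++;
        pm->pm_gen = cur_gen;
    }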
|
|
|
|
|
|
|
| |
lld11 no longer quietly aligns this when given an address, so we do the
alignment explicitly.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
| |
domain per pagetable, there's no need for a backpointer to the domain
in the pagetable entry descriptor. There can't be any other domain.
Also since there's no list, no list entry member is needed either.
This reduces early allocation to half of the previous size. I think
it's possible to reduce it even further and not need a pagetable entry
descriptor at all, but I need to think about that a bit more.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
of the IOVA allocation. As far as I can see the current "best solution"
is to cache IOVA ranges in percpu magazines. I don't think we have this
issue at all thanks to bus_dmamap_create(9). The map is created ahead
of time, and we know the maximum size of the DMA transfer. Since with
smmu(4) we have IOVA per domain, allocating IOVA 'early' is essentially
free. But pagetable mapping also incurs a performance penalty, since we
allocate pagetable entry descriptors through pools. Since we have the
IOVA early, we can allocate those early as well. This allocation is a
bit more expensive, but it can be optimized further.
All this means that there is no allocation overhead in hot code paths.
The "only" thing remaining is assigning IOVA to the segments, adjusting
the pagetable mappings, and flushing the IOTLB on unload. Maybe there's
a way to do a combined flush for NICs, because we give a list of mbufs
to the network stack and we could do the IOTLB invalidation only once
right before we hand over the mbuf list to the upper layers.
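A hedged sketch of the early descriptor allocation; the pool name and the
insert helper are invented, the point being that pool_get(9) happens at
bus_dmamap_create(9) time rather than per transfer:

    /* one pagetable entry descriptor per page of the reserved IOVA range */
    for (off = 0; off < len; off += PAGE_SIZE) {
        pted = pool_get(&sc->sc_pted_pool, PR_WAITOK | PR_ZERO);
        smmu_pted_insert(dom, iova + off, pted);    /* invented helper */
    }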
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
obviously reduces the overhead of IOVA allocation, but instead you have the
problem of doubly mapped pages, and making sure a page is only unmapped once
the last user is gone. My initial attempt, modeled after apldart(4), calls
the allocator for each segment. Unfortunately this introduces a performance
penalty which reduces performance from around 700 Mbit/s to about 20 Mbit/s,
or even less, in a simple single stream tcpbench scenario. Most mbufs from
userland seem to have at least 3 segments. Calculating the needed IOVA space
upfront reduces this penalty. IOVA allocation overhead could be reduced once
and for all if it is possible to reserve IOVA during bus_dmamap_create(9), as
it is only called upon creation and basically never for each DMA cycle. This
needs some more thought.
With this we now put the pressure on the PTED pools instead. Additionally, but
not part of this diff, percpu pools for the PTEDs seem to reduce the overhead
for that single stream tcpbench scenario to 0.3%. Right now this means we're
hitting a different bottleneck, not related to the IOMMU. The next bottleneck
will be discovered once forwarding is unlocked. Though it should be possible
to benchmark the current implementation, and different designs, using a cycles
counter.
With IOVA allocation it's not easily possible to correlate memory passed to
bus_dmamem_map(9) with memory passed to bus_dmamap_load(9). So far my code
tries to use the same cacheability attributes as the kernel uses for its userland
mappings. For the devices we support, there seems to be no need so far. If
this ever gives us any trouble in the future, I'll have a look and fix it.
While drivers should call bus_dmamap_unload(9) before bus_dmamap_destroy(9),
the API explicitly states that bus_dmamap_destroy(9) should unload the map
if it is still loaded. Hence we need to do exactly that. I actually have
found one network driver which behaves that way, and the developer intends
to change the network driver's behaviour.
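A hedged sketch of the upfront calculation (illustrative; dva_alloc() stands in
for whatever allocator ends up being used): sum the page-rounded sizes of all
segments first, do a single allocation, then carve per-segment addresses out of
that range.

    bus_size_t len = 0;
    u_long dva;
    int i;

    /* total IOVA needed for the whole transfer, computed upfront */
    for (i = 0; i < nsegs; i++)
        len += round_page(segs[i].ds_len + (segs[i].ds_addr & PAGE_MASK));

    if (dva_alloc(dom, len, &dva) != 0)     /* single allocator call per load */
        return (ENOMEM);
    /* per-segment addresses are then carved out of [dva, dva + len) */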
|
|
|
|
|
|
|
| |
or which regions need to be reserved. As it turns out, a region we should
not map is the PCIe address space. DMA from a PCIe device to an address
inside the PCIe address space will obviously never make its way to the SMMU
and host memory. We'll probably have to add an API for that.
|
|
|
|
|
|
|
|
|
|
|
| |
Thank you Apple (not)!
Add an initial attempt to support such systems. This isn't good enough
since the kernel will hang once you create more than 127 processes.
But it makes things work reasonably well until you reach that limit,
which is good enough to build things on the machine itself.
ok patrick@
|
|
|
|
| |
ok patrick@
|
| |
|