fq_codel is done by copying the mbuf flow hash for the
final egress, but not at the encapsulation stage.
- There is no panic() condition while inserting `pxi' into the tree, so
drop the RBT_FIND() to avoid two lookups.
- Modify the text of the panic() message in the delete case.
ok yasuoka@ claudio@
For example the bridge_ioctl() function calls NET_UNLOCK() unconditionally
and so calling if_ioctl() without netlock will trigger an assert because
of not holding the netlock. Make sure the ioctl handlers are called with
the netlock held and drop the lock for the wg(4) specific ioctls in the
wg_ioctl handler. This fixes a panic in bridge_ioctl() triggered by
ifconfig(8) issuing a SIOCGWG ioctl against bridge(4).
This is just a workaround and needs more cleanup, but at least this way
the panic cannot be triggered anymore.
OK stsp@, tested by semarie@
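A sketch of the shape of the workaround (assuming the SIOCSWG/SIOCGWG
commands and the wg_ioctl_set()/wg_ioctl_get() helpers in if_wg.c;
error handling elided): the handler now runs with the netlock held and
releases it only around the wg(4) specific commands, which may sleep
on their own locks.

    int
    wg_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
    {
            struct wg_softc *sc = ifp->if_softc;
            int ret = 0;

            switch (cmd) {
            case SIOCSWG:
                    NET_UNLOCK();
                    ret = wg_ioctl_set(sc, (struct wg_data_io *)data);
                    NET_LOCK();
                    break;
            case SIOCGWG:
                    NET_UNLOCK();
                    ret = wg_ioctl_get(sc, (struct wg_data_io *)data);
                    NET_LOCK();
                    break;
            default:
                    /* the other commands are handled as before */
                    ret = ENOTTY;
                    break;
            }

            return (ret);
    }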
sessions by pipex_iface_fini() or by pipex_ioctl() with `PIPEXSMODE' command.
ok yasuoka@
livelock detection used to rely on code running at softnet blocking
the softclock handling at a lower interrupt priority level. if the
hard clock interrupt count diverged from one kept by a timeout, we
assumed the network stack was doing too much work and we should
apply backpressure to the reception of packets.
the network stack doesn't really block timeouts from firing anymore
though. this is especially true on MP systems, because timeouts
fire on cpu0 and the nettq thread could be somewhere else entirely.
this means network activity doesn't make the softclock lose ticks,
which means we aren't scaling rx ring activity like we think we
are.
the alternative way to detect livelock is when a driver queues
packets for the stack to process: if there are too many packets built
up, the input routine's return value tells the driver to slow
down. this enables finer grained livelock detection too. the rx
ring accounting is done per rx ring, and each rx ring is tied to a
specific nettq. if one of them is going too fast it shouldn't affect
the others. the tick based detection was done system wide and
punished all the drivers.
i've converted all the drivers to the new mechanism. let's see how
we go with it.
jmatthew@ confirms rings still shrink, so some backpressure is being
applied.
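in a driver the per-ring pattern looks roughly like this (a sketch
assuming the ifiq_input()/if_rxr_livelocked() interfaces; xx_softc,
sc_ac and sc_rx_ring are placeholder names):

    void
    xx_rxeof(struct xx_softc *sc)
    {
            struct ifnet *ifp = &sc->sc_ac.ac_if;
            struct mbuf_list ml = MBUF_LIST_INITIALIZER();

            /* ... dequeue completed packets from the rx ring into ml ... */

            /*
             * ifiq_input() returns non-zero when this ring's nettq
             * has too much queued already, so shrink the ring in
             * response.
             */
            if (ifiq_input(&ifp->if_rcv, &ml))
                    if_rxr_livelocked(&sc->sc_rx_ring);
    }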
thanks to Matt Dunwoodie and Jason A. Donenfeld for their effort.
it's at least as functional as the go implementation, and maybe
more so since this one works on more architectures.
i'm sure there's further development that can be done, but you can
say that about anything and everything that's in the tree.
ok deraadt@
i'm still not a fan of the peer semantics of wireguard interfaces
where each interface can have multiple peers and each peer has a
set of allowed ips configured, aka cryptokey routing. traditionally
we would use a tunnel (IFT_TUNNEL) style interface per peer, which
means there's a 1:1 mapping between a peer and an interface. in
turn that means you can apply policy with things like pf to the
interface and it implies policy on the peer.
so allowed ips inside a wg interface feels like a bandaid for a
self inflicted wound to some degree. however, deraadt@ points out
that the boat has sailed, and being compatible with the larger
ecosystem has benefits. admins can choose to setup an interface per
peer if they want to, so we get the best of both worlds.
i will admit an interface per peer sucks in a concentrator situation
though. that's why we still have pppac(4) as well as pppx(4). i
also don't have any better ideas for how to scale or even express
this kind of policy in a concentrator setting either.
apologies for the teary.
from Matt Dunwoodie and Jason A. Donenfeld
ok deraadt@
ix(4) wants an array of uint32_ts to push into 32bit registers.
this lets things calling bpf_mtap_hdr and related functions also
populate the extended bpf_hdr with the rcvif and prio and stuff.
the metadata is set if the mbuf is passed with an m_pkthdr, and
copies the mbuf's rcvif, priority, and flowid. it also carries the
direction of the packet.
it also makes bpf_hdr a multiple of 4 bytes, which simplifies some
calculations a bit. it also requires no changes in userland because
libpcap just thinks the extra bytes in the header are padding and
skips over them to the payload.
this helps me verify things like whether the stack and a network
card agree about toeplitz hashes, and paves the way for doing more
interesting packet captures. being able to see where a packet came
from as it is leaving a machine is very useful.
ok mpi@
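to illustrate why userland keeps working: libpcap finds the payload
via bh_hdrlen rather than sizeof(struct bpf_hdr), so fields appended
to the header just look like padding. a sketch of a consumer walking
a read buffer:

    #include <sys/types.h>
    #include <net/bpf.h>

    void
    walk(u_char *buf, size_t len)
    {
            u_char *p = buf;

            while (p < buf + len) {
                    struct bpf_hdr *bh = (struct bpf_hdr *)p;

                    /* payload starts at bh_hdrlen, however big the header is */
                    u_char *pkt = p + bh->bh_hdrlen;

                    /* ... consume bh->bh_caplen bytes at pkt ... */

                    p += BPF_WORDALIGN(bh->bh_hdrlen + bh->bh_caplen);
            }
    }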
listed below.
- pipex_init_session() to check request and alloc new session.
- pipex_link_session() to link session to pipex(4) layer.
- pipex_unlink_session() to unlink session from pipex(4) layer.
- pipex_rele_session() to release the session and its internal allocation.
ok mpi@
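A sketch of the intended lifecycle (the exact prototypes are
assumptions; see sys/net/pipex.c for the real ones):

    struct pipex_session *session;
    int error;

    /* check the request and allocate a new session */
    error = pipex_init_session(&session, &req);
    if (error != 0)
            return (error);

    /* make the session visible to the pipex(4) layer */
    pipex_link_session(session, ifp);

    /* ... the session passes traffic ... */

    /* detach it from the pipex(4) layer, then free it */
    pipex_unlink_session(session);
    pipex_rele_session(session);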
functions further.
ok dlg
xor laddr and faddr and the ports together and only then fold the
32 bits into 16 bits.
ok dlg
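The resulting shape, as a sketch (stoeplitz_hash_n16() is the helper
named in the entries below; exact prototypes may differ): because the
hash is linear over xor and the key columns repeat every 16 bits, all
the inputs can be xored first and folded exactly once.

    uint16_t
    hash_ip4port(const struct stoeplitz_cache *scache,
        uint32_t faddr, uint32_t laddr, uint16_t fport, uint16_t lport)
    {
            /* xor all the inputs together first... */
            uint32_t n32 = laddr ^ faddr ^ ((uint32_t)fport << 16 | lport);

            /* ...then fold 32 bits into 16 once: the key columns
             * repeat with period 16, so hashing n32 equals hashing
             * the xor of its halves */
            return (stoeplitz_hash_n16(scache, n32 ^ (n32 >> 16)));
    }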
ditch half of the calculations by merging the computation of hi and lo,
only splitting at the end. This allows us to leverage stoeplitz_hash_n16().
The name lo is now wrong. I kept it in order to avoid noise. I'm going to
clean this up in the next step.
ok dlg
multiplication H * val in stoeplitz_cache_entry(scache, val), so the
identity (H * x) ^ (H * y) == H * (x ^ y) allows us to push the calls to
the cache function down to the end of stoeplitz_hash_ip{4,6}{,port}().
The identity in question was also confirmed on amd64, sparc64 and powerpc
for all possible values of skey, x and y.
ok dlg
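The brute-force check is cheap since the cache has only 256 entries.
A sketch of the verification, with bytes[] standing in for
scache->bytes:

    #include <assert.h>
    #include <stdint.h>

    void
    check_linearity(const uint16_t bytes[256])
    {
            unsigned int x, y;

            /* H is linear over GF(2): (H * x) ^ (H * y) == H * (x ^ y) */
            for (x = 0; x < 256; x++)
                    for (y = 0; y < 256; y++)
                            assert((bytes[x] ^ bytes[y]) == bytes[x ^ y]);
    }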
i've been wanting to do this for a while, and now that we've got
stoeplitz and it gives us 16 bits, it seems like the right time.
stoeplitz_cache and bring them into a form more suitable for mathematical
reasoning. Add a comment explaining the full construction which will also
help justify upcoming diffs.
The observations for the code changes are the following:
First, scache->bytes[val] is a uint16_t, and we only need the lower
16 bits of res in the second nested pair of for loops. The values of
key[b] are only xored together to compute res, so we only need the lower
16 bits of those, too.
Second, looking at the first nested for loop, we see that the values 0..15
of j only touch the top 16 bits of key[b], so we can skip them. For b = 0,
the inner loop for j in 16..31 scans backwards through skey and sets the
corresponding bits of key[b], so key[0] = skey. A bit of pondering then
leads to key[b] = skey << b | skey >> (NBSK - b).
The key array is renamed into column since it stores columns of the
Toeplitz matrix.
It's not very expensive to brute-force verify that scache->bytes[val]
remains the same for all values of val and all values of skey. I did
this on amd64, sparc64 and powerpc.
ok dlg
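Putting the two observations together, the cache computation reduces
to roughly the following (a sketch; NBSK is the number of bits in the
16-bit key, and the bit order of val follows the usual MSB-first
Toeplitz convention):

    #include <stdint.h>

    #define NBSK 16

    void
    cache_init(uint16_t bytes[256], uint16_t skey)
    {
            uint16_t column[8];
            unsigned int b, val;

            /* column[b] is the b'th 16-bit column of the Toeplitz matrix */
            column[0] = skey;
            for (b = 1; b < 8; b++)
                    column[b] = skey << b | skey >> (NBSK - b);

            /* bytes[val] xors together the columns selected by val's bits */
            for (val = 0; val < 256; val++) {
                    uint16_t res = 0;

                    for (b = 0; b < 8; b++) {
                            if (val & (1 << (7 - b)))
                                    res ^= column[b];
                    }
                    bytes[val] = res;
            }
    }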
This is another bit of the puzzle for supporting multiple rx rings
and receive side scaling (RSS) on nics. It borrows heavily from
DragonflyBSD, but I've made some tweaks on the way.
The interesting bit that dfly came up with was a way to use Toeplitz
hashing so the kernel AND network interfaces hash packets so that
packets in both directions fall onto the same bucket. The other
interesting thing is that they optimised the hash calculation by
building a cache of all the intermediate results possible for each
input byte. Their hash calculation is simply xoring these
intermediate results together.
So this diff adds an API for the kernel to use for calculating a
hash for ip addresses and ports, and adds a function for network
drivers to call that gives them a key to use with RSS. If all drivers
use the same key, then the same flows should be steered to the same
place when they enter the network stack regardless of which hardware
they came in on.
The changes I made relative to dfly are around reducing the size
of the caches. DragonflyBSD builds a cache of 32bit values, but
because of how the Toeplitz key is constructed, the 32bits are made
up of a repeated 16bit pattern. We can just store the 16 bits and
reconstruct the 32 bits if we want. Both we and dragonfly only
use 15 or 16 bits of the result anyway, so 32 bits is unnecessary.
Secondly, the dfly implementation keeps a cache of values for the
high and low bytes of input, but the values in the two caches are
almost the same. You can byteswap the values in one of the byte
caches to get the values for the other, but you can also just
byteswap values at runtime to get the same value, which is what
this implementation does. The result of both these changes is that
the byte cache is a quarter of the size of the one in dragonflybsd.
tb@ has had a close look at this and found a bunch of other
optimisations that can be implemented, and because he's a
wizard^Wmathematician he has proofs (and also did tests).
ok tb@ jmatthew@
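The two size reductions can be sketched like this (hypothetical
helper names; the real code is in sys/net/toeplitz.{c,h}):

    #include <stdint.h>

    /* dfly's 32-bit cache entry is just the 16-bit one repeated,
     * because the key itself is a repeated 16-bit pattern */
    static inline uint32_t
    cache_entry32(const uint16_t bytes[256], uint8_t val)
    {
            uint16_t e = bytes[val];

            return ((uint32_t)e << 16 | e);
    }

    /* the high-byte cache is the low-byte cache byteswapped, so
     * swap at runtime instead of storing a second table */
    static inline uint16_t
    cache_entry_hi(const uint16_t bytes[256], uint8_t val)
    {
            uint16_t e = bytes[val];

            return (uint16_t)(e << 8 | e >> 8);
    }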
Reported by Prof. Dr. Steffen Wendzel <wendzel @ hs-worms . de>,
thanks!
OK martijn@ sthen@
couldn't find an entry if its table is attached to a table on the root.
This fixes the problem that "route-to <TABLE> least-states" doesn't work.
The problem is found by IIJ.
OK sashan
the intersection of the capabilities of the ports, allowing use of
vlan and checksum offloads if supported by all ports. Since this works
the same way as updating hardmtu, do them both at the same time.
ok dlg@
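A sketch of the recomputation (struct and field names assumed from
if_trunk.h):

    struct trunk_port *tp;
    uint32_t cap = ~0u;

    /* intersect the hardware capabilities of all member ports */
    SLIST_FOREACH(tp, &tr->tr_ports, tp_entries)
            cap &= tp->tp_if->if_capabilities;

    /* with no ports there is nothing to advertise */
    tr->tr_ac.ac_if.if_capabilities = (cap == ~0u) ? 0 : cap;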
pipex output is part of pppx and pppac if_start functions, so it
can't rely on or know if it already has NET_LOCK. this defers the
ip output stuff to where it can take the NET_LOCK reliably.
tested by Vitaliy Makkoveev, who also found that this was necessary
after ifq.c 1.38 and provided an excellent analysis of the problem.
ok mpi@
This helps in case of a context switch inside if_detach().
From Vitaliy Makkoveev.
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which is already included
by all enqueue_randomness() users.
Since our last concurrency mistake only the ioctl(2) and sysctl(2) code
paths take the reader lock. This is mostly for documentation purposes
until the softnet thread is converted back to using a read lock.
dlg@ said that comments should be good enough.
ok sashan@
Input bits of the mbuf list head to enqueue_randomness(). While the set
of mbufs in circulation is relatively stable, the order in which they
reach if_input_process() is unpredictable. Shuffling can happen in many
subsystems, such as the network stack, device drivers, and memory
management.
OK deraadt@ mpi@
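The call amounts to something like this in if_input_process() (a
sketch; the exact expression fed to enqueue_randomness() is an
assumption, see the actual diff):

    /* the address of the first mbuf to arrive is hard to predict */
    enqueue_randomness((unsigned long)ml->ml_head);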
From Vitaliy Makkoveev.
Premature locking is causing more trouble than it is solving issues. In this
case the lifetime of descriptors is protected by the KERNEL_LOCK(), so using
a rwlock for the lookup introduces sleeping points and possible new races
without benefit.
From Vitaliy Makkoveev.
if we use the ifq to move packet processing to another context,
it's too easy to fill up the one slot and cause packet loss.
the ifq len was set to 1 to avoid delays produced by the original
implementation of tx mitigation. however, trunk now introduces
latency because it isn't mpsafe yet, which causes the network stack
to have to take the kernel lock for each packet, and the kernel
lock can be quite contended. i want to use the ifq to move the
packet to the systq thread (which already has the kernel lock)
before trunk is asked to transmit it.
tested by mark patruck and myself.
mark patruck found significant packet drops with trunk(4), and there
are some reports that pppx or pipex relies on some implicit locking
that it shouldn't.
i can fix those without this diff being in the tree.
From Sergey Ryazanov.
Reviewed by Vitaliy Makkoveev, ok claudio@
this reuses the tx mitigation machinery, but instead of deferring
some start calls to the nettq, it defers all calls to the systq.
this is to avoid taking the KERNEL_LOCK while processing packets
in the stack.
i've been running this in production for 6 or so months, and the
start of a release is a good time to get more people trying it too.
ok jmatthew@
If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.
This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.
This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.
With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.
Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.
With insight from dlg@ regarding the purpose of the read timeout.
ok dlg@
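With the state split, the read path can treat the two independently;
a rough sketch of bpfread() when no buffer is ready (bd_rtout is
still in ticks at this point):

    if (d->bd_rnonblock) {
            /* non-blocking descriptor: fail immediately */
            error = EWOULDBLOCK;
    } else {
            /* block; a non-zero bd_rtout still bounds the sleep */
            error = tsleep(d, PRINET | PCATCH, "bpf", d->bd_rtout);
    }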
descriptors runs below the low watermark.
The em(4) firmware seems not to work properly with just a few descriptors in
the receive ring. Thus, we use the low water mark as an indicator instead of
zero descriptors, which would cause deadlocks.
ok kettenis@
encryption or decryption. This allows us to keep plaintext and encrypted
network traffic separated and reduces the attack surface for network
side-channel attacks.
The only way to reach the inner rdomain from outside is by successful
decryption and integrity verification through the responsible Security
Association (SA).
The only way for internal traffic to get out is getting encrypted and
moved through the outgoing SA.
Multiple plaintext rdomains can share the same encrypted rdomain while
the unencrypted packets are still kept separate.
The encrypted and unencrypted rdomains can have different default routes.
The rdomains can be configured with the new SADB_X_EXT_RDOMAIN pfkey
extension. Each SA (tdb) gets a new attribute 'tdb_rdomain_post'.
If this differs from 'tdb_rdomain' then the packet is moved to
'tdb_rdomain_post' after IPsec processing.
Flows and outgoing IPsec SAs are installed in the plaintext rdomain,
incoming IPsec SAs are installed in the encrypted rdomain.
IPCOMP SAs are always installed in the plaintext rdomain.
They can be viewed with 'route -T X exec ipsecctl -sa' where X is the
rdomain ID.
As the kernel does not create encX devices automatically when creating
rdomains, they have to be added by hand with ifconfig for IPsec to work
in non-default rdomains.
discussed with chris@ and kn@
ok markus@, patrick@
ignored. Initialize 'error' to 0.
CID 1483380
ok mpi@
Reported-by: syzbot+d0639632a0affe0a690e@syzkaller.appspotmail.com
Reported-by: syzbot+ae5e359d7f82688edd6a@syzkaller.appspotmail.com
OK anton@
when using pppac without pipex.
ok dlg
address as the one trying to be inserted.
Such an entry must stay in the table as long as its parent route exists. If
a code path tries to re-insert a route with the same destination address
on the same interface it is a bug.
Avoid the "route contains no arp information" problem reported by sthen@
and Laurent Salle.
ok claudio@
Prevent concurrency in the socket layer which is not ready for that.
Two recent data corruptions in pfsync(4) and the socket layer pointed
out that, at least, tun(4) was incorrectly using NET_RUNLOCK(). Until
we find a way in software to avoid future mistakes and to make sure that
only the softnet thread and some ioctls are safe to use a read version
of the lock, put everything back to the exclusive version.
ok stsp@, visa@
it needs NET_LOCK because it modifies if_flags and if_pcount.
ok visa@
if_pcount is only touched in ifpromisc(), and ifpromisc() needs
NET_LOCK anyway because it also modifies if_flags.
suggested by mpi@
ok visa@