path: root/src/send.c

Commit log for src/send.c, most recent first. Each entry shows the commit subject, author, date, and the files/lines changed.
* queueing: get rid of per-peer ring buffers (Jason A. Donenfeld, 2021-02-18, 1 file, -20/+11)

  Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory.

  In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm.

  The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1].

  Performance-wise, ordinarily list-based queues aren't preferable to ring buffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check whether it was full with head==tail, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit.

  [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
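  A minimal userspace C sketch of Dmitry Vyukov's intrusive MPSC queue from [1], assuming C11 atomics; the names here are illustrative, and the in-tree version adapts the same idea around the unused skb->prev pointer rather than a dedicated node type.

      #include <stdatomic.h>
      #include <stddef.h>

      struct node {
          _Atomic(struct node *) next;
      };

      struct mpsc_queue {
          _Atomic(struct node *) head; /* producers swap themselves in here */
          struct node *tail;           /* consumer-private cursor */
          struct node stub;            /* keeps the list non-empty */
      };

      static void mpsc_init(struct mpsc_queue *q)
      {
          atomic_store(&q->stub.next, NULL);
          atomic_store(&q->head, &q->stub);
          q->tail = &q->stub;
      }

      /* Any number of producers may call this concurrently. The window between
       * the exchange and the final store is the tight, unlikely race described
       * above: a consumer can observe the new head before the link is written. */
      static void mpsc_push(struct mpsc_queue *q, struct node *n)
      {
          struct node *prev;

          atomic_store(&n->next, NULL);
          prev = atomic_exchange(&q->head, n);
          atomic_store(&prev->next, n);
      }

      /* Single consumer only. Returns NULL when empty or when it runs into a
       * half-finished push; the caller just tries again later (in WireGuard,
       * the requeued per-peer work item provides that retry). */
      static struct node *mpsc_pop(struct mpsc_queue *q)
      {
          struct node *tail = q->tail, *next = atomic_load(&tail->next);

          if (tail == &q->stub) {
              if (!next)
                  return NULL;
              q->tail = next;
              tail = next;
              next = atomic_load(&tail->next);
          }
          if (next) {
              q->tail = next;
              return tail;
          }
          if (tail != atomic_load(&q->head))
              return NULL; /* a producer is mid-push */
          mpsc_push(q, &q->stub); /* re-insert the stub so the tail can advance */
          next = atomic_load(&tail->next);
          if (!next)
              return NULL;
          q->tail = next;
          return tail;
      }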
* noise: separate receive counter from send counter (Jason A. Donenfeld, 2020-05-19, 1 file, -7/+5)

  In "queueing: preserve flow hash across packet scrubbing", we were required to slightly increase the size of the receive replay counter to something still fairly small, but an increase nonetheless. It turns out that we can recoup some of the additional memory overhead by splitting up the prior union type into two distinct types. Before, we used the same "noise_counter" union for both sending and receiving, with sending just using a simple atomic64_t, while receiving used the full replay counter checker. This meant that most of the memory being allocated for the sending counter was being wasted.

  Since the old "noise_counter" type increased in size in the prior commit, now is a good time to split up that union type into a distinct "noise_replay_counter" for receiving and a boring atomic64_t for sending, each using neither more nor less memory than required.

  Also, since sometimes the replay counter is accessed without necessitating additional accesses to the bitmap, we can reduce cache misses by hoisting the always-necessary lock above the bitmap in the struct layout.

  We also change a "noise_replay_counter" stack allocation to kmalloc in a -DDEBUG selftest so that KASAN doesn't trigger a stack frame warning.

  All in all, removing a bit of abstraction in this commit makes the code simpler and smaller, in addition to the motivating memory usage recuperation. For example, passing around raw "noise_symmetric_key" structs is something that really only makes sense within noise.c, in the one place where the sending and receiving keys can safely be thought of as the same type of object; subsequent to that, it's important that we uniformly access these through keypair->{sending,receiving}, where their distinct roles are always made explicit. So this patch allows us to draw that distinction clearly as well.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
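  A rough standalone C sketch of the before/after shape described above; the type names, field names, and window size are illustrative, not the actual noise.h definitions.

      #include <stdatomic.h>
      #include <stdint.h>

      #define REPLAY_BITS 8192 /* illustrative window size */

      /* Before: one union served both directions, so every send-side
       * counter was sized for the full replay machinery it never used. */
      union noise_counter_old {
          struct {
              uint64_t counter;
              /* (spinlock lives here in the kernel) */
              unsigned long backtrack[REPLAY_BITS / (8 * sizeof(unsigned long))];
          } receive;
          _Atomic uint64_t sending; /* all the send side ever needed */
      };

      /* After: each direction gets exactly what it needs; the lock sits
       * next to the counter, ahead of the bitmap, so the common "no bitmap
       * access needed" path touches fewer cache lines. */
      struct replay_counter_new {
          uint64_t counter;
          /* (spinlock lives here in the kernel) */
          unsigned long backtrack[REPLAY_BITS / (8 * sizeof(unsigned long))];
      };

      typedef _Atomic uint64_t send_counter_new;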
* queueing: preserve flow hash across packet scrubbing (Jason A. Donenfeld, 2020-05-19, 1 file, -1/+6)

  It's important that we clear most header fields during encapsulation and decapsulation, because the packet is substantially changed, and we don't want any info leak or logic bug due to an accidental correlation. But, for encapsulation, it's wrong to clear skb->hash, since it's used by fq_codel and flow dissection in general. Without it, classification does not proceed as usual. This change might make it easier to estimate the number of inner flows by examining clustering of out-of-order packets, but this shouldn't open up anything that can't already be inferred otherwise (e.g. syn packet size inference), and fq_codel can be disabled anyway.

  Furthermore, it might be the case that the hash isn't used or queried at all until after WireGuard transmits the encrypted UDP packet, which means skb->hash might still be zero at this point, and thus no hash taken over the inner packet data. In order to address this situation, we force a calculation of skb->hash before encrypting packet data.

  Of course this means that fq_codel might transmit packets slightly more out of order than usual. Toke did some testing on beefy machines with high quantities of parallel flows and found that increasing the replay-attack counter to 8192 takes care of the most pathological cases pretty well.

  Reported-by: Dave Taht <dave.taht@gmail.com>
  Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
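  A small kernel-style sketch of the second half of the change; the surrounding function is assumed rather than quoted, but skb_get_hash() is the in-tree helper that computes and caches the flow hash when it isn't already set.

      /* Force the flow hash to be computed over the inner packet before its
       * headers are encrypted away, so fq_codel and flow dissection still see
       * a meaningful skb->hash on the outer UDP packet. */
      static void stage_packet_sketch(struct sk_buff *skb)
      {
          skb_get_hash(skb); /* computes and stores skb->hash if unset */
          /* ... continue on to queueing and encryption as before ... */
      }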
* send/receive: use explicit unlikely branch instead of implicit coalescing (Jason A. Donenfeld, 2020-05-05, 1 file, -9/+6)

  It's very unlikely that send will become true. It's nearly always false between 0 and 120 seconds of a session, and in most cases becomes true only between 120 and 121 seconds before becoming false again. So, unlikely(send) is clearly the right option here.

  What happened before was that we had this complex boolean expression with multiple likely and unlikely clauses nested. Since this is evaluated left-to-right anyway, the whole thing got converted to unlikely. So, we can clean this up to better represent what's going on. The generated code is the same.

  Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
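  A hedged sketch of the shape of the change; the condition names below are placeholders, not the exact send.c expression.

      /* Before (schematically): hints scattered inside one big expression,
       * which the compiler collapsed into a single unlikely branch anyway. */
      if (likely(keypair_usable) &&
          (unlikely(too_many_messages) || unlikely(keypair_too_old)))
          wg_packet_send_queued_handshake_initiation(peer, false);

      /* After: compute the condition plainly, hint once on the rare result. */
      send = keypair_usable && (too_many_messages || keypair_too_old);
      if (unlikely(send))
          wg_packet_send_queued_handshake_initiation(peer, false);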
* send: cond_resched() when processing tx ringbuffers (Jason A. Donenfeld, 2020-05-04, 1 file, -0/+2)

  Users with pathological hardware reported CPU stalls on CONFIG_PREEMPT_VOLUNTARY=y, because the ringbuffers would stay full, meaning these workers would never terminate. That turned out not to be okay on systems without forced preemption. This commit adds a cond_resched() to the bottom of each loop iteration, so that these workers don't hog the core. We don't do this on encryption/decryption because the compat module here uses simd_relax, which already includes a call to schedule in preempt_enable.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
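  A sketch of the pattern with assumed names (the queue field and the helper are placeholders); the substance of the fix is just the cond_resched() at the bottom of the loop body.

      while ((skb = ptr_ring_consume_bh(&peer->tx_ring)) != NULL) {
          transmit_one_sketch(peer, skb); /* hypothetical per-packet work */
          cond_resched();                 /* yield so a constantly refilled
                                           * ring can't stall this core */
      }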
* send: use normaler alignment formula from upstream (Jason A. Donenfeld, 2020-03-18, 1 file, -1/+1)

  Slightly more meh, but upstream likes it better, and I'd rather minimize the delta between trees.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: cleanup skb padding calculation (Jason A. Donenfeld, 2020-02-14, 1 file, -6/+11)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: account for mtu=0 devices (Jason A. Donenfeld, 2020-02-14, 1 file, -1/+2)

  It turns out there's an easy way to get packets queued up while still having an MTU of zero, and that's via persistent keepalive. This commit makes sure that in whatever condition, we don't wind up dividing by zero. Note that an MTU of zero for a wireguard interface is something quasi-valid, so I don't think the correct fix is to limit it via min_mtu. This can be reproduced easily with:

      ip link add wg0 type wireguard
      ip link add wg1 type wireguard
      ip link set wg0 up mtu 0
      ip link set wg1 up
      wg set wg0 private-key <(wg genkey)
      wg set wg1 listen-port 1 private-key <(wg genkey) peer $(wg show wg0 public-key)
      wg set wg0 peer $(wg show wg1 public-key) persistent-keepalive 1 endpoint 127.0.0.1:1

  However, while min_mtu=0 seems fine, it makes sense to restrict the max_mtu. This commit also restricts the maximum MTU to the greatest number for which rounding up to the padding multiple won't overflow a signed integer. Packets this large were always rejected anyway eventually, due to checks deeper in, but it seems more sound not to even let the administrator configure something that won't work anyway.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
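  A standalone C sketch of the guarded padding rule this commit describes; the constant and function shape are simplified, not copied from send.c.

      #include <stdio.h>

      #define MESSAGE_PADDING_MULTIPLE 16
      #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

      /* Round the final chunk up to a multiple of 16 but never past the MTU,
       * and when the MTU is 0 (a keepalive on an mtu=0 device) skip the modulo
       * step entirely instead of dividing by zero. */
      static unsigned int calculate_padding(unsigned int len, unsigned int mtu)
      {
          unsigned int last_unit = len, padded;

          if (!mtu)
              return ALIGN_UP(last_unit, MESSAGE_PADDING_MULTIPLE) - last_unit;

          if (last_unit > mtu)
              last_unit %= mtu;

          padded = ALIGN_UP(last_unit, MESSAGE_PADDING_MULTIPLE);
          if (padded > mtu)
              padded = mtu;
          return padded - last_unit;
      }

      int main(void)
      {
          printf("%u\n", calculate_padding(0, 0));       /* keepalive, mtu 0 -> 0 */
          printf("%u\n", calculate_padding(100, 1420));  /* 100 -> 112, prints 12 */
          printf("%u\n", calculate_padding(1420, 1420)); /* full MTU, prints 0    */
          return 0;
      }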
* chacha20poly1305: port to sgmitter for 5.5 (Jason A. Donenfeld, 2019-12-05, 1 file, -3/+4)

  I'm not totally comfortable with these changes yet, and it'll require some more scrutiny. But it's a start.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* device: prepare skb_list_walk_safe for upstreaming (Jason A. Donenfeld, 2019-12-05, 1 file, -6/+2)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: use kfree_skb_list (Jason A. Donenfeld, 2019-12-05, 1 file, -9/+2)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: switch to coarse ktime (Jason A. Donenfeld, 2019-06-25, 1 file, -4/+4)

  Coarse ktime is broken until [1] in 5.2 and kernels without the backport, so we use fallback code there. The fallback code has also been improved significantly. It now only uses slower clocks on kernels < 3.17, at the expense of some accuracy we're not overly concerned about.

  [1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/

  Suggested-by: Arnd Bergmann <arnd@arndb.de>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: update copyright (Jason A. Donenfeld, 2019-01-07, 1 file, -1/+1)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: calculate inner checksums for all protocols (Andrejs Hanins, 2018-10-27, 1 file, -5/+4)

  I'm using a GRE tunnel (transparent Ethernet bridging flavor) over a WireGuard interface to be able to bridge L2 network segments. The typical protocol chain looks like this: IP->GRE->EthernetHeader->IP->UDP. UDP here is the packet sent from the L2 network segment which is tunneled using GRE over WireGuard. Indeed, there is a checksum inside the UDP header which is, as a rule, kept partially calculated while the packet travels through the network stack and outer protocols are added, until the packet reaches the WG device, which exposes the NETIF_F_HW_CSUM feature, meaning it can handle checksum offload for all protocols.

  But the problem here is that skb_checksum_setup, called from encrypt_packet, handles only TCP/UDP protocols under a top-level IP header; in my case there is a GRE protocol there, so skb_checksum_help is not called and the packet continues its life with an unfinished (broken) checksum and gets encrypted as-is. When such a packet is received by the other side and reaches the L2 network, it's seen there with a broken checksum inside the UDP header.

  The fact that WireGuard on the receiving side sets skb->ip_summed to CHECKSUM_UNNECESSARY partially mitigates the problem by telling the network stack on the receiving side that validation of the checksum is not necessary, so the local TCP stack, for example, works fine. But it doesn't help in situations when the packet needs to be forwarded further (sent out from the box). In this case there is no way we can tell the next hop that checksum verification for this packet is not necessary; we just send it out with a bad checksum and the packet gets dropped on the next-hop box.

  I think the issue with the original code was the wrong usage of skb_checksum_setup, simply because it's not needed in this case. Instead, we can just rely on the ip_summed skb field to see if a partial checksum needs to be finalized or not. Note that many other drivers in the kernel follow this approach.

  In summary:

  - skb_checksum_setup can only handle TCP/UDP protocols under a top-level IP header; packets with other protocols (like GRE) are sent out by WireGuard with unfinished partial checksums, which causes problems on the receiving side (bad checksums).

  - encrypt_packet gets an skb prepared by the network stack, so there is no need to set up the checksum from scratch; we just perform the hw checksum offload using the software helper skb_checksum_help for packets which explicitly require it, as denoted by CHECKSUM_PARTIAL.

  Signed-off-by: Andrejs Hanins <ahanins@gmail.com>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
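  A kernel-style sketch of the approach described above (assumed to mirror, not quote, the encrypt path): finalize whatever partial checksum the stack left behind instead of re-deriving it.

      /* skb_checksum_help() completes a CHECKSUM_PARTIAL checksum in software
       * no matter which headers sit above it (GRE, UDP, TCP, ...); packets
       * without a pending partial checksum are left untouched. */
      if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
          return false; /* couldn't finalize the checksum; drop the packet */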
* send: consider dropped stage packets to be dropped (Jason A. Donenfeld, 2018-10-27, 1 file, -0/+8)

  Suggested-by: Andrew Lunn <andrew@lunn.ch>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: do not allow compiler to reorder is_valid or is_dead (Jason A. Donenfeld, 2018-10-25, 1 file, -5/+6)

  Suggested-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
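  A sketch of the idiom, with assumed field names: READ_ONCE()/WRITE_ONCE() keep the compiler from caching, tearing, or reordering accesses to flags that other contexts poll.

      /* Publisher: mark the keypair's sending direction as no longer usable. */
      WRITE_ONCE(keypair->sending.is_valid, false);

      /* Reader, possibly racing on another CPU: re-load the flag every time. */
      if (unlikely(!READ_ONCE(keypair->sending.is_valid)))
          goto out_invalid;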
* global: give if statements brackets and other cleanups (Jason A. Donenfeld, 2018-10-09, 1 file, -2/+2)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: more nits (Jason A. Donenfeld, 2018-10-08, 1 file, -10/+11)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: rename struct wireguard_ to struct wg_ (Jason A. Donenfeld, 2018-10-08, 1 file, -18/+18)

  This required a bit of pruning of our christmas trees.

  Suggested-by: Jiri Pirko <jiri@resnulli.us>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: prefix functions used in callbacks with wg_ (Jason A. Donenfeld, 2018-10-08, 1 file, -10/+10)

  Suggested-by: Jiri Pirko <jiri@resnulli.us>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: style nits (Jason A. Donenfeld, 2018-10-07, 1 file, -2/+3)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: prefix all functions with wg_ (Jason A. Donenfeld, 2018-10-02, 1 file, -66/+67)

  I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream.

  Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping.

  Suggested-by: Andrew Lunn <andrew@lunn.ch>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: put SPDX identifier on its own line (Jason A. Donenfeld, 2018-09-20, 1 file, -2/+2)

  The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: pass simd by reference (Jason A. Donenfeld, 2018-09-17, 1 file, -5/+6)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: remove non-essential inline annotations (Jason A. Donenfeld, 2018-09-16, 1 file, -6/+5)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send/receive: reduce number of sg entries (Jason A. Donenfeld, 2018-09-16, 1 file, -1/+1)

  This reduces stack usage to quell warnings on powerpc.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: prefer sizeof(*pointer) when possible (Jason A. Donenfeld, 2018-09-04, 1 file, -10/+6)

  Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: import zinc (Jason A. Donenfeld, 2018-09-03, 1 file, -1/+1)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: run through clang-format (Jason A. Donenfeld, 2018-08-28, 1 file, -67/+136)

  This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: move simd context to specific type (Jason A. Donenfeld, 2018-08-06, 1 file, -6/+6)

  Suggested-by: Andy Lutomirski <luto@kernel.org>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: switch handshake stamp to an atomic (Jason A. Donenfeld, 2018-08-04, 1 file, -11/+9)

  Rather than abusing the handshake lock, we're much better off just using a boring atomic64 for this. It's simpler and performs better.

  Also, while we're at it, we set the handshake stamp both before and after the calculations, in case the calculations block for a really long time waiting for the RNG to initialize. Otherwise it's possible that when the RNG finally initializes, two handshakes are sent back to back, which isn't sensible.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
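  A kernel-style sketch of the rate-limit check this enables; the helper and field names match later trees, so treat them as assumptions rather than a quote of this commit.

      /* Rate-limit handshake initiations without taking the handshake lock. */
      if (!wg_birthdate_has_expired(atomic64_read(&peer->last_sent_handshake),
                                    REKEY_TIMEOUT))
          return; /* an initiation already went out recently enough */

      /* Stamp before the expensive crypto (and again once it completes), so a
       * slow RNG warm-up can't later cause two back-to-back handshakes. */
      atomic64_set(&peer->last_sent_handshake, ktime_get_coarse_boottime_ns());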
* peer: ensure destruction doesn't race (Jason A. Donenfeld, 2018-08-03, 1 file, -14/+20)

  Completely rework peer removal to ensure peers don't jump between contexts and create races.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* queueing: ensure strictly ordered loads and stores (Jason A. Donenfeld, 2018-08-02, 1 file, -1/+1)

  We don't want a consumer to read plaintext when it's supposed to be reading ciphertext, which means we need to synchronize across cores.

  Suggested-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: address of variable is never null (Jason A. Donenfeld, 2018-07-31, 1 file, -1/+1)

  Reported-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* peer: simplify rcu reference counts (Jason A. Donenfeld, 2018-07-31, 1 file, -2/+2)

  Use RCU reference counts only when we must, and otherwise use a more reasonably named function.

  Reported-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: use fast boottime instead of normal boottime (Jason A. Donenfeld, 2018-06-23, 1 file, -2/+2)

  Generally if we're inaccurate by a few nanoseconds, it doesn't matter.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: use ktime boottime instead of jiffies (Jason A. Donenfeld, 2018-06-23, 1 file, -7/+6)

  Since this is a network protocol, expirations need to be accounted for, even across system suspend. On real systems, this isn't a problem, since we're clearing all keys before suspend. But on Android, where we don't do that, this is something of a problem. So, we switch to using boottime instead of jiffies.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: fix a few typos (Jonathan Neuschäfer, 2018-06-22, 1 file, -1/+1)

  Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* simd: encapsulate fpu amortization into nice functions (Jason A. Donenfeld, 2018-06-17, 1 file, -7/+4)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* queueing: re-enable preemption periodically to lower latency (Jason A. Donenfeld, 2018-06-16, 1 file, -0/+6)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* queueing: remove useless spinlocks on sc (Jason A. Donenfeld, 2018-06-16, 1 file, -2/+0)

  Since these are the only consumers, there's no need for locking.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* timers: clear send_keepalive timer on sending handshake response (Jason A. Donenfeld, 2018-05-19, 1 file, -0/+3)

  We reorganize this so that the timer is also cleared when sending keepalives themselves, which makes the state machine much more consistent, even if this was already implied.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: simplify skb_padding with nice macro (Jason A. Donenfeld, 2018-04-16, 1 file, -4/+3)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* send: account for route-based MTU (Jason A. Donenfeld, 2018-04-15, 1 file, -3/+4)

  It might be that a particular route has a different MTU than the interface, via `ip route add ... dev wg0 mtu 1281`, for example. In this case, it's important that we don't accidentally pad beyond the end of the MTU. We accomplish that in this patch by carrying forward the MTU from the dst if it exists. We also add a unit test for this issue.

  Reported-by: Roman Mamedov <rm.wg@romanrm.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
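  A one-line kernel-style sketch of the idea (the exact call site and surrounding code are assumptions): prefer the route's MTU when the skb carries a dst entry, and fall back to the interface MTU otherwise.

      mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;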
* global: year bump (Jason A. Donenfeld, 2018-01-03, 1 file, -1/+1)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: add SPDX tags to all files (Greg Kroah-Hartman, 2017-12-09, 1 file, -1/+4)

  It's good to have SPDX identifiers in all files as the Linux kernel developers are working to add these identifiers to all files. Update all files with the correct SPDX license identifier based on the license text of the project or based on the license in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boilerplate text.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: style nits (Jason A. Donenfeld, 2017-10-31, 1 file, -6/+12)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: infuriating kernel iterator style (Jason A. Donenfeld, 2017-10-31, 1 file, -5/+5)

  One types:

      for (i = 0 ...

  So one should also type:

      for_each_obj (obj ...

  But the upstream kernel style guidelines are insane, and so we must instead do:

      for_each_obj(obj ...

  Ugly, but one must choose his battles wisely.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: accept decent check_patch.pl suggestions (Jason A. Donenfeld, 2017-10-31, 1 file, -3/+4)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* stats: more robust accounting (Jason A. Donenfeld, 2017-10-31, 1 file, -0/+1)

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>