path: root/src/compat/udp_tunnel/include
Date  Commit message  (Author; files changed, lines -/+)
2022-06-28  compat: do not backport ktime_get_coarse_boottime_ns to c8s  (Jason A. Donenfeld; 1 file, -2/+2)
Also bump the c8s version stamp. Reported-by: Vladimír Beneš <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-27  version: bump (tag: v1.0.20220627)  (Jason A. Donenfeld; 2 files, -2/+2)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-22  compat: handle backported rng and blake2s  (Jason A. Donenfeld; 2 files, -6/+8)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05  qemu: give up on RHEL8 in CI  (Jason A. Donenfeld; 1 file, -6/+0)
They keep breaking their kernel and being difficult when I send patches to fix it, so just give up on trying to support this in the CI. It'll bitrot and people will complain and we'll see what happens at that point. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05  qemu: set panic_on_warn=1 from cmdline  (Jason A. Donenfeld; 14 files, -19/+13)
Rather than setting this once init is running, set panic_on_warn from the kernel command line, so that it catches splats from WireGuard initialization code and the various crypto selftests. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05  qemu: use vports on arm  (Jason A. Donenfeld; 5 files, -6/+25)
Rather than having to hack up QEMU, just use the virtio serial device. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05  netns: limit parallelism to $(nproc) tests at once  (Jason A. Donenfeld; 1 file, -10/+10)
The parallel tests were added to catch queueing issues from multiple cores. But what happens in reality when testing tons of processes is that these separate threads wind up fighting with the scheduler, and we wind up with contention in places we don't care about that decrease the chances of hitting a bug. So just do a test with the number of CPU cores, rather than trying to scale up arbitrarily. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05  netns: make routing loop test non-fatal  (Jason A. Donenfeld; 1 file, -1/+13)
I hate to do this, but I still do not have a good solution to actually fix this bug across architectures. So just disable it for now, so that the CI can still deliver actionable results. This commit adds a large red warning, so that at least the failure isn't lost forever, and hopefully this can be revisited down the line. Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/ Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/ Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-14  device: check for metadata_dst with skb_valid_dst()  (Nikolay Aleksandrov; 3 files, -1/+9)
When we try to transmit an skb with md_dst attached through wireguard we hit a null pointer dereference in wg_xmit() due to the use of dst_mtu() which calls into dst_blackhole_mtu() which in turn tries to dereference dst->dev. Since wireguard doesn't use md_dsts we should use skb_valid_dst(), which checks for DST_METADATA flag, and if it's set, then falls back to wireguard's device mtu. That gives us the best chance of transmitting the packet; otherwise if the blackhole netdev is used we'd get ETH_MIN_MTU.
[ 263.693506] BUG: kernel NULL pointer dereference, address: 00000000000000e0
[ 263.693908] #PF: supervisor read access in kernel mode
[ 263.694174] #PF: error_code(0x0000) - not-present page
[ 263.694424] PGD 0 P4D 0
[ 263.694653] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 263.694876] CPU: 5 PID: 951 Comm: mausezahn Kdump: loaded Not tainted 5.18.0-rc1+ #522
[ 263.695190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 263.695529] RIP: 0010:dst_blackhole_mtu+0x17/0x20
[ 263.695770] Code: 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 10 48 83 e0 fc 8b 40 04 85 c0 75 09 48 8b 07 <8b> 80 e0 00 00 00 c3 66 90 0f 1f 44 00 00 48 89 d7 be 01 00 00 00
[ 263.696339] RSP: 0018:ffffa4a4422fbb28 EFLAGS: 00010246
[ 263.696600] RAX: 0000000000000000 RBX: ffff8ac9c3553000 RCX: 0000000000000000
[ 263.696891] RDX: 0000000000000401 RSI: 00000000fffffe01 RDI: ffffc4a43fb48900
[ 263.697178] RBP: ffffa4a4422fbb90 R08: ffffffff9622635e R09: 0000000000000002
[ 263.697469] R10: ffffffff9b69a6c0 R11: ffffa4a4422fbd0c R12: ffff8ac9d18b1a00
[ 263.697766] R13: ffff8ac9d0ce1840 R14: ffff8ac9d18b1a00 R15: ffff8ac9c3553000
[ 263.698054] FS: 00007f3704c337c0(0000) GS:ffff8acaebf40000(0000) knlGS:0000000000000000
[ 263.698470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 263.698826] CR2: 00000000000000e0 CR3: 0000000117a5c000 CR4: 00000000000006e0
[ 263.699214] Call Trace:
[ 263.699505] <TASK>
[ 263.699759] wg_xmit+0x411/0x450
[ 263.700059] ? bpf_skb_set_tunnel_key+0x46/0x2d0
[ 263.700382] ? dev_queue_xmit_nit+0x31/0x2b0
[ 263.700719] dev_hard_start_xmit+0xd9/0x220
[ 263.701047] __dev_queue_xmit+0x8b9/0xd30
[ 263.701344] __bpf_redirect+0x1a4/0x380
[ 263.701664] __dev_queue_xmit+0x83b/0xd30
[ 263.701961] ? packet_parse_headers+0xb4/0xf0
[ 263.702275] packet_sendmsg+0x9a8/0x16a0
[ 263.702596] ? _raw_spin_unlock_irqrestore+0x23/0x40
[ 263.702933] sock_sendmsg+0x5e/0x60
[ 263.703239] __sys_sendto+0xf0/0x160
[ 263.703549] __x64_sys_sendto+0x20/0x30
[ 263.703853] do_syscall_64+0x3b/0x90
[ 263.704162] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 263.704494] RIP: 0033:0x7f3704d50506
[ 263.704789] Code: 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89
[ 263.705652] RSP: 002b:00007ffe954b0b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 263.706141] RAX: ffffffffffffffda RBX: 0000558bb259b490 RCX: 00007f3704d50506
[ 263.706544] RDX: 000000000000004a RSI: 0000558bb259b7b2 RDI: 0000000000000003
[ 263.706952] RBP: 0000000000000000 R08: 00007ffe954b0b90 R09: 0000000000000014
[ 263.707339] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe954b0b90
[ 263.707735] R13: 000000000000004a R14: 0000558bb259b7b2 R15: 0000000000000001
[ 263.708132] </TASK>
[ 263.708398] Modules linked in: bridge netconsole bonding [last unloaded: bridge]
[ 263.708942] CR2: 00000000000000e0
Link: https://github.com/cilium/cilium/issues/19428
Reported-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
[Jason: polyfilled for < 4.3]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
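For illustration, a minimal sketch of the guarded MTU lookup described above, assuming a wg_xmit()-like transmit context (names are illustrative, not the exact driver diff):

	#include <linux/netdevice.h>
	#include <net/dst.h>
	#include <net/dst_metadata.h>

	/* skb_valid_dst() is true only for a real routing dst, not a metadata
	 * dst whose dst->dev may be unset, so dst_mtu() is safe to call. */
	static unsigned int xmit_mtu(const struct sk_buff *skb, const struct net_device *dev)
	{
		if (skb_valid_dst(skb))
			return dst_mtu(skb_dst(skb));
		return dev->mtu; /* fall back to the device MTU for md_dst/blackhole cases */
	}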
2022-04-06  qemu: enable ACPI for SMP  (Jason A. Donenfeld; 2 files, -0/+2)
It turns out that by having CONFIG_ACPI=n, we've been failing to boot additional CPUs, and so these systems were functionally UP. The code bloat is unfortunate for build times, but I don't see an alternative. So this commit sets CONFIG_ACPI=y for x86_64 and i686 configs. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-06  socket: ignore v6 endpoints when ipv6 is disabled  (Jason A. Donenfeld; 1 file, -2/+2)
The previous commit fixed a memory leak on the send path in the event that IPv6 is disabled at compile time, but how did a packet even arrive there to begin with? It turns out we have previously allowed IPv6 endpoints even when IPv6 support is disabled at compile time. This is awkward and inconsistent. Instead, let's just ignore all things IPv6, the same way we do other malformed endpoints, in the case where IPv6 is disabled. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-06  socket: free skb in send6 when ipv6 is disabled  (Wang Hai; 1 file, -0/+1)
I got a memory leak report:
unreferenced object 0xffff8881191fc040 (size 232):
  comm "kworker/u17:0", pid 23193, jiffies 4295238848 (age 3464.870s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff814c3ef4>] slab_post_alloc_hook+0x84/0x3b0
    [<ffffffff814c8977>] kmem_cache_alloc_node+0x167/0x340
    [<ffffffff832974fb>] __alloc_skb+0x1db/0x200
    [<ffffffff82612b5d>] wg_socket_send_buffer_to_peer+0x3d/0xc0
    [<ffffffff8260e94a>] wg_packet_send_handshake_initiation+0xfa/0x110
    [<ffffffff8260ec81>] wg_packet_handshake_send_worker+0x21/0x30
    [<ffffffff8119c558>] process_one_work+0x2e8/0x770
    [<ffffffff8119ca2a>] worker_thread+0x4a/0x4b0
    [<ffffffff811a88e0>] kthread+0x120/0x160
    [<ffffffff8100242f>] ret_from_fork+0x1f/0x30
In wg_socket_send_buffer_as_reply_to_skb() and wg_socket_send_buffer_to_peer(), send6() is expected to free the skb, but when CONFIG_IPV6 is disabled the kfree_skb() call is missing. This patch adds it to fix the bug.
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
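A rough sketch of the shape of the fix, assuming a send6() stub along the lines of wireguard's when the kernel is built without IPv6 (signature abbreviated, names approximate):

	#include <linux/errno.h>
	#include <linux/skbuff.h>

	#if !IS_ENABLED(CONFIG_IPV6)
	/* Callers hand over ownership of the skb, so even this error-path
	 * stub must consume it. */
	static int send6(struct sk_buff *skb /* , ...endpoint arguments elided... */)
	{
		kfree_skb(skb);	/* the one-line fix: free the skb as callers expect */
		return -EAFNOSUPPORT;
	}
	#endif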
2022-03-03  qemu: simplify RNG seeding  (Jason A. Donenfeld; 1 file, -18/+8)
We don't actually need to write anything in the pool. Instead, we just force the total over 128, and we should be good to go for all old kernels. We also only need this on getrandom() kernels, which simplifies things too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-02  queueing: use CFI-safe ptr_ring cleanup function  (Jason A. Donenfeld; 3 files, -1/+17)
We make too nuanced use of ptr_ring to entirely move to the skb_array wrappers, but we at least should avoid the naughty function pointer cast when cleaning up skbs. Otherwise RAP/CFI will honk at us. This patch uses the __skb_array_destroy_skb wrapper for the cleanup, rather than directly providing kfree_skb, which is what other drivers in the same situation do too. Reported-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
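A minimal sketch of the CFI-safe pattern: rather than casting kfree_skb() to the void-pointer destructor type that ptr_ring_cleanup() expects, use the __skb_array_destroy_skb() wrapper from <linux/skb_array.h>, whose prototype already matches:

	#include <linux/ptr_ring.h>
	#include <linux/skb_array.h>

	static void free_skb_ring(struct ptr_ring *ring)
	{
		/* Not: ptr_ring_cleanup(ring, (void (*)(void *))kfree_skb);
		 * that mismatched function-pointer cast is what CFI/RAP rejects. */
		ptr_ring_cleanup(ring, __skb_array_destroy_skb);
	}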
2021-12-13  crypto: curve25519-x86_64: use in/out register constraints more precisely  (Jason A. Donenfeld; 1 file, -293/+504)
Rather than passing all variables as modified, pass ones that are only read into that parameter. This helps with old gcc versions when alternatives are additionally used, and lets gcc's codegen be a little bit more efficient. This also syncs up with the latest Vale/EverCrypt output. This also forward ports 3c9f3b6 ("crypto: curve25519-x86_64: solve register constraints with reserved registers"). Cc: Aymeric Fromherz <aymeric.fromherz@inria.fr> Cc: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/wireguard/1554725710.1290070.1639240504281.JavaMail.zimbra@inria.fr/ Link: https://github.com/project-everest/hacl-star/pull/501 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-13  compat: drop Ubuntu 14.04  (Jason A. Donenfeld; 1 file, -6/+4)
It's been over a year since we announced sunsetting this. Link: https://lore.kernel.org/wireguard/CAHmME9rckipsdZYW+LA=x6wCMybdFFA+VqoogFXnR=kHYiCteg@mail.gmail.com/T Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-08  version: bump (tag: v1.0.20211208)  (Jason A. Donenfeld; 2 files, -2/+2)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-06  crypto: curve25519-x86_64: solve register constraints with reserved registers  (Mathias Krause; 1 file, -4/+4)
The register constraints for the inline assembly in fsqr() and fsqr2() are pretty tight on what the compiler may assign to the remaining three register variables. The clobber list only allows the following to be used: RDI, RSI, RBP and R12. With RAP reserving R12 and a kernel having CONFIG_FRAME_POINTER=y, claiming RBP, there are only two registers left so the compiler rightfully complains about impossible constraints. Provide alternatives that'll allow a memory reference for 'out' to solve the allocation constraint dilemma for this configuration. Also make 'out' an input-only operand as it is only used as such. This not only allows gcc to optimize its usage further, but also works around older gcc versions, apparently failing to handle multiple alternatives correctly, as in failing to initialize the 'out' operand with its input value. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-06  compat: udp_tunnel: don't take reference to non-init namespace  (Jason A. Donenfeld; 1 file, -5/+7)
The comment to sk_change_net is instructive: Kernel sockets, f.e. rtnl or icmp_socket, are a part of a namespace. They should not hold a reference to a namespace in order to allow to stop it. Sockets after sk_change_net should be released using sk_release_kernel We weren't following these rules before, and were instead using __sock_create, which means we kept a reference to the namespace, which in turn meant that interfaces were not cleaned up on namespace exit. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
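As a rough illustration of the pattern that comment describes, on kernels old enough to still have sk_change_net()/sk_release_kernel() (this is an assumption about the compat code path, not its exact diff):

	#include <linux/net.h>
	#include <net/sock.h>

	/* Hypothetical sketch: create a kernel socket that does not pin the
	 * namespace, then associate it with the target netns. */
	static int make_netns_local_udp_sock(struct net *net, struct socket **sockp)
	{
		struct socket *sock;
		int err = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);

		if (err < 0)
			return err;
		sk_change_net(sock->sk, net); /* attach to netns without holding a reference */
		*sockp = sock;
		return 0;
		/* teardown later uses sk_release_kernel(sock->sk), not sock_release() */
	}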
2021-12-03  compat: siphash: use _unaligned version by default  (Arnd Bergmann; 2 files, -34/+28)
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS because the ordinary load/store instructions (ldr, ldrh, ldrb) can tolerate any misalignment of the memory address. However, load/store double and load/store multiple instructions (ldrd, ldm) may still only be used on memory addresses that are 32-bit aligned, and so we have to use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we may end up with a severe performance hit due to alignment traps that require fixups by the kernel. Testing shows that this currently happens with clang-13 but not gcc-11. In theory, any compiler version can produce this bug or other problems, as we are dealing with undefined behavior in C99 even on architectures that support this in hardware, see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363. Fortunately, the get_unaligned() accessors do the right thing: when building for ARMv6 or later, the compiler will emit unaligned accesses using the ordinary load/store instructions (but avoid the ones that require 32-bit alignment). When building for older ARM, those accessors will emit the appropriate sequence of ldrb/mov/orr instructions. And on architectures that can truly tolerate any kind of misalignment, the get_unaligned() accessors resolve to the leXX_to_cpup accessors that operate on aligned addresses. Since the compiler will in fact emit ldrd or ldm instructions when building this code for ARM v6 or later, the solution is to use the unaligned accessors unconditionally on architectures where this is known to be fast. The _aligned version of the hash function is however still needed to get the best performance on architectures that cannot do any unaligned access in hardware. This new version avoids the undefined behavior and should produce the fastest hash on all architectures we support. Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
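The accessor behavior can be illustrated with a tiny sketch (not the siphash source itself): get_unaligned_le64() compiles to a plain load on architectures that declare efficient unaligned access, and to a safe byte-wise sequence elsewhere, so a hash inner loop can use it unconditionally.

	#include <asm/unaligned.h>
	#include <linux/types.h>

	/* Load the next 64-bit little-endian word of the input, whatever its
	 * alignment. On ARMv6+ this becomes an ordinary ldr-based load (never
	 * ldrd/ldm, which still require 32-bit alignment); on older ARM it
	 * becomes an ldrb/orr sequence; on x86 it is a single mov. */
	static inline u64 next_word(const u8 *data)
	{
		return get_unaligned_le64(data);
	}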
2021-12-03  ratelimiter: use kvcalloc() instead of kvzalloc()  (Gustavo A. R. Silva; 2 files, -2/+24)
Use 2-factor argument form kvcalloc() instead of kvzalloc(). Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
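The change is mechanical; a sketch of the before/after shape, with a hypothetical element type and names:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/types.h>

	struct entry { u64 key; u64 value; };	/* hypothetical element type */

	static struct entry *alloc_table(size_t nr_entries)
	{
		/* Before: kvzalloc(nr_entries * sizeof(struct entry), GFP_KERNEL),
		 * where the multiplication can silently overflow.
		 * After: the 2-factor form zeroes the memory and fails cleanly if
		 * the multiplication would overflow. */
		return kvcalloc(nr_entries, sizeof(struct entry), GFP_KERNEL);
	}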
2021-12-03  receive: drop handshakes if queue lock is contended  (Jason A. Donenfeld; 1 file, -3/+13)
If we're being delivered packets from multiple CPUs so quickly that the ring lock is contended for CPU tries, then it's safe to assume that the queue is near capacity anyway, so just drop the packet rather than spinning. This helps deal with multicore DoS that can interfere with data path performance. It _still_ does not completely fix the issue, but it again chips away at it. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
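An illustrative sketch of the drop-on-contention idea (not the driver's exact code): if another CPU already holds the ring's producer lock, assume the queue is busy, report failure, and let the caller drop the handshake instead of spinning.

	#include <linux/ptr_ring.h>
	#include <linux/skbuff.h>

	static bool try_enqueue_handshake(struct ptr_ring *ring, struct sk_buff *skb)
	{
		bool queued;

		if (!spin_trylock_bh(&ring->producer_lock))
			return false;	/* contended: caller drops the packet */
		queued = __ptr_ring_produce(ring, skb) == 0;
		spin_unlock_bh(&ring->producer_lock);
		return queued;
	}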
2021-12-03  receive: use ring buffer for incoming handshakes  (Jason A. Donenfeld; 5 files, -43/+37)
Apparently the spinlock on incoming_handshake's skb_queue is highly contended, and a torrent of handshake or cookie packets can bring the data plane to its knees, simply by virtue of enqueueing the handshake packets to be processed asynchronously. So, we try switching this to a ring buffer to hopefully have less lock contention. This alleviates the problem somewhat, though it still isn't perfect, so future patches will have to improve this further. However, it at least doesn't completely diminish the data plane. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Reported-by: Joel Wanner <joel.wanner@inf.ethz.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03  device: reset peer src endpoint when netns exits  (Jason A. Donenfeld; 5 files, -2/+60)
Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Test-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
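Conceptually, the eager-release helper walks the per-cpu cache and drops each cached dst reference immediately, rather than waiting for the next lookup to notice the bumped reset timestamp. A rough sketch is below; note that the per-cpu struct is private to net/core/dst_cache.c, so the field names here are assumptions, not the actual API surface:

	#include <net/dst.h>
	#include <net/dst_cache.h>

	void dst_cache_reset_now(struct dst_cache *dst_cache)
	{
		int i;

		if (!dst_cache->cache)
			return;

		dst_cache->reset_ts = jiffies;	/* also invalidate lazily, as reset() does */
		for_each_possible_cpu(i) {
			struct dst_cache_pcpu *idst = per_cpu_ptr(dst_cache->cache, i);
			struct dst_entry *dst = idst->dst;

			idst->cookie = 0;
			idst->dst = NULL;
			dst_release(dst);	/* drop the netdev/netns reference right away */
		}
	}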
2021-12-03  main: rename 'mod_init' & 'mod_exit' functions to be module-specific  (Randy Dunlap; 1 file, -4/+4)
Rename module_init & module_exit functions that are named "mod_init" and "mod_exit" so that they are unique in both the System.map file and in initcall_debug output instead of showing up as almost anonymous "mod_init". This is helpful for debugging and in determining how long certain module_init calls take to execute. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
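The rename itself is a one-liner per function; a sketch of the pattern (the wg_-prefixed names here are illustrative):

	#include <linux/init.h>
	#include <linux/module.h>

	/* Before: every module doing this shows up as an indistinguishable
	 * "mod_init" in System.map and initcall_debug output:
	 *   static int __init mod_init(void) { ... }
	 *   module_init(mod_init);
	 */
	static int __init wg_mod_init(void)	/* module-specific name */
	{
		return 0;
	}

	static void __exit wg_mod_exit(void)
	{
	}

	module_init(wg_mod_init);
	module_exit(wg_mod_exit);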
2021-12-03  netns: actually test for routing loops  (Jason A. Donenfeld; 1 file, -1/+5)
We previously removed the restriction on looping to self, and then added a test to make sure the kernel didn't blow up during a routing loop. The kernel didn't blow up, thankfully, but on certain architectures where skb fragmentation is easier, such as ppc64, the skbs weren't actually being discarded after a few rounds through. But the test wasn't catching this. So actually test explicitly for massive increases in tx to see if we have a routing loop. Note that the actual loop problem will need to be addressed in a different commit. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03  compat: update for RHEL 8.5  (Peter Georg; 2 files, -4/+4)
RHEL 8.5 has been released. Replace all ISCENTOS8S checks with ISRHEL8. Increase RHEL_MINOR for CentOS 8 Stream detection to 6. Signed-off-by: Peter Georg <peter.georg@physik.uni-regensburg.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
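A hypothetical sketch of the kind of preprocessor detection compat.h performs (the real macro layout may differ): RHEL kernels define RHEL_MAJOR/RHEL_MINOR, and CentOS Stream 8 runs ahead of the newest released RHEL 8 minor, so a sufficiently high minor indicates Stream.

	#if defined(RHEL_MAJOR) && RHEL_MAJOR == 8
	#define ISRHEL8
	#if RHEL_MINOR >= 6		/* bumped from 5 now that RHEL 8.5 is released */
	#define ISCENTOS8S
	#endif
	#endif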
2021-08-08  compat: account for grsecurity backports and changes  (Mathias Krause; 2 files, -3/+9)
grsecurity kernels tend to carry additional backports and changes, like commit b60b87fc2996 ("netlink: add ethernet address policy types") or the SYM_FUNC_* changes. RAP nowadays hooks the latter, therefore no diversion to RAP_ENTRY is needed any more. Instead of relying on the kernel version test, also test for the macros we're about to define to not already be defined to account for these additional changes in the grsecurity patch without breaking compatibility to the older public ones. Also test for CONFIG_PAX instead of RAP_PLUGIN for the timer API related changes as these don't depend on the RAP plugin to be enabled but just a PaX/grsecurity patch to be applied. While there is no preprocessor knob for the latter, use CONFIG_PAX as this will likely be enabled in every kernel that uses the patch. Signed-off-by: Mathias Krause <minipli@grsecurity.net> [zx2c4: small changes to include a header nearby a macro def test] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-15  compat: account for latest c8s backports  (Jason A. Donenfeld; 1 file, -3/+3)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06  version: bump (tag: v1.0.20210606)  (Jason A. Donenfeld; 2 files, -2/+2)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06  qemu: increase default dmesg log size  (Jason A. Donenfeld; 1 file, -0/+1)
The selftests currently parse the kernel log at the end to track potential memory leaks. With these tests now reading off the end of the buffer, due to recent optimizations, some creation messages were lost, making the tests think that there was a free without an alloc. Fix this by increasing the kernel log size. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06  qemu: add disgusting hacks for RHEL 8  (Jason A. Donenfeld; 1 file, -1/+7)
Red Hat does awful things to their kernel for RHEL 8, such that it doesn't even compile in most configurations. This is utter craziness, and their response to me sending patches to fix this stuff has been to stonewall for months on end and then do nothing. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04  allowedips: add missing __rcu annotation to satisfy sparse  (Jason A. Donenfeld; 1 file, -1/+1)
A __rcu annotation got lost during refactoring, which caused sparse to become enraged. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04  allowedips: free empty intermediate nodes when removing single node  (Jason A. Donenfeld; 3 files, -131/+137)
When removing single nodes, it's possible that that node's parent is an empty intermediate node, in which case, it too should be removed. Otherwise the trie fills up and never is fully emptied, leading to gradual memory leaks over time for tries that are modified often. There was originally code to do this, but it was removed during refactoring in 2016 and never reworked. Now that we have proper parent pointers from the previous commits, we can implement this properly. In order to reduce branching and expensive comparisons, we want to keep the double pointer for parent assignment (which lets us easily chain up to the root), but we still need to actually get the parent's base address. So encode the bit number into the last two bits of the pointer, and pack and unpack it as needed. This is a little bit clumsy but is the fastest and least memory-wasteful of the compromises. Note that we align the root struct here to a minimum of 4, because it's embedded into a larger struct, and we're relying on having the bottom two bits for our flag, which would only be 16-bit aligned on m68k. The existing macro-based helpers were a bit unwieldy for adding the bit packing to, so this commit replaces them with safer and clearer ordinary functions. We add a test to the randomized/fuzzer part of the selftests, to free the randomized tries by-peer, refuzz it, and repeat, until it's supposed to be empty, and then see if that actually resulted in the whole thing being emptied. That combined with kmemcheck should hopefully make sure this commit is doing what it should. Along the way this resulted in various other cleanups of the tests and fixes for recent graphviz. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
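The packing trick described above can be sketched generically: with the node and root structs aligned to at least 4, the bottom two bits of a pointer to the parent's child slot are free to carry the child index. This is a simplified illustration, not the driver's actual helpers:

	#include <linux/rcupdate.h>
	#include <linux/types.h>

	struct trie_node {
		struct trie_node __rcu *bit[2];
		unsigned long parent_bit_packed;	/* parent's slot address | child index */
	} __aligned(4);

	/* Link 'node' into the parent slot '*slot' for child index 'bit', and
	 * remember where we are hooked in so removal can find it in O(1). */
	static void connect_node(struct trie_node __rcu **slot, u8 bit, struct trie_node *node)
	{
		node->parent_bit_packed = (unsigned long)slot | bit;
		rcu_assign_pointer(*slot, node);
	}

	/* Recover the parent's slot address and which child we are. */
	static struct trie_node __rcu **node_slot(struct trie_node *node, u8 *bit)
	{
		*bit = node->parent_bit_packed & 1;
		return (struct trie_node __rcu **)(node->parent_bit_packed & ~3UL);
	}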
2021-06-04  allowedips: allocate nodes in kmem_cache  (Jason A. Donenfeld; 3 files, -13/+38)
The previous commit moved from O(n) to O(1) for removal, but in the process introduced an additional pointer member to a struct that increased the size from 60 to 68 bytes, putting nodes in the 128-byte slab. With deployed systems having as many as 2 million nodes, this represents a significant doubling in memory usage (128 MiB -> 256 MiB). Fix this by using our own kmem_cache, that's sized exactly right. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
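A minimal sketch of the dedicated-cache pattern (struct and function names are hypothetical stand-ins): create a slab sized exactly for the node struct at module init, allocate and free nodes from it, and destroy it on exit.

	#include <linux/init.h>
	#include <linux/slab.h>
	#include <linux/types.h>

	struct trie_node_example { u8 bits[16]; u8 cidr, bit_at_a, bit_at_b; };	/* stand-in */

	static struct kmem_cache *node_cache;

	static int __init node_cache_init(void)
	{
		/* Objects come from a slab sized for this struct, so slabtop and
		 * /proc/slabinfo show the usage explicitly. */
		node_cache = KMEM_CACHE(trie_node_example, 0);
		return node_cache ? 0 : -ENOMEM;
	}

	static struct trie_node_example *node_alloc(void)
	{
		return kmem_cache_zalloc(node_cache, GFP_KERNEL);
	}

	static void node_free(struct trie_node_example *node)
	{
		kmem_cache_free(node_cache, node);
	}

	static void node_cache_exit(void)
	{
		kmem_cache_destroy(node_cache);
	}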
2021-06-04  allowedips: remove nodes in O(1)  (Jason A. Donenfeld; 2 files, -84/+57)
Previously, deleting peers would require traversing the entire trie in order to rebalance nodes and safely free them. This meant that removing 1000 peers from a trie with a half million nodes would take an extremely long time, during which we're holding the rtnl lock. Large-scale users were reporting 200ms latencies added to the networking stack as a whole every time their userspace software would queue up significant removals. That's a serious situation. This commit fixes that by maintaining a double pointer to the parent's bit pointer for each node, and then using the already existing node list belonging to each peer to go directly to the node, fix up its pointers, and free it with RCU. This means removal is O(1) instead of O(n), and we don't use gobs of stack. The removal algorithm has the same downside as the code that it fixes: it won't collapse needlessly long runs of fillers. We can enhance that in the future if it ever becomes a problem. This commit documents that limitation with a TODO comment in code, a small but meaningful improvement over the prior situation. Currently the biggest flaw, which the next commit addresses, is that this increases the node size on 64-bit machines from 60 bytes to 68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up using twice as much memory per node, because of power-of-two allocations, which is a big bummer. We'll need to figure something out there. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04  allowedips: initialize list head in selftest  (Jason A. Donenfeld; 1 file, -1/+2)
The randomized trie tests weren't initializing the dummy peer list head, resulting in a NULL pointer dereference when used. Fix this by initializing it in the randomized trie test, just like we do for the static unit test. While we're at it, all of the other strings like this have the word "self-test", so add it to the missing place here. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04  peer: allocate in kmem_cache  (Jason A. Donenfeld; 3 files, -4/+27)
With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02  global: use synchronize_net rather than synchronize_rcu  (Jason A. Donenfeld; 2 files, -4/+4)
Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
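For reference, synchronize_net() in the core kernel is roughly the following (simplified and shown here only for illustration): it picks the expedited RCU flavor when rtnl is held, so grace periods don't serialize behind configuration changes.

	#include <linux/rcupdate.h>
	#include <linux/rtnetlink.h>

	/* Roughly what net/core/dev.c does. */
	void synchronize_net(void)
	{
		might_sleep();
		if (rtnl_is_locked())
			synchronize_rcu_expedited();
		else
			synchronize_rcu();
	}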
2021-06-02  kbuild: do not use -O3  (Jason A. Donenfeld; 1 file, -3/+2)
Apparently, various versions of gcc have O3-related miscompiles. Looking at the difference between -O2 and -O3 for gcc 11 doesn't indicate miscompiles, but the difference also doesn't seem so significant for performance that it's worth risking. Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/ Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02  netns: make sure rp_filter is disabled on vethc  (Jason A. Donenfeld; 1 file, -0/+1)
Some distros may enable strict rp_filter by default, which will prevent vethc from receiving the packets with an unroutable reverse path address. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-24  version: bump (tag: v1.0.20210424)  (Jason A. Donenfeld; 2 files, -2/+2)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-23  Revert "compat: skb_mark_not_on_list will be backported to Ubuntu 18.04"  (Thadeu Lima de Souza Cascardo; 1 file, -1/+1)
This reverts commit cad80597c7947f0def83caf8cb56aff0149c83a8. Because this commit has not been backported so far, due to the implications of building Ubuntu's backport of wireguard in a timely manner. For now, reverting this fix would allow wireguard-linux-compat CI to work on Ubuntu 18.04. A different fix or the same one can be applied again when the time is right. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-22  compat: update and improve detection of CentOS Stream 8  (Peter Georg; 2 files, -2/+2)
CentOS Stream 8 by now (4.18.0-301.1.el8) reports RHEL_MINOR=5. The current RHEL 8 minor release is still 3. RHEL 8.4 is in beta. Replace equal comparison by greater equal to (hopefully) be a little bit more future proof. Signed-off-by: Peter Georg <peter.georg@physik.uni-regensburg.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-03-07  compat: icmp_ndo_send functions were backported extensively  (Jason A. Donenfeld; 1 file, -1/+1)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19  version: bump (tag: v1.0.20210219)  (Jason A. Donenfeld; 2 files, -2/+2)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19  qemu: bump default kernel version  (Jason A. Donenfeld; 1 file, -1/+1)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19  compat: zero out skb->cb before icmp  (Jason A. Donenfeld; 1 file, -4/+16)
This corresponds to the fancier upstream commit that's still on lkml, which passes a zeroed ip_options struct to __icmp_send. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
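A hypothetical compat-side sketch of the idea (the real compat code also has to save and restore surrounding state; this only shows the zeroing): kernels without __icmp_send(..., &opts) may interpret stale ip_options data living in skb->cb, so clear the control block before invoking icmp_send().

	#include <linux/skbuff.h>
	#include <linux/string.h>
	#include <net/icmp.h>

	static void icmp_send_with_clean_cb(struct sk_buff *skb_in, int type, int code, __be32 info)
	{
		memset(skb_in->cb, 0, sizeof(skb_in->cb));	/* don't let icmp_send() read stale options */
		icmp_send(skb_in, type, code, info);
	}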
2021-02-18  compat: skb_mark_not_on_list will be backported to Ubuntu 18.04  (Thadeu Lima de Souza Cascardo; 1 file, -1/+1)
linux commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ("net: use skb_list_del_init() to remove from RX sublists") will be backported to Ubuntu 18.04 default kernel, which is based on linux 4.15. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18  queueing: get rid of per-peer ring buffers  (Jason A. Donenfeld; 9 files, -93/+160)
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
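A condensed sketch of the enqueue side of such an intrusive MPSC queue, threading the list through skb->prev as described above (simplified; the stub-node initialization and the dequeue side are omitted, and names are illustrative):

	#include <linux/atomic.h>
	#include <linux/skbuff.h>

	#define PEER_QUEUE_NEXT(skb) ((skb)->prev)	/* reuse the otherwise-unused prev pointer */

	struct peer_queue_example {
		struct sk_buff *head;	/* written by many producers via xchg */
		struct sk_buff *tail;	/* read by the single consumer */
		struct sk_buff stub;	/* dummy node so the list is never empty */
		atomic_t count;
	};

	static bool peer_queue_enqueue(struct peer_queue_example *q, struct sk_buff *skb, int max)
	{
		/* Bound the queue without taking a lock. */
		if (!atomic_add_unless(&q->count, 1, max))
			return false;
		WRITE_ONCE(PEER_QUEUE_NEXT(skb), NULL);
		/* Swing head to the new node, then link the old head to it. Between
		 * these two steps a consumer can observe a broken chain and stop
		 * early; since the worker is requeued after every add, it resumes. */
		WRITE_ONCE(PEER_QUEUE_NEXT(xchg_release(&q->head, skb)), skb);
		return true;
	}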
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>