Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Kernels below 5.12 are missing this:
    commit 98184612aca0a9ee42b8eb0262a49900ee9eef0d
    Author: Norman Maurer <norman_maurer@apple.com>
    Date:   Thu Apr 1 08:59:17 2021

        net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...);

        Support for UDP_GRO was added in the past but the implementation for
        getsockopt was missed, which did lead to an error when we tried to
        retrieve the setting for UDP_GRO. This patch adds the missing switch
        case for UDP_GRO.

        Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
        Signed-off-by: Norman Maurer <norman_maurer@apple.com>
        Reviewed-by: David Ahern <dsahern@kernel.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>
That means we can't set the option and then read it back later. Given
how buggy UDP_GRO is in general on odd kernels, just disable it on older
kernels altogether.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
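A minimal sketch of the set-then-read-back probe this implies, assuming
golang.org/x/sys/unix (illustrative, not the actual wireguard-go code):

package sketch

import "golang.org/x/sys/unix"

// probeUDPGRO reports whether UDP_GRO can be enabled and then read
// back. Kernels before 5.12 accept the setsockopt but lack the
// getsockopt switch case, so the read-back fails there.
func probeUDPGRO(fd int) bool {
    if err := unix.SetsockoptInt(fd, unix.IPPROTO_UDP, unix.UDP_GRO, 1); err != nil {
        return false
    }
    v, err := unix.GetsockoptInt(fd, unix.IPPROTO_UDP, unix.UDP_GRO)
    return err == nil && v == 1
}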
Optimize message encoding by eliminating binary.Write (which internally
uses reflection) in favour of hand-rolled encoding.
This is a companion to 9e7529c3d2d0c54f4d5384c01645a9279e4740ae.
Synthetic benchmark:
var packetSink []byte
func BenchmarkMessageInitiationMarshal(b *testing.B) {
var msg MessageInitiation
b.Run("binary.Write", func(b *testing.B) {
b.ReportAllocs()
for range b.N {
var buf [MessageInitiationSize]byte
writer := bytes.NewBuffer(buf[:0])
_ = binary.Write(writer, binary.LittleEndian, msg)
packetSink = writer.Bytes()
}
})
b.Run("binary.Encode", func(b *testing.B) {
b.ReportAllocs()
for range b.N {
packet := make([]byte, MessageInitiationSize)
_, _ = binary.Encode(packet, binary.LittleEndian, msg)
packetSink = packet
}
})
b.Run("marshal", func(b *testing.B) {
b.ReportAllocs()
for range b.N {
packet := make([]byte, MessageInitiationSize)
_ = msg.marshal(packet)
packetSink = packet
}
})
}
Results:
│ - │
│ sec/op │
MessageInitiationMarshal/binary.Write-8 1.337µ ± 0%
MessageInitiationMarshal/binary.Encode-8 1.242µ ± 0%
MessageInitiationMarshal/marshal-8 53.05n ± 1%
│ - │
│ B/op │
MessageInitiationMarshal/binary.Write-8 368.0 ± 0%
MessageInitiationMarshal/binary.Encode-8 160.0 ± 0%
MessageInitiationMarshal/marshal-8 160.0 ± 0%
│ - │
│ allocs/op │
MessageInitiationMarshal/binary.Write-8 3.000 ± 0%
MessageInitiationMarshal/binary.Encode-8 1.000 ± 0%
MessageInitiationMarshal/marshal-8 1.000 ± 0%
Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
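The hand-rolled shape, sketched on a hypothetical two-field message (the
real MessageInitiation also carries ephemeral, static, timestamp, and
MAC fields; this is illustrative, not the actual device package code):

package sketch

import (
    "encoding/binary"
    "errors"
)

var errMessageLengthMismatch = errors.New("message length mismatch")

// msgHeader is a hypothetical stand-in for MessageInitiation.
type msgHeader struct {
    Type   uint32
    Sender uint32
}

const msgHeaderSize = 8

// marshal writes each field at a fixed offset, avoiding binary.Write's
// reflection and its intermediate allocations.
func (m *msgHeader) marshal(b []byte) error {
    if len(b) != msgHeaderSize {
        return errMessageLengthMismatch
    }
    binary.LittleEndian.PutUint32(b[0:4], m.Type)
    binary.LittleEndian.PutUint32(b[4:8], m.Sender)
    return nil
}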
This pairs with the recent change in wireguard-tools.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is already enforced in receive.go, but if these unmarshallers are
to have error return values anyway, make them as explicit as possible.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reduce allocations by eliminating the byte reader in favour of
hand-rolled decoding, and by reusing message structs.
Synthetic benchmark:
var msgSink MessageInitiation
func BenchmarkMessageInitiationUnmarshal(b *testing.B) {
packet := make([]byte, MessageInitiationSize)
reader := bytes.NewReader(packet)
err := binary.Read(reader, binary.LittleEndian, &msgSink)
if err != nil {
b.Fatal(err)
}
b.Run("binary.Read", func(b *testing.B) {
b.ReportAllocs()
for range b.N {
reader := bytes.NewReader(packet)
_ = binary.Read(reader, binary.LittleEndian, &msgSink)
}
})
b.Run("unmarshal", func(b *testing.B) {
b.ReportAllocs()
for range b.N {
_ = msgSink.unmarshal(packet)
}
})
}
Results:
│ - │
│ sec/op │
MessageInitiationUnmarshal/binary.Read-8 1.508µ ± 2%
MessageInitiationUnmarshal/unmarshal-8 12.66n ± 2%
│ - │
│ B/op │
MessageInitiationUnmarshal/binary.Read-8 208.0 ± 0%
MessageInitiationUnmarshal/unmarshal-8 0.000 ± 0%
│ - │
│ allocs/op │
MessageInitiationUnmarshal/binary.Read-8 2.000 ± 0%
MessageInitiationUnmarshal/unmarshal-8 0.000 ± 0%
Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
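The zero-allocation counterpart, decoding at fixed offsets into a reused
struct (same hypothetical msgHeader type as in the marshal sketch
earlier):

package sketch

import (
    "encoding/binary"
    "errors"
)

var errMessageLengthMismatch = errors.New("message length mismatch")

type msgHeader struct {
    Type   uint32
    Sender uint32
}

const msgHeaderSize = 8

// unmarshal decodes in place; callers may reuse m across packets, so
// steady-state decoding performs no allocations at all.
func (m *msgHeader) unmarshal(b []byte) error {
    if len(b) != msgHeaderSize {
        return errMessageLengthMismatch
    }
    m.Type = binary.LittleEndian.Uint32(b[0:4])
    m.Sender = binary.LittleEndian.Uint32(b[4:8])
    return nil
}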
It should be POLLIN because closeFd is a read-only file.
Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: ruokeqx <ruokeqx@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Use parallel summation with native byte order, per RFC 1071. An
add-with-carry operation is used to add 4 words per operation. A
byteswap is performed before and after checksumming for compatibility
with the old `checksumNoFold()`. With this we get a 30-80% speedup in
`checksum()` depending on packet size.
Add unit tests with comparison to a per-word implementation.
**Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz**
| Size | OldTime | NewTime | Speedup |
|------|---------|---------|----------|
| 64 | 12.64 | 9.183 | 1.376456 |
| 128 | 18.52 | 12.72 | 1.455975 |
| 256 | 31.01 | 18.13 | 1.710425 |
| 512 | 54.46 | 29.03 | 1.87599 |
| 1024 | 102 | 52.2 | 1.954023 |
| 1500 | 146.8 | 81.36 | 1.804326 |
| 2048 | 196.9 | 102.5 | 1.920976 |
| 4096 | 389.8 | 200.8 | 1.941235 |
| 8192 | 767.3 | 413.3 | 1.856521 |
| 9000 | 851.7 | 448.8 | 1.897727 |
| 9001 | 854.8 | 451.9 | 1.891569 |
**AMD EPYC 7352 24-Core Processor**
| Size | OldTime | NewTime | Speedup |
|------|---------|---------|----------|
| 64 | 9.159 | 6.949 | 1.318031 |
| 128 | 13.59 | 10.59 | 1.283286 |
| 256 | 22.37 | 14.91 | 1.500335 |
| 512 | 41.42 | 24.22 | 1.710157 |
| 1024 | 81.59 | 45.05 | 1.811099 |
| 1500 | 120.4 | 68.35 | 1.761522 |
| 2048 | 162.8 | 90.14 | 1.806079 |
| 4096 | 321.4 | 180.3 | 1.782585 |
| 8192 | 650.4 | 360.8 | 1.802661 |
| 9000 | 706.3 | 398.1 | 1.774177 |
| 9001 | 712.4 | 398.2 | 1.789051 |
Signed-off-by: Tu Dinh Ngoc <dinhngoc.tu@irit.fr>
[Jason: simplified and cleaned up unit tests]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
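For reference, the straightforward per-word RFC 1071 form that the
optimized version is tested against looks roughly like this (a sketch,
not the actual tun package code):

package sketch

import "encoding/binary"

// checksumNoFold sums 16-bit words into a wide accumulator. RFC 1071
// addition is commutative and associative, which is what permits the
// optimized version to sum several words per add-with-carry instead.
func checksumNoFold(b []byte, initial uint64) uint64 {
    ac := initial
    for len(b) >= 2 {
        ac += uint64(binary.BigEndian.Uint16(b))
        b = b[2:]
    }
    if len(b) == 1 {
        ac += uint64(b[0]) << 8 // pad the odd trailing byte on the right
    }
    return ac
}

// checksum folds the accumulator down to 16 bits, adding each carry
// back into the low word.
func checksum(b []byte, initial uint64) uint16 {
    ac := checksumNoFold(b, initial)
    for ac > 0xffff {
        ac = (ac >> 16) + (ac & 0xffff)
    }
    return uint16(ac)
}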
Colin's commit went a step further and protected tun.incomingPacket with
a lock on shutdown, but let's see if the tun.stack.Close() call actually
solves that on its own.
Suggested-by: kshangx <hikeshang@hotmail.com>
Suggested-by: Colin Adler <colin1adler@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Since 3c75945fd ("netstack: remove PacketBuffer.IsNil()") this has been
invalid. Follow the replacement pattern of that commit.
The old definition inlined to the same code anyway:
func (pk *PacketBuffer) IsNil() bool {
return pk == nil
}
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fixes: 3bb8fec ("conn, device, tun: implement vectorized I/O plumbing")
Reviewed-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The sync.Locker used with a sync.Cond must be acquired when changing
the associated condition, otherwise there is a window within
sync.Cond.Wait() where a wake-up may be missed.
Fixes: 4846070 ("device: use a waiting sync.Pool instead of a channel")
Reviewed-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
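The contract in miniature, on a simplified waiting pool (a hypothetical
type, loosely shaped like the device package's WaitPool):

package sketch

import "sync"

// waitPool hands out up to max tokens and blocks get() when exhausted.
type waitPool struct {
    mu    sync.Mutex
    cond  *sync.Cond
    count int
    max   int
}

func newWaitPool(max int) *waitPool {
    p := &waitPool{max: max}
    p.cond = sync.NewCond(&p.mu)
    return p
}

func (p *waitPool) get() {
    p.mu.Lock()
    for p.count >= p.max {
        p.cond.Wait() // atomically releases mu while sleeping
    }
    p.count++
    p.mu.Unlock()
}

func (p *waitPool) put() {
    p.mu.Lock()
    p.count-- // the condition only ever changes while mu is held
    p.mu.Unlock()
    p.cond.Signal()
}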
There is a possible deadlock in `device.Close()` when you try to close
the device very soon after its start. The problem is that two different
methods acquire the same locks in opposite order:
1. device.Close()
- device.ipcMutex.Lock()
- device.state.Lock()
2. device.changeState(deviceState)
- device.state.Lock()
- device.ipcMutex.Lock()
Reproducer:
func TestDevice_deadlock(t *testing.T) {
d := randDevice(t)
d.Close()
}
Problem:
$ go clean -testcache && go test -race -timeout 3s -run TestDevice_deadlock ./device | grep -A 10 sync.runtime_SemacquireMutex
sync.runtime_SemacquireMutex(0xc000117d20?, 0x94?, 0x0?)
/usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc000130518)
/usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
sync.(*Mutex).Lock(0xc000130518)
/usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
golang.zx2c4.com/wireguard/device.(*Device).Close(0xc000130500)
/Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:373 +0xb6
golang.zx2c4.com/wireguard/device.TestDevice_deadlock(0x0?)
/Users/martin.basovnik/git/basovnik/wireguard-go/device/device_test.go:480 +0x2c
testing.tRunner(0xc00014c000, 0x131d7b0)
--
sync.runtime_SemacquireMutex(0xc000130564?, 0x60?, 0xc000130548?)
/usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc000130750)
/usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
sync.(*Mutex).Lock(0xc000130750)
/usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
sync.(*RWMutex).Lock(0xc000130750)
/usr/local/opt/go/libexec/src/sync/rwmutex.go:147 +0x45
golang.zx2c4.com/wireguard/device.(*Device).upLocked(0xc000130500)
/Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:179 +0x72
golang.zx2c4.com/wireguard/device.(*Device).changeState(0xc000130500, 0x1)
Signed-off-by: Martin Basovnik <martin.basovnik@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Only bother updating the rxBytes counter once we've processed a whole
vector, since additions are atomic.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Peer.RoutineSequentialReceiver() deals with packet vectors and does not
need to perform timer and endpoint operations for every packet in a
given vector. Changing these per-packet operations to per-vector
improves throughput by as much as 10% in some environments.
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
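The shape of the change, sketched with hypothetical names (the real code
is Peer.RoutineSequentialReceiver()):

package sketch

type packet struct{ payload []byte }

func process(p *packet) {}

// afterVector stands in for the per-vector work: timer resets and the
// endpoint update now happen once per batch rather than once per packet.
func afterVector() {}

func consume(queue chan []*packet) {
    for batch := range queue {
        for _, p := range batch {
            process(p) // per-packet work stays in the inner loop
        }
        afterVector()
    }
}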
Access to Peer.endpoint was previously synchronized by Peer.RWMutex.
This has now moved to Peer.endpoint.Mutex. Peer.SendBuffers() is now the
sole caller of Endpoint.ClearSrc(), which is signaled via a new bool,
Peer.endpoint.clearSrcOnTx. Previous callers of Endpoint.ClearSrc() now
set this bool, primarily via peer.markEndpointSrcForClearing().
Peer.SetEndpointFromPacket() clears Peer.endpoint.clearSrcOnTx when an
updated conn.Endpoint is stored. This maintains the same event order as
before, i.e. a conn.Endpoint received after peer.endpoint.clearSrcOnTx
is set, but before the next Peer.SendBuffers() call, results in the
latest conn.Endpoint source being used for the next packet transmission.
These changes result in throughput improvements for single flow,
parallel (-P n) flow, and bidirectional (--bidir) flow iperf3 TCP/UDP
tests as measured on both Linux and Windows. Latency under load improves
especially for high throughput Linux scenarios. These improvements are
likely realized on all platforms to some degree, as the changes are not
platform-specific.
Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
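A sketch of the clearSrcOnTx pattern described above (field names follow
the message; the surrounding types are hypothetical stand-ins):

package sketch

import "sync"

// endpoint is a hypothetical stand-in for conn.Endpoint.
type endpoint struct{ src string }

func (e *endpoint) clearSrc() { e.src = "" }

type peer struct {
    endpoint struct {
        sync.Mutex
        val          *endpoint
        clearSrcOnTx bool // replaces direct ClearSrc() calls
    }
}

// markEndpointSrcForClearing defers the clear to the transmit path.
func (p *peer) markEndpointSrcForClearing() {
    p.endpoint.Lock()
    p.endpoint.clearSrcOnTx = true
    p.endpoint.Unlock()
}

// sendBuffers applies any pending clear on the transmit path, just
// before the endpoint is used.
func (p *peer) sendBuffers() {
    p.endpoint.Lock()
    if p.endpoint.clearSrcOnTx {
        p.endpoint.val.clearSrc()
        p.endpoint.clearSrcOnTx = false
    }
    ep := p.endpoint.val
    p.endpoint.Unlock()
    _ = ep // transmit via ep
}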
Implement UDP GSO and GRO for the Linux tun.Device, which is made
possible by virtio extensions in the kernel's TUN driver starting in
v6.2.
secnetperf, a QUIC benchmark utility from microsoft/msquic@8e1eb1a, is
used to demonstrate the effect of this commit between two Linux
computers with i5-12400 CPUs. There is roughly 13us of round-trip
latency between them. secnetperf was invoked with the following command
line options:
-stats:1 -exec:maxtput -test:tput -download:10000 -timed:1 -encrypt:0
The first result is from commit 2e0774f without UDP GSO/GRO on the TUN.
[conn][0x55739a144980] STATS: EcnCapable=0 RTT=3973 us
SendTotalPackets=55859 SendSuspectedLostPackets=61
SendSpuriousLostPackets=59 SendCongestionCount=27
SendEcnCongestionCount=0 RecvTotalPackets=2779122
RecvReorderedPackets=0 RecvDroppedPackets=0
RecvDuplicatePackets=0 RecvDecryptionFailures=0
Result: 3654977571 bytes @ 2922821 kbps (10003.972 ms).
The second result is with UDP GSO/GRO on the TUN.
[conn][0x56493dfd09a0] STATS: EcnCapable=0 RTT=1216 us
SendTotalPackets=165033 SendSuspectedLostPackets=64
SendSpuriousLostPackets=61 SendCongestionCount=53
SendEcnCongestionCount=0 RecvTotalPackets=11845268
RecvReorderedPackets=25267 RecvDroppedPackets=0
RecvDuplicatePackets=0 RecvDecryptionFailures=0
Result: 15574671184 bytes @ 12458214 kbps (10001.222 ms).
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
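A sketch of enabling the relevant TUN offloads, assuming
golang.org/x/sys/unix and its TUN_F_USO4/TUN_F_USO6 constants (the USO
flags need Linux 6.2+; illustrative, not the actual tun_linux.go):

package sketch

import "golang.org/x/sys/unix"

// enableOffloads requests checksum and TCP segmentation offload, plus
// UDP GSO/GRO (USO) where the kernel supports it, falling back to the
// TCP-only set on older kernels.
func enableOffloads(tunFd int) error {
    tcp := unix.TUN_F_CSUM | unix.TUN_F_TSO4 | unix.TUN_F_TSO6
    udp := unix.TUN_F_USO4 | unix.TUN_F_USO6 // Linux 6.2+
    if err := unix.IoctlSetInt(tunFd, unix.TUNSETOFFLOAD, tcp|udp); err == nil {
        return nil
    }
    return unix.IoctlSetInt(tunFd, unix.TUNSETOFFLOAD, tcp)
}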
The length of a packet read from the underlying TUN device may exceed
the length of a supplied buffer when MTU exceeds device.MaxMessageSize.
Reviewed-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
GRO requires big allocations to be efficient. This isn't great, as there
might be Android memory usage issues. So we should revisit this commit.
But at least it gets things working again.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Otherwise, in the event that we're using GSO without sticky sockets, we
pass garbage OOB buffers to sendmmsg() when GSO doesn't set its header,
resulting in EINVAL.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Android wants GSO but not sticky sockets.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Otherwise GRO gets enabled on Android, but the conn doesn't use it,
resulting in bundled packets being discarded.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This allows a kernel to support UDP_GRO while not supporting
UDP_SEGMENT.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Close closes the events channel, resulting in a panic from a send on a
closed channel.
Reported-By: Brad Fitzpatrick <brad@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Queue{In,Out}boundElement locking can contribute to significant
overhead via sync.Mutex.lockSlow() in some environments. These types
are passed throughout the device package as elements in a slice, so
move the per-element Mutex to a container around the slice.
Reviewed-by: Maisem Ali <maisem@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
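The shape of the change, sketched (hypothetical element type; the real
ones are QueueInboundElement and QueueOutboundElement):

package sketch

import "sync"

// queueElement is a hypothetical stand-in for Queue{In,Out}boundElement.
type queueElement struct {
    packet []byte
}

// queueElementsContainer guards the whole vector with a single Mutex,
// rather than every element carrying (and contending on) its own.
type queueElementsContainer struct {
    sync.Mutex
    elems []*queueElement
}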
IPv4 header and pseudo header checksums were being computed on every
merge operation. Additionally, virtioNetHdr was being written at the
same time. This delays those operations until after all coalescing has
occurred.
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: golang.zx2c4.com/wireguard/tun
cpu: 12th Gen Intel(R) Core(TM) i5-12400
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Checksum/64-12 10.670n ± 2% 4.769n ± 0% -55.30% (p=0.000 n=10)
Checksum/128-12 19.665n ± 2% 8.032n ± 0% -59.16% (p=0.000 n=10)
Checksum/256-12 37.68n ± 1% 16.06n ± 0% -57.37% (p=0.000 n=10)
Checksum/512-12 76.61n ± 3% 32.13n ± 0% -58.06% (p=0.000 n=10)
Checksum/1024-12 160.55n ± 4% 64.25n ± 0% -59.98% (p=0.000 n=10)
Checksum/1500-12 231.05n ± 7% 94.12n ± 0% -59.26% (p=0.000 n=10)
Checksum/2048-12 309.5n ± 3% 128.5n ± 0% -58.48% (p=0.000 n=10)
Checksum/4096-12 603.8n ± 4% 257.2n ± 0% -57.41% (p=0.000 n=10)
Checksum/8192-12 1185.0n ± 3% 515.5n ± 0% -56.50% (p=0.000 n=10)
Checksum/9000-12 1328.5n ± 5% 564.8n ± 0% -57.49% (p=0.000 n=10)
Checksum/9001-12 1340.5n ± 3% 564.8n ± 0% -57.87% (p=0.000 n=10)
geomean 185.3n 77.99n -57.92%
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
After reducing UDP stack traversal overhead via GSO and GRO,
runtime.chanrecv() began to account for a high percentage (20% in one
environment) of perf samples during a throughput benchmark. The
individual packet channel ops with the crypto goroutines were the
primary contributor to this overhead.
Updating these channels to pass vectors, which the device package
already handles at its ends, reduced this overhead substantially, and
improved throughput.
The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly 13us of
round-trip latency between them.
The first result is with UDP GSO and GRO, and with single element
channels.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 sender
[ 5] 0.00-10.04 sec 12.3 GBytes 10.6 Gbits/sec receiver
The second result is with channels updated to pass a slice of
elements.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 13.2 GBytes 11.3 Gbits/sec 182 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 13.2 GBytes 11.3 Gbits/sec 182 sender
[ 5] 0.00-10.04 sec 13.2 GBytes 11.3 Gbits/sec receiver
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
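The change in miniature (hypothetical element type; the real channels
carry the device package's queue elements):

package sketch

// element is a hypothetical stand-in for a queued packet.
type element struct{ packet []byte }

// Before, the channels carried single elements, costing one channel op
// per packet: make(chan *element, n). Passing vectors instead amortizes
// runtime.chanrecv() across the whole batch.
func consume(ch chan []*element) {
    for batch := range ch {
        for _, e := range batch {
            _ = e // encrypt/decrypt each element of the batch
        }
    }
}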
StdNetBind probes for UDP GSO and GRO support at runtime. UDP GSO is
dependent on checksum offload support on the egress netdev. UDP GSO
will be disabled in the event sendmmsg() returns EIO, which is a strong
signal that the egress netdev does not support checksum offload.
The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly 13us of
round-trip latency between them.
The first result is from commit 052af4a without UDP GSO or GRO.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 9.85 GBytes 8.46 Gbits/sec 1139 3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 9.85 GBytes 8.46 Gbits/sec 1139 sender
[ 5] 0.00-10.04 sec 9.85 GBytes 8.42 Gbits/sec receiver
The second result is with UDP GSO and GRO.
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 12.3 GBytes 10.6 Gbits/sec 232 sender
[ 5] 0.00-10.04 sec 12.3 GBytes 10.6 Gbits/sec receiver
Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
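A sketch of the runtime probing and the EIO fallback described here,
assuming golang.org/x/sys/unix (illustrative, not the actual bind code):

package sketch

import (
    "errors"

    "golang.org/x/sys/unix"
)

// supportsUDPOffload probes whether the socket accepts the GSO and GRO
// socket options at all; netdev checksum offload cannot be probed here.
func supportsUDPOffload(fd int) (gso, gro bool) {
    _, err := unix.GetsockoptInt(fd, unix.IPPROTO_UDP, unix.UDP_SEGMENT)
    gso = err == nil
    err = unix.SetsockoptInt(fd, unix.IPPROTO_UDP, unix.UDP_GRO, 1)
    gro = err == nil
    return gso, gro
}

// onSendErr disables GSO for the socket's remaining lifetime when the
// egress netdev signals missing checksum offload via EIO.
func onSendErr(err error, gsoEnabled *bool) {
    if errors.Is(err, unix.EIO) {
        *gsoEnabled = false
    }
}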
Signed-off-by: Dimitri Papadopoulos Orfanos <3234522+DimitriPapadopoulos@users.noreply.github.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: SpringHack <springhack@live.cn>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Replace the src storage inside StdNetEndpoint with a copy of the raw
control message buffer, to reduce allocation and perform less work on a
per-packet basis.
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
If an IPC operation is in flight while close starts, it is possible for
both processes to deadlock. Prevent this by taking the IPC lock at the
start of close and holding it for the duration.
Signed-off-by: James Tucker <jftucker@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
tcpGRO() was using an incorrect IPv4 more-fragments bit mask.
tcpPacketsCanCoalesce() was not distinguishing tcp6 from tcp4, and TTL
values were not compared. TTL values should be equal at the IP layer;
otherwise, the packets should not coalesce. This tracks with the kernel.
Reviewed-by: Denton Gentry <dgentry@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
IP options were not being compared prior to coalescing. They are not
commonly used. Disqualification due to nonzero options is in line with
the kernel.
Reviewed-by: Denton Gentry <dgentry@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
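Taken together, this entry and the previous one amount to IPv4-level
disqualifiers along these lines (a sketch over raw header bytes, not the
actual tun code):

package sketch

// ipv4CanCoalesce applies the IP-layer checks from the two GRO fixes
// above to a pair of packets with well-formed IPv4 headers.
func ipv4CanCoalesce(a, b []byte) bool {
    // More-fragments is bit 0x20 of byte 6 (flags/fragment-offset field).
    if a[6]&0x20 != 0 || b[6]&0x20 != 0 {
        return false
    }
    // TTLs (byte 8) must be equal at the IP layer, as in kernel GRO.
    if a[8] != b[8] {
        return false
    }
    // Nonzero IP options (IHL nibble > 5) disqualify coalescing.
    if a[0]&0x0f != 5 || b[0]&0x0f != 5 {
        return false
    }
    return true
}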
This results in a more compact structure.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Looks like a simple copy-and-paste error.
Fixes: 9e2f386 ("conn, device, tun: implement vectorized I/O on Linux")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
In 9e2f386 ("conn, device, tun: implement vectorized I/O on Linux"), the
Linux-specific Bind implementation was collapsed into StdNetBind. This
introduced a race on StdNetEndpoint from getSrcFromControl() and
setSrcControl().
Remove the sync.Pool involved in the race, and simplify StdNetBind's
receive path to allocate StdNetEndpoint on the heap instead, with the
intent for it to be cleaned up by the GC later. This essentially
reverts ef5c587 ("conn: remove the final alloc per packet receive"),
adding back that allocation, unfortunately.
This does slightly increase resident memory usage in higher throughput
scenarios. StdNetBind is the only Bind implementation that was using
this Endpoint recycling technique prior to this commit.
This is considered a stop-gap solution, and there are plans to replace
the allocation with a better mechanism.
Reported-by: lsc <lsc@lv6.tw>
Link: https://lore.kernel.org/wireguard/ac87f86f-6837-4e0e-ec34-1df35f52540e@lv6.tw/
Fixes: 9e2f386 ("conn, device, tun: implement vectorized I/O on Linux")
Cc: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
We can't have the netlink listener socket, so it's not possible to
support it. Plus, the complexity of the Android networking stack makes
it a bit tricky anyway, so it's best to leave it disabled.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Maisem Ali <maisem@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>