| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
| |
ix(4) wants an array of uint32_ts to push into 32bit registers.
|
|
|
|
|
| |
this lets things calling bpf_mtap_hdr and related functions also
populate the extended bpf_hdr with the rcvif and prio and stuff.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the metadata is set if the mbuf is passed with an m_pkthdr, and
copies the mbuf's rcvif, priority, and flowid. it also carries the
direction of the packet.
it also makes bpf_hdr a multiple of 4 bytes, which simplifies some
calculations a bit. it also requires no changes in userland because
libpcap just thinks the extra bytes in the header are padding and
skips over them to the payload.
this helps me verify things like whether the stack and a network
card agree about toeplitz hashes, and paves the way for doing more
interesting packet captures. being able to see where a packet came
from as it is leaving a machine is very useful.
ok mpi@
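As a rough illustration of the layout described above, here is a sketch of
what such an extended header can look like. The field names and types below
are hypothetical, chosen for the example; the real struct lives in net/bpf.h.

        /* sketch only: a capture header whose size is a multiple of 4 bytes.
         * libpcap treats everything between the standard fields and the
         * payload as padding, so existing userland keeps working. */
        struct bpf_hdr_ext_sketch {
                struct bpf_timeval      bh_tstamp;      /* time stamp */
                uint32_t                bh_caplen;      /* captured length */
                uint32_t                bh_datalen;     /* original length */
                uint16_t                bh_hdrlen;      /* length of this header */
                /* hypothetical extended metadata */
                uint16_t                bh_ifidx;       /* rcvif interface index */
                uint16_t                bh_flowid;      /* mbuf flow id */
                uint8_t                 bh_pri;         /* mbuf priority */
                uint8_t                 bh_flags;       /* direction: in or out */
        };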
|
|
|
|
|
|
|
|
|
|
|
| |
listed below.
- pipex_init_session() to check request and alloc new session.
- pipex_link_session() to link session to pipex(4) layer.
- pipex_unlink_session() to unlink session from pipex(4) layer.
- pipex_rele_session() to release session and its internal allocation.
ok mpi@
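A rough sketch of the session lifecycle implied by this split; the argument
lists below are assumptions for illustration, not the actual prototypes.

        struct pipex_session *session;
        int error;

        /* validate the userland request and allocate a new session */
        error = pipex_init_session(&session, req);
        if (error)
                return (error);

        /* publish the session to the pipex(4) layer */
        error = pipex_link_session(session, ifp);
        if (error) {
                pipex_rele_session(session);
                return (error);
        }

        /* later, on teardown: detach first, then free */
        pipex_unlink_session(session);
        pipex_rele_session(session);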
|
|
|
|
|
|
| |
functions further.
ok dlg
|
|
|
|
|
|
|
| |
xor laddr and faddr and the ports together and only then fold the
32 bits into 16 bits.
ok dlg
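A sketch of the resulting shape, relying on the linearity of the hash; the
fold and the exact packing of the ports are illustrative, not the in-tree
code.

        uint16_t
        stoeplitz_hash_ip4port_sketch(const struct stoeplitz_cache *scache,
            in_addr_t faddr, in_addr_t laddr, in_port_t fport, in_port_t lport)
        {
                uint32_t n32;

                /* xor all the inputs together first... */
                n32 = faddr ^ laddr ^ ((uint32_t)fport << 16 | lport);

                /* ...then fold 32 bits down to 16 once and hash once */
                return (stoeplitz_hash_n16(scache,
                    (uint16_t)(n32 ^ (n32 >> 16))));
        }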
|
|
|
|
|
|
|
|
|
|
| |
ditch half of the calculations by merging the computation of hi and lo,
only splitting at the end. This allows us to leverage stoeplitz_hash_n16().
The name lo is now wrong. I kept it in order to avoid noise. I'm going to
clean this up in the next step.
ok dlg
|
|
|
|
|
|
|
|
|
|
|
| |
multiplication H * val in stoeplitz_cache_entry(scache, val), so the
identity (H * x) ^ (H * y) == H * (x ^ y) allows us to push the calls to
the cache function down to the end of stoeplitz_hash_ip{4,6}{,port}().
The identity in question was also confirmed on amd64, sparc64 and powerpc
for all possible values of skey, x and y.
ok dlg
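Concretely, the identity lets per-byte cache lookups be merged. A sketch,
using stoeplitz_cache_entry() as the per-byte lookup named above:

        uint16_t h;

        /* two lookups xored together... */
        h = stoeplitz_cache_entry(scache, x) ^ stoeplitz_cache_entry(scache, y);

        /* ...give the same result as xoring the inputs and looking up once */
        h = stoeplitz_cache_entry(scache, x ^ y);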
|
|
|
|
|
| |
i've been wanting to do this for a while, and now that we've got
stoeplitz and it gives us 16 bits, it seems like the right time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
stoeplitz_cache and bring them into a form more suitable for mathematical
reasoning. Add a comment explaining the full construction which will also
help justify upcoming diffs.
The observations for the code changes are the following:
First, scache->bytes[val] is a uint16_t, and we only need the lower
16 bits of res in the second nested pair of for loops. The values of
key[b] are only xored together to compute res, so we only need the lower
16 bits of those, too.
Second, looking at the first nested for loop, we see that the values 0..15
of j only touch the top 16 bits of key[b], so we can skip them. For b = 0,
the inner loop for j in 16..31 scans backwards through skey and sets the
corresponding bits of key[b], so key[0] = skey. A bit of pondering then
leads to key[b] = skey << b | skey >> (NBSK - b).
The key array is renamed to column since it stores columns of the
Toeplitz matrix.
It's not very expensive to brute-force verify that scache->bytes[val]
remains the same for all values of val and all values of skey. I did
this on amd64, sparc64 and powerpc.
ok dlg
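A condensed sketch of the construction described above. NBSK is the bit
width of the 16-bit key; the bit ordering here follows the usual Toeplitz
convention (most significant bit first) and may differ from the in-tree loop.

        #define NBSK 16

        void
        stoeplitz_cache_init_sketch(uint16_t bytes[256], uint16_t skey)
        {
                uint16_t column[8];
                unsigned int b, val;

                /* column[b] is the key rotated left by b bits, i.e. a
                 * column of the Toeplitz matrix; column[0] == skey. */
                column[0] = skey;
                for (b = 1; b < 8; b++)
                        column[b] = skey << b | skey >> (NBSK - b);

                /* bytes[val] is the xor of the columns selected by the
                 * set bits of val. */
                for (val = 0; val < 256; val++) {
                        uint16_t res = 0;

                        for (b = 0; b < 8; b++) {
                                if (val & (1 << (7 - b)))
                                        res ^= column[b];
                        }
                        bytes[val] = res;
                }
        }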
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is another bit of the puzzle for supporting multiple rx rings
and receive side scaling (RSS) on nics. It borrows heavily from
DragonflyBSD, but I've made some tweaks on the way.
The interesting bit that dfly came up with is a way to use Toeplitz
hashing so the kernel AND network interfaces hash packets in both
directions onto the same bucket. The other interesting thing is that
they optimised the hash calculation by building a cache of all the
intermediate results possible for each input byte. Their hash
calculation is then simply xoring these intermediate results together.
So this diff adds an API for the kernel to use for calculating a
hash for ip addresses and ports, and adds a function for network
drivers to call that gives them a key to use with RSS. If all drivers
use the same key, then the same flows should be steered to the same
place when they enter the network stack regardless of which hardware
they came in on.
The changes I made relative to dfly are around reducing the size
of the caches. DragonflyBSD builds a cache of 32bit values, but
because of how the Toeplitz key is constructed, the 32bits are made
up of a repeated 16bit pattern. We can just store the 16 bits and
reconstruct the 32 bits if we want. Both we and dragonfly only
use 15 or 16 bits of the result anyway, so 32 bits is unnecessary.
Secondly, the dfly implementation keeps a cache of values for the
high and low bytes of input, but the values in the two caches are
almost the same. You can byteswap the values in one of the byte
caches to get the values for the other, but you can also just
byteswap values at runtime to get the same value, which is what
this implementation does. The result of both these changes is that
the byte cache is a quarter of the size of the one in dragonflybsd.
tb@ has had a close look at this and found a bunch of other
optimisations that can be implemented, and because he's a
wizard^Wmathematician he has proofs (and also did tests).
ok tb@ jmatthew@
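To make the byte cache and the byteswap point concrete, here is a minimal
sketch of hashing 16 bits of input with a single 256-entry cache of 16-bit
intermediate results. It is illustrative only; which half gets swapped
depends on how the cache is laid out.

        uint16_t
        hash_n16_sketch(const uint16_t bytes[256], uint16_t n16)
        {
                uint16_t hi, lo;

                hi = bytes[n16 >> 8];           /* first input byte */
                lo = bytes[n16 & 0xff];         /* second input byte */

                /* the second byte's cache entries are the first byte's
                 * entries byteswapped, so swap at runtime instead of
                 * keeping a second cache */
                return (hi ^ (lo << 8 | lo >> 8));
        }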
|
|
|
|
|
|
|
| |
Reported by Prof. Dr. Steffen Wendzel <wendzel @ hs-worms . de>,
thanks!
OK martijn@ sthen@
|
|
|
|
|
|
|
|
| |
couldn't find an entry if its table is attached to a table on the root.
This fixes the problem that "route-to <TABLE> least-states" doesn't work.
The problem was found by IIJ.
OK sashan
|
|
|
|
|
|
|
|
| |
the intersection of the capabilities of the ports, allowing use of
vlan and checksum offloads if supported by all ports. Since this works
the same way as updating hardmtu, do them both at the same time.
ok dlg@
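A sketch of the recalculation; the structure and member names below are
placeholders for the example, not necessarily the trunk(4) ones.

        struct trunk_port *tp;
        uint32_t cap = ~0U;

        /* the aggregate interface advertises only what every port supports */
        SLIST_FOREACH(tp, &tr->tr_ports, tp_entries)
                cap &= tp->tp_if->if_capabilities;

        ifp->if_capabilities = (cap == ~0U) ? 0 : cap;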
|
|
|
|
|
|
|
|
|
|
| |
pipex output is part of pppx and pppac if_start functions, so it
can't rely on or know if it already has NET_LOCK. this defers the
ip output stuff to where it can take the NET_LOCK reliably.
tested by Vitaliy Makkoveev, who also found that this was necessary
after ifq.c 1.38 and provided an excellent analysis of the problem.
ok mpi@
|
|
|
|
|
|
| |
This helps in case of a context switch inside if_detach().
From Vitaliy Makkoveev.
|
|
|
|
|
|
| |
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which is already included
by all enqueue_randomness() users.
|
|
|
|
|
|
|
|
|
|
| |
Since our last concurrency mistake only the ioctl(2) and sysctl(2) code
paths take the reader lock. This is mostly for documentation purposes until
the softnet thread is converted back to use a read lock.
dlg@ said that comments should be good enough.
ok sashan@
|
|
|
|
|
|
|
|
|
|
| |
Input bits of the mbuf list head to enqueue_randomness(). While the set
of mbufs in circulation is relatively stable, the order in which they
reach if_input_process() is unpredictable. Shuffling can happen in many
subsystems, such as the network stack, device drivers, and memory
management.
OK deraadt@ mpi@
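A sketch of the idea; the exact expression fed to the pool may differ from
the in-tree code.

        /* the mbuf list head pointer varies with allocation order and
         * scheduling, so mix some of its bits into the entropy pool */
        enqueue_randomness((u_int)(unsigned long)ml->ml_head);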
|
| |
|
|
|
|
| |
From Vitaliy Makkoveev.
|
|
|
|
|
|
|
|
|
| |
Premature locking is causing more trouble than it is solving issues. In this
case the lifetime of descriptors is protected by the KERNEL_LOCK() so using
a rwlock for the lookup introduces sleeping points and possible new races
without benefit.
From Vitaliy Makkoveev.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if we use the ifq to move packet processing to another context,
it's too easy to fill up the one slot and cause packet loss.
the ifq len was set to 1 to avoid delays produced by the original
implementation of tx mitigation. however, trunk now introduces
latency because it isn't mpsafe yet, which causes the network stack
to have to take the kernel lock for each packet, and the kernel
lock can be quite contended. i want to use the ifq to move the
packet to the systq thread (which already has the kernel lock)
before trunk is asked to transmit it.
tested by mark patruck and myself.
|
|
|
|
|
|
|
|
| |
mark patruck found significant packet drops with trunk(4), and there are
some reports that pppx or pipex relies on some implicit locking
that it shouldn't.
i can fix those without this diff being in the tree.
|
|
|
|
|
|
| |
From Sergey Ryazanov.
Reviewed by Vitaliy Makkoveev, ok claudio@
|
|
|
|
|
|
|
|
|
|
|
|
| |
this reuses the tx mitigation machinery, but instead of deferring
some start calls to the nettq, it defers all calls to the systq.
this is to avoid taking the KERNEL_LOCK while processing packets
in the stack.
i've been running this in production for 6 or so months, and the
start of a release is a good time to get more people trying it too.
ok jmatthew@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.
This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.
This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.
With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.
Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.
With insight from dlg@ regarding the purpose of the read timeout.
ok dlg@
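A rough sketch of the ioctl side of the separation; simplified, with
validation omitted, and not the exact in-tree code.

        case FIONBIO:           /* non-blocking reads */
                d->bd_rnonblock = *(int *)addr ? 1 : 0;
                break;          /* the read timeout is left untouched */

        case BIOCSRTIMEOUT:     /* read timeout, still kept in ticks */
                d->bd_rtout = tvtohz((struct timeval *)addr);
                break;          /* non-blocking status is left untouched */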
|
| |
|
|
|
|
|
|
|
|
|
|
| |
descriptors runs below the low watermark.
The em(4) firmware seems not to work properly with just a few descriptors in
the receive ring. Thus, we use the low water mark as an indicator instead of
waiting for zero descriptors, which caused deadlocks.
ok kettenis@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
encryption or decryption. This allows us to keep plaintext and encrypted
network traffic separated and reduces the attack surface for network
side-channel attacks.
The only way to reach the inner rdomain from outside is by successful
decryption and integrity verification through the responsible Security
Association (SA).
The only way for internal traffic to get out is getting encrypted and
moved through the outgoing SA.
Multiple plaintext rdomains can share the same encrypted rdomain while
the unencrypted packets are still kept separate.
The encrypted and unencrypted rdomains can have different default routes.
The rdomains can be configured with the new SADB_X_EXT_RDOMAIN pfkey
extension. Each SA (tdb) gets a new attribute 'tdb_rdomain_post'.
If this differs from 'tdb_rdomain' then the packet is moved to
'tdb_rdomain_post' after IPsec processing.
Flows and outgoing IPsec SAs are installed in the plaintext rdomain,
incoming IPsec SAs are installed in the encrypted rdomain.
IPCOMP SAs are always installed in the plaintext rdomain.
They can be viewed with 'route -T X exec ipsecctl -sa' where X is the
rdomain ID.
As the kernel does not create encX devices automatically when creating
rdomains they have to be added by hand with ifconfig for IPsec to work
in non-default rdomains.
discussed with chris@ and kn@
ok markus@, patrick@
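A minimal sketch of that switch for an inbound packet; only the tdb field
names from the message above are taken from the change, the rest is
simplified for illustration.

        /* after successful decryption and integrity verification, move
         * the packet into the post-processing rdomain before it re-enters
         * the stack */
        if (tdb->tdb_rdomain_post != tdb->tdb_rdomain)
                m->m_pkthdr.ph_rtableid = tdb->tdb_rdomain_post;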
|
|
|
|
|
|
|
|
| |
ignored. Initialize 'error' to 0.
CID 1483380
ok mpi@
|
|
|
|
|
|
|
| |
Reported-by: syzbot+d0639632a0affe0a690e@syzkaller.appspotmail.com
Reported-by: syzbot+ae5e359d7f82688edd6a@syzkaller.appspotmail.com
OK anton@
|
|
|
|
|
|
| |
when using pppac without pipex.
ok dlg
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
address as the one trying to be inserted.
Such an entry must stay in the table as long as its parent route exists. If
a code path tries to re-insert a route with the same destination address
on the same interface, it is a bug.
Avoid the "route contains no arp information" problem reported by sthen@
and Laurent Salle.
ok claudio@
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prevent concurrency in the socket layer which is not ready for that.
Two recent data corruptions in pfsync(4) and the socket layer pointed
out that, at least, tun(4) was incorrectly using NET_RUNLOCK(). Until
we find a way in software to avoid future mistakes and to make sure that
only the softnet thread and some ioctls are safe to use a read version
of the lock, put everything back to the exclusive version.
ok stsp@, visa@
|
|
|
|
|
|
| |
it needs NET_LOCK because it modifies if_flags and if_pcount.
ok visa@
|
|
|
|
|
|
|
|
| |
if_pcount is only touched in ifpromisc(), and ifpromisc() needs
NET_LOCK anyway because it also modifies if_flags.
suggested by mpi@
ok visa@
|
|
|
|
|
|
|
|
|
|
|
| |
aggr_p_dtor() calls ifpromisc(), and ifpromisc() callers need to
be holding NET_LOCK to make changes to if_flags and if_pcount, and
before calling the interface's ioctl to apply the flag change.
i found this while reading code with my eyes, and was able to trigger
the NET_ASSERT_LOCKED in the vlan_ioctl path.
ok visa@
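The ordering looks roughly like this (a sketch, assuming an error variable
and a port interface pointer ifp0 in scope):

        /* if_flags and if_pcount are changed, and the interface's
         * SIOCSIFFLAGS ioctl is called, so hold the net lock across it */
        NET_LOCK();
        error = ifpromisc(ifp0, 0);
        NET_UNLOCK();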
|
|
|
|
|
|
|
|
|
|
|
|
| |
tpmr_p_dtor() calls ifpromisc(), and ifpromisc() callers need to
be holding NET_LOCK to make changes to if_flags and if_pcount, and
before calling the interface's ioctl to apply the flag change.
found by hrvoje popovski who was testing tpmr with vlan interfaces.
vlan(4) asserts that the net lock is held in its ioctl path, which
started this whole bug hunt.
ok visa@ (who came up with a similar diff, which hrvoje tested)
|
|
|
|
|
|
|
|
|
| |
promiscuous mode from bridge(4). This fixes a regression of r1.332
of sys/net/if_bridge.c.
splassert with bridge(4) and vlan(4) reported by David Hill
OK mpi@, dlg@
|
| |
|
|
|
|
| |
ok mpi@
|
|
|
|
|
|
|
|
|
| |
Prevent a data corruption on a UDP receive socket buffer reported by
procter@ who triggered it with wireguard-go.
The symptoms are underflow of sb_cc/sb_datacc/sb_mcnt.
ok visa@
|
| |
|
| |
|