| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
For example the bridge_ioctl() function calls NET_UNLOCK() unconditionally
and so calling if_ioctl() without netlock will trigger an assert because
of not holding the netlock. Make sure the ioctl handlers are called with
the netlock held and drop the lock for the wg(4) specific ioctls in the
wg_ioctl handler. This fixes a panic in bridge_ioctl() triggered by
ifconfig(8) issuing a SIOCGWG ioctl against bridge(4).
This is just a workaround this needs more cleanup but at least this way
the panic can not be triggered anymore.
OK stsp@, tested by semarie@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
livelock detection used to rely on code running at softnet blocking
the softclock handling at a lower interrupt priority level. if the
hard clock interrupt count diverged from one kept by a timeout, we
assumed the network stack was doing too much work and we should
apply backpressure to the receptions of packets.
the network stack doesnt really block timeouts from firing anymore
though. this is especially true on MP systems, because timeouts
fire on cpu0 and the nettq thread could be somewhere else entirely.
this means network activity doesn't make the softclock lose ticks,
which means we aren't scaling rx ring activity like we think we
are.
the alternative way to detect livelock is when a driver queues
packets for the stack to process, if there's too many packets built
up then the input routine return value tells the driver to slow
down. this enables finer grained livelock detection too. the rx
ring accounting is done per rx ring, and each rx ring is tied to a
specific nettq. if one of them is going too fast it shouldn't affect
the others. the tick based detection was done system wide and
punished all the drivers.
ive converted all the drivers to the new mechanism. let's see how
we go with it.
jmatthew@ confirms rings still shrink, so some backpressure is being
applied.
|
|
|
|
|
|
|
|
|
|
|
| |
thanks to Matt Dunwoodie and Jason A. Donenfeld for their effort.
it's at least as functional as the go implementation, and maybe
more so since this one works on more architectures.
i'm sure there's further development that can be done, but you can
say that about anything and everything that's in the tree.
ok deraadt@
|
|
|
|
|
| |
i've been wanting to do this for a while, and now that we've got
stoeplitz and it gives us 16 bits, it seems like the right time.
|
|
|
|
|
|
| |
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which are already included
by all enqueue_randomness() users.
|
|
|
|
|
|
|
|
|
|
| |
Since our last concurrency mistake only ioctl(2) ans sysctl(2) code path
take the reader lock. This is mostly for documentation purpose as long as
the softnet thread is converted back to use a read lock.
dlg@ said that comments should be good enough.
ok sashan@
|
|
|
|
|
|
|
|
|
|
| |
Input bits of the mbuf list head to enqueue_randomness(). While the set
of mbufs in circulation is relatively stable, the order in which they
reach if_input_process() is unpredictable. Shuffling can happen in many
subsystems, such as the network stack, device drivers, and memory
management.
OK deraadt@ mpi@
|
|
|
|
|
|
| |
it needs NET_LOCK because it modifies if_flags and if_pcount.
ok visa@
|
|
|
|
|
|
|
|
|
| |
Prevent a data corruption on a UDP receive socket buffer reported by
procter@ who triggered it with wireguard-go.
The symptoms are underflow of sb_cc/sb_datacc/sb_mcnt.
ok visa@
|
|
|
|
|
| |
Feedback from and ok dlg@
ok kn@ todd@
|
|
|
|
| |
ok tedu@ krw@ deraadt@
|
|
|
|
|
|
|
|
|
|
|
| |
if_detach passes the groupname from an ifg_list struct to if_delgroup,
if_delgroup then uses the name to find the same ifg_list struct so
it can free it, and then passes the name from the struct to
pfi_group_change(). at worst this can cause a fault if malloc(9)
actually unmaps the page the struct was on, and at best it causes
pf interfaces with garbage names to be created.
ok sashan@ bluhm@
|
|
|
|
|
| |
of a network interface.
OK deraadt@ claudio@
|
| |
|
|
|
|
|
|
|
|
|
| |
there's now a bunch of drivers that implement the bridge ioctls,
but they're inconsistent at checking privilege. doing it up front
once means less code duplication, and more consistent application
of the checks.
ok bluhm@ deraadt@
|
|
|
|
|
| |
found by Ilja Van Sprundel
ok deraadt@ mpi@ bluhm@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
when vxlans parent interface has a link state change event, vxlan
reconfigures the parent to cope with things not being as it expects
when the interface comes back. it does this by removing its config
and then adding it again. part of it's config removal is to take
the link state hook away, and part of putting the config on is is
adding the link state hook.
if we're running an interfaces link state hooks from head to tail,
and the vxlan hook adds itself back to the tail, we end up running
the vxlan hook forever cos it always ends up at the tail.
bluhm@ hit this infinite loop while running regress tests. if turns
out we need to run link state hooks in the same order they were
added, i have a way to avoid this situation, but this is simple.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
this follows what's been done for detach and link state hooks, and
makes handling of hooks generally more robust.
address hooks are a bit different to detach/link state hooks in
that there's only a few things that register hooks (carp, pf, vxlan),
but a lot of places to run the hooks (lots of ipv4 and ipv6 address
configuration).
an address hook cookie was in struct pfi_kif, which is part of the
pf abi. rather than break pfctl -sI, this maintains the void * used
for the cookie and uses it to store a task, which is then used as
intended with the new api.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
i had NET_ASSERT_LOCKED() in the hook add and remove operations,
because that's what's held when the hooks are run. some callers do
not hold the NET_LOCK when calling them though, eg, bridge(4). aggr
and tpmr used to not hold NET_LOCK while being destroyed, which
also caused the asserts to fire, so i moved the port destroys inside
NET_LOCK, but now I have deadlocks with some barrier calls.
the hooks having their own lock means callers don't have to hold
NET_LOCK and the list will stay sane. the code that runs the hooks
gives up the mutex when calling the hook, but keeps track of where
it's up to bey putting a cursor in the list.
there's a single global mutex for all the interface linkstate and
detach hooks, but this stuff isn't a hot path by any stretch of the
imagination.
based on (a lot of) testing by hrvoje popovski. thank you.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
this is largely mechanical, except for carp. this moves the addition
of the carp link state hook after we're committed to using the new
interface as a carpdev. because the add can't fail, we avoid a
complicated unwind dance. also, this tweaks the carp linkstate hook
so it only updates the relevant carp interface, not all of the
carpdevs on the parent.
hrvoje popovski has tested an early version of this diff and it's
generally ok, but there's some splasserts that this diff fires that
i'll fix in an upcoming diff.
ok claudio@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the main semantic change is that things registering detach hooks
have to allocate and set a task structure that then gets added to
the list. this means if the task is allocated up front (eg, as part
of carps softc or bridges port structure), it avoids the possibility
that adding a hook can fail. a lot of drivers weren't checking for
failure, and unwinding state in the event of failure in other parts
was error prone.
while doing this i discovered that the list operations have to be
in a particular order, but drivers weren't doing that consistently
either. this diff wraps the list ops up so you have to seriously
go out of your way to screw them up.
ive also sprinkled some NET_ASSERT_LOCKED around the list operations
so we can make sure there's no potential for the list to be corrupted,
especially while it's being run.
hrvoje popovski has tested this a bit, and some issues he discovered
have been fixed.
ok sashan@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
for IPv6 link local addresses.
Some hosting and VM providers route customer IPv6 prefixes to link
local addresses derived from ethernet MAC addresses (RFC 2464). This
leads to hard to debug IPv6 connectivity problems and is probably not
worth the effort.
RFC 7721 lists 4 weaknesses:
3.1. Correlation of Activities over Time & 3.2. Location Tracking
These are still possible with RFC 7217 addresses for an adversary
connected to the same layer 2 network (think conference wifi). Since
the link local prefix stays the same (fe80::/64) the link local
addresses do not change between different networks.
An adversary on the same layer 2 network can probably track ethernet
MAC addresses via different means, too.
3.3. Address Scanning & 3.4. Device-Specific Vulnerability Exploitation
These now become possible, however, as noted above a layer 2 adversary
was probably able to do this via different means.
People concerned with these weaknesses are advised to use
ifconfig lladdr random.
OK benno
input & OK kn
|
|
|
|
|
|
|
|
|
|
|
| |
introduced a queue to grab the lock for multiple packets. Now we
have only netlock for both IP and protocol input. So the queue is
not necessary anymore. It just switches CPU and decreases performance.
So remove the inet and inet6 ip queue for local packets.
To get TCP running on loopback, we have to queue once between TCP
input and output of the two sockets. So use the loopback queue in
looutput() unconditionally.
OK visa@
|
| |
|
|
|
|
| |
ok kn@
|
|
|
|
| |
OK mpi@
|
|
|
|
| |
OK mpi@
|
|
|
|
|
|
| |
Fix a regression introduced by the bridge(4) refactoring.
Found by and ok bluhm@
|
|
|
|
|
|
|
|
| |
This redefines the ifp <-> bridge relationship. No lock can be
currently used across the multiples contexts where the bridge has
tentacles to protect a pointer, use an interface index.
Tested by various, ok dlg@, visa@
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if_vinput assumes that the interface that its called against uses
per cpu counters so it can count input packets, but basically does
all the things that if_input and ifiq_input do. the main difference
is it assumes the network stack is already running and runs the
interface input handlers directly. this is instead of queuing the
packets for a nettq to run.
ifiqs arent free, especially when they only run per packet like
they do on psuedo interfaces. this allows that overhead to be
bypassed.
|
|
|
|
| |
this is a step toward letting interfaces like vlan bypass ifiqs
|
|
|
|
|
|
| |
l2 and l3 drivers do the same thing all the time, so reduce the
chance of error by doing the checks once and making it available
for drivers to call instead of rolling on their own again.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the idea is to call the hardware transmit routine less since in a
lot of cases posting a producer ring update to the chip is (very)
expensive. it's better to do it for several packets instead of each
packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network
taskq, or until a backlog of packets has built up. dragonflybsd
uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i
discovered that the deadlocks in the previous version was from
ifq_barrier calling taskq_barrier against the nettq. interfaces
generally hold NET_LOCK while calling ifq_barrier, but the tq might
already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it
instead relies on the IFF_RUNNING flag and normal ifq serialiser
barrier to guarantee the start routine wont be called when an
interface is going down. the taskq_barrier is only used during
interface destruction to make sure the task struct wont get used
in the future, which is already done without the NET_LOCK being
held.
tx mitigation provides a nice performanace bump in some setups. up
to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production).
ok visa@
|
|
|
|
|
|
|
|
|
|
|
|
| |
the stack uses the NET_LOCK for most protection now, so it doesnt
need to block actual hardware interrupts. blocking hw interrupts
can cause huge latency spikes, which in turn works against the rx
ring moderation.
im putting this in early in the release cycle so it can get the
most testing possible.
ok mpi@ (a while back)
|
|
|
|
|
|
|
|
|
|
|
| |
this should only be used by root, and it should not tak the NET_LOCK
because a bunch of i2c reads can take a relatively long time during
which packets would be blocked.
while here make sure userland only requests pages from the eeprom
and diag i2c addresses.
ok deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
previously ifiq_input uses the traditional backpressure or defense
mechanism and counts packets to decide when to shed load by dropping.
currently it ends up waiting for 10240 packets to get queued on the
stack before it would decide to drop packets. this may be ok for
some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by
introducing an ifiq_pressure counter that is incremented when
ifiq_input is called, and cleared when ifiq_process calls the network
stack to process the queue. if ifiq_input is called multiple times
before ifiq_process in a net taskq runs, ifiq_pressure goes up, and
ifiq_input uses a high value to decide the stack is busy and it
should drop.
i was hoping there would be no performance impact from this change,
but hrvoje popovski notes a slight bump in forwarding performance.
my own testing shows that the ifiq input list length grows to a
fraction of the 10240 it used to get to, which means the maximum
burst of packets through the stack is smoothed out a bit. instead
of big lists of packets followed by big periods of drops, we get
relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring
moderation to use the return value of ifiq_input to scale the ring
allocation down, allowing the hardware to drop packets so software
doesnt have to.
|
|
|
|
| |
part of a larger diff ok mpi@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if_enqueue() still makes sure packets get handled by pf on the way
out, and seen by bridge if needed. however instead of falling through
to ifq mapping and output, it now calls a function pointer in the
ifnet struct. that pointer defaults to the ifq handling, but drivers
can override it to bypass ifq processing.
the most obvious users of the function pointer will be virtual
interfaces, eg, vlan(4). ifqs are good if you need to serialise
access to the thing that transmits packets (like hardware rings on
nics), or mitigate the number of times you do ring processing, but
neither of those things are desirable on vlan interfaces. ideally
vlan could transmit on any cpu without having packets serialised
by it's own ifq before being pushed down to an arbitrary number of
rings on the parent interface. bypassing ifqs means the driver can
push the vlan tag on concurrently and push down to the parent frmo
any cpu.
ok mpi@
no objection from claudio@
|
|
|
|
|
| |
define to IFNET_SLOWTIMO since it is no longer a hz divisor.
OK visa@ bluhm@ kn@
|
|
|
|
|
|
|
|
|
| |
these exist so interfaces that want to do mpsafe work outside the
ifq machinery have a place to allocate and update stats in. the
generic ioctl handling for getting stats to userland knows how to
roll the new per cpu stats into the rest before export.
ok visa@
|
|
|
|
| |
ok claudio@
|
|
|
|
| |
ok claudio@
|
|
|
|
|
|
|
| |
Wireless drivers call if_enqueue() out of the NET_LOCK() so it cannot
be used to serialize bridge(4) states.
Found by stsp@, ok visa@
|
|
|
|
|
|
|
| |
Tested by Hrvoje Popovski who measured a 30% improvement of forwarded
packets in the best case.
ok visa@
|
|
|
|
| |
crosshairs.
|
|
|
|
|
|
| |
so we can let go if_cloners_lock.
OK tb@, claudio@, bluhm@, kn@, henning@
|
|
|
|
|
|
|
|
|
| |
rdomain case leading to locking issues and lots of headscratching. turns out
the only case where if_setrdomain could actually create an rdomain and thus
end up with that pattern is the ioctl path.
make if_setrdomain never create an rdomain, return error if it doesn't exist
already, introduce if_createrdomain, and adjust the ioctl path to use it.
ok sashan bluhm claudio
|
|
|
|
|
| |
nonexistant ones as before. nasty error handling with bluhm, feedback mpi as
well. ok bluhm
|