| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
case when requested table is already exists.
Except initialization time, route_output() and if_createrdomain() are the
only paths where we call rtable_add(9). We check requested table existence
by rtable_exists(9) and it's not the error condition if the table exists.
Otherwise we are trying to create requested table by rtable_add(9). Those
paths are kernel locked so concurrent thread can't create requested table
just after rtable_exists(9) check. Also rtable_add(9) has internal
rtable_exists(9) check and in this case the table existence assumed as
EEXIST error. This error path is never reached.
We are going to unlock PF_ROUTE sockets. This means route_output() will
not be serialized with if_createrdomain() and concurrent thread could
create requested table. Table existence check and creation should be
serialized and it makes sense to do this within rtable_add(9). This time
kernel lock is used for this so it pushed down to rtable_add(9). The
internal rtable_exists(9) check was modified and table existence is not
error now.
Since the external rtable_exists(9) check is useless it was removed from
if_createrdomain(). It still exists in route_output() path because the
logic is more complicated here.
ok mpi@
|
|
|
|
|
| |
Make the interface come up when the IFXF_AUTOCONF6TEMP is set.
OK kn
|
|
|
|
|
| |
Also prefer if (error == 0) over if (!error).
OK florian@ bluhm@
|
|
|
|
|
| |
already call rtm_ifchg() and so this would just result in a duplicate message.
Noticed by deraadt@. OK florian@ bluhm@
|
|
|
|
|
|
|
|
| |
AUTOCONF6 flag is already set.
This is likely a leftover from when we sent router solicitations from
the kernel. This was a way to trigger sending a solicitation from
userland.
OK kn
|
|
|
|
| |
ok florian claudio
|
|
|
|
| |
ok gnezdo@ semarie@ mpi@
|
|
|
|
|
|
| |
the top(1) wait column.
ok mvs@
|
|
|
|
|
|
|
|
| |
the l3 protocol input to push the packet is based on a value in
m->m_pkthdr.ph_family, which tunnel drivers should set before calling
if_vinput.
add p2p_bpf_mtap to call bpf_mtap_af also using m->m_pkthdr.ph_family.
|
|
|
|
|
| |
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input
and if_vinput.
|
|
|
|
|
|
|
|
|
|
|
| |
an example use of this is when you have a span port on a switch and
you want to be able to see the packets coming out of it with tcpdump,
but do not want these packets to enter the network stack for
processing. this is particularly important if the span port is
pushing a copy of any packets related to the machine doing the
monitoring as it will confuse pf states and the stack.
ok benno@
|
|
|
|
|
|
|
|
|
| |
fully initialized because we initialize `if_groups' after linking. It's
not triggered because if_attach() and if_unit(9) are serialized by
kernel lock and `ifp' is often filled by nulls. Move `if_groups'
initialization to if_attach_common() to prevent this.
ok bluhm@ claudio@ deraadt@
|
|
|
|
|
|
|
|
|
|
|
|
| |
the kernel made the unique check before trunkating with strlcpy().
So there could be two interface groups with the same name. The kif
is created by a name lookup. The trunkated names are equal, so
there was only one kif owned by both groups. When the groups got
destroyed, the single kif was removed twice from the RB tree.
Check length of group name before doing the unique check.
The empty group name was allowed and is now invalid.
Reported-by: syzbot+f47e8296ebd559f9bbff@syzkaller.appspotmail.com
OK deraadt@ gnezdo@ anton@ mvs@ claudio@
|
|
|
|
|
|
|
| |
a new object that is already refcounted, so carp attach does not
reach into internal structures. Add kasserts to detect counter
overflow or underflow.
OK mvs@
|
|
|
|
| |
ok bluhm@ dlg@
|
|
|
|
|
|
|
|
|
|
|
| |
interface descriptor corresponding to the unique name. This descriptor
is guaranteed to be valid until if_put(9) is called on the returned
pointer. if_unit(9) should replace already existent ifunit() which
returns descriptor not safe for dereference when context was switched.
This allow us to avoid some use-after-free issues in ioctl(2) path.
Also this unifies interface descriptor usage.
ok claudio@ sashan@
|
|
|
|
|
|
|
|
|
|
|
|
| |
packets were resent through simplex broadcast delivery and socket
splicing. Although there is an M_LOOP check in somove(9), it did
not take effect. if_input_local() cleared the M_BCAST and M_MCAST
flags with m_resethdr().
As if_input_local() is used for broadcast and multicast delivery,
it was a mistake to delete them. Keep the M_BCAST and M_MCAST mbuf
flags when packets are reinjected into the network stack.
Reported-by: syzbot+a43ace363f1b663238f8@syzkaller.appspotmail.com
OK anton@; discussed with claudio@
|
|
|
|
|
|
|
|
|
|
| |
Less scheduling, lock contention and queues.
Previously, if_netisr() handled the net lock around those calls, now
if_input_process() does it before calling ether_input(), so no need to add
or remove NET_*LOCK() anywhere.
OK mvs claudio
|
|
|
|
|
|
|
|
|
|
|
|
| |
"struct pppoe_softc" documents no member being protected by the kernel lock
(alone); further review of the code paths starting from pppoeintr() shows
no sleeping points which must be avoided in the softnet thread.
Everything is fine as is to run without the big lock, so remove it.
Tests sthen
Feedback mpi mvs
OK mvs claudio
|
|
|
|
|
|
| |
this is to avoid a timestamp being used on the way out of the stack
(eg, in bpf), or if it reenters the stack (eg, if it goes between
rdomains with pair(4)).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if_clone_{create,destroy}(). This fixes the races described below.
if_clone_{create,destroy}() are kernel locked, but since they touch
various sleep points introduced by rwlocks and M_WAITOK allocations,
without serialization they can intersect due to race condition.
The avoided races are:
1. While performing if_clone_create(), concurrent thread which performing
if_clone_create() can attach `ifp' with the same `if_xname' and made
inconsistent `if_list' where all attached interfaces linked.
2. While performing if_clone_create(), concurrent thread which performing
if_clone_destroy() can kill this incomplete `ifp'.
3. While performing if_clone_destroy(), concurrent thread which performing
if_clone_destroy() can kill this dying `ifp'.
ok claudio@ kn@ mpi@ sashan@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ifconfig(8) detects switch(4) through its unique SIOCSWGDPID ioctl(2) and
further does another switch specific ioctl for the default output regardless
of configuration and/or members.
But since these two ioctls are limited to root, running ifconfig as
unprivileged user makes switch interfaces partially appear as bridge devices
because the detection fails, e.g. STP parameters are shown instead of
datapath id and flow parameters.
ifioctl() limits a list of set/write ioctls to root, but these two read-only
ioctls seem to have been listed by mistake, so remove them to omit the root
check and fix "ifconfig switch" output for unprivileged users.
Feedback from dlg
|
|
|
|
| |
done under both the KERNEL_LOCK() and NET_LOCK().
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
related mbufs. Each mbuf(9) passed to these queues stores the pointer to
corresponding pipex(4) session referenced as `m_pkthdr.ph_cookie'. When
session was destroyed its reference can still be in these queues so we
have use after free issue while pipexintr() dereference it.
I removed `pipexinq', `pipexoutq' and pipexintr(). This not only allows
us to avoid issue described above, but also removes unnecessary context
switch in packet processing. Also it makes code simpler.
ok mpi@ yasuoka@
|
|
|
|
| |
ok sashan@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the interface input handler lists were originally set up to help
us during the intial mpsafe network stack work. at the time not all
the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc)
were mpsafe, so we wanted a way to avoid them by default, and only
take the kernel lock hit when they were specifically enabled on the
interface. since then, they have been fixed up to be mpsafe.
i could leave the list in place, but it has some semantic problems.
because virtual interfaces filter packets based on the order they
were attached to the parent interface, you can get packets taken
away in surprising ways, especially when you reboot and netstart
does something different to what you did by hand. by hardcoding the
order that things like vlan and bridge get to look at packets, we
can document the behaviour and get consistency.
it also means we can get rid of a use of SRPs which were difficult
to replace with SMRs. the interface input handler list is an SRPL,
which we would like to deprecate. it turns out that you can sleep
during stack processing, which you're not supposed to do with SRPs
or SMRs, but SRPs are a lot more forgiving and it worked.
lastly, it turns out that this code is faster than the input list
handling, so lots of winning all around.
special thanks to hrvoje popovski and aaron bieber for testing.
this has been in snaps as part of a larger diff for over a week.
|
|
|
|
|
|
|
|
|
|
|
|
| |
protects this list. Also corresponding assertion added to be sure the
required lock was held.
This is the step to clean locking mess around `if_list'.
Also we are going to protect `if_list' by it's own lock and this will
allow us to avoid lock order issues in future.
ok dlg@
|
|
|
|
| |
ok mpi@
|
|
|
|
| |
ok dlg@ tobhe@
|
|
|
|
|
| |
Size taken from if_creategroup();
OK mvs
|
|
|
|
|
|
|
|
|
|
|
|
| |
For example the bridge_ioctl() function calls NET_UNLOCK() unconditionally
and so calling if_ioctl() without netlock will trigger an assert because
of not holding the netlock. Make sure the ioctl handlers are called with
the netlock held and drop the lock for the wg(4) specific ioctls in the
wg_ioctl handler. This fixes a panic in bridge_ioctl() triggered by
ifconfig(8) issuing a SIOCGWG ioctl against bridge(4).
This is just a workaround this needs more cleanup but at least this way
the panic can not be triggered anymore.
OK stsp@, tested by semarie@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
livelock detection used to rely on code running at softnet blocking
the softclock handling at a lower interrupt priority level. if the
hard clock interrupt count diverged from one kept by a timeout, we
assumed the network stack was doing too much work and we should
apply backpressure to the receptions of packets.
the network stack doesnt really block timeouts from firing anymore
though. this is especially true on MP systems, because timeouts
fire on cpu0 and the nettq thread could be somewhere else entirely.
this means network activity doesn't make the softclock lose ticks,
which means we aren't scaling rx ring activity like we think we
are.
the alternative way to detect livelock is when a driver queues
packets for the stack to process, if there's too many packets built
up then the input routine return value tells the driver to slow
down. this enables finer grained livelock detection too. the rx
ring accounting is done per rx ring, and each rx ring is tied to a
specific nettq. if one of them is going too fast it shouldn't affect
the others. the tick based detection was done system wide and
punished all the drivers.
ive converted all the drivers to the new mechanism. let's see how
we go with it.
jmatthew@ confirms rings still shrink, so some backpressure is being
applied.
|
|
|
|
|
|
|
|
|
|
|
| |
thanks to Matt Dunwoodie and Jason A. Donenfeld for their effort.
it's at least as functional as the go implementation, and maybe
more so since this one works on more architectures.
i'm sure there's further development that can be done, but you can
say that about anything and everything that's in the tree.
ok deraadt@
|
|
|
|
|
| |
i've been wanting to do this for a while, and now that we've got
stoeplitz and it gives us 16 bits, it seems like the right time.
|
|
|
|
|
|
| |
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which are already included
by all enqueue_randomness() users.
|
|
|
|
|
|
|
|
|
|
| |
Since our last concurrency mistake only ioctl(2) ans sysctl(2) code path
take the reader lock. This is mostly for documentation purpose as long as
the softnet thread is converted back to use a read lock.
dlg@ said that comments should be good enough.
ok sashan@
|
|
|
|
|
|
|
|
|
|
| |
Input bits of the mbuf list head to enqueue_randomness(). While the set
of mbufs in circulation is relatively stable, the order in which they
reach if_input_process() is unpredictable. Shuffling can happen in many
subsystems, such as the network stack, device drivers, and memory
management.
OK deraadt@ mpi@
|
|
|
|
|
|
| |
it needs NET_LOCK because it modifies if_flags and if_pcount.
ok visa@
|
|
|
|
|
|
|
|
|
| |
Prevent a data corruption on a UDP receive socket buffer reported by
procter@ who triggered it with wireguard-go.
The symptoms are underflow of sb_cc/sb_datacc/sb_mcnt.
ok visa@
|
|
|
|
|
| |
Feedback from and ok dlg@
ok kn@ todd@
|
|
|
|
| |
ok tedu@ krw@ deraadt@
|
|
|
|
|
|
|
|
|
|
|
| |
if_detach passes the groupname from an ifg_list struct to if_delgroup,
if_delgroup then uses the name to find the same ifg_list struct so
it can free it, and then passes the name from the struct to
pfi_group_change(). at worst this can cause a fault if malloc(9)
actually unmaps the page the struct was on, and at best it causes
pf interfaces with garbage names to be created.
ok sashan@ bluhm@
|
|
|
|
|
| |
of a network interface.
OK deraadt@ claudio@
|
| |
|
|
|
|
|
|
|
|
|
| |
there's now a bunch of drivers that implement the bridge ioctls,
but they're inconsistent at checking privilege. doing it up front
once means less code duplication, and more consistent application
of the checks.
ok bluhm@ deraadt@
|
|
|
|
|
| |
found by Ilja Van Sprundel
ok deraadt@ mpi@ bluhm@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
when vxlans parent interface has a link state change event, vxlan
reconfigures the parent to cope with things not being as it expects
when the interface comes back. it does this by removing its config
and then adding it again. part of it's config removal is to take
the link state hook away, and part of putting the config on is is
adding the link state hook.
if we're running an interfaces link state hooks from head to tail,
and the vxlan hook adds itself back to the tail, we end up running
the vxlan hook forever cos it always ends up at the tail.
bluhm@ hit this infinite loop while running regress tests. if turns
out we need to run link state hooks in the same order they were
added, i have a way to avoid this situation, but this is simple.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
this follows what's been done for detach and link state hooks, and
makes handling of hooks generally more robust.
address hooks are a bit different to detach/link state hooks in
that there's only a few things that register hooks (carp, pf, vxlan),
but a lot of places to run the hooks (lots of ipv4 and ipv6 address
configuration).
an address hook cookie was in struct pfi_kif, which is part of the
pf abi. rather than break pfctl -sI, this maintains the void * used
for the cookie and uses it to store a task, which is then used as
intended with the new api.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
i had NET_ASSERT_LOCKED() in the hook add and remove operations,
because that's what's held when the hooks are run. some callers do
not hold the NET_LOCK when calling them though, eg, bridge(4). aggr
and tpmr used to not hold NET_LOCK while being destroyed, which
also caused the asserts to fire, so i moved the port destroys inside
NET_LOCK, but now I have deadlocks with some barrier calls.
the hooks having their own lock means callers don't have to hold
NET_LOCK and the list will stay sane. the code that runs the hooks
gives up the mutex when calling the hook, but keeps track of where
it's up to bey putting a cursor in the list.
there's a single global mutex for all the interface linkstate and
detach hooks, but this stuff isn't a hot path by any stretch of the
imagination.
based on (a lot of) testing by hrvoje popovski. thank you.
|