summaryrefslogtreecommitdiffstats
path: root/sys/net/if.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Push kernel lock within rtable_add(9) and rework it to return 0 in themvs2021-03-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | case when requested table is already exists. Except initialization time, route_output() and if_createrdomain() are the only paths where we call rtable_add(9). We check requested table existence by rtable_exists(9) and it's not the error condition if the table exists. Otherwise we are trying to create requested table by rtable_add(9). Those paths are kernel locked so concurrent thread can't create requested table just after rtable_exists(9) check. Also rtable_add(9) has internal rtable_exists(9) check and in this case the table existence assumed as EEXIST error. This error path is never reached. We are going to unlock PF_ROUTE sockets. This means route_output() will not be serialized with if_createrdomain() and concurrent thread could create requested table. Table existence check and creation should be serialized and it makes sense to do this within rtable_add(9). This time kernel lock is used for this so it pushed down to rtable_add(9). The internal rtable_exists(9) check was modified and table existence is not error now. Since the external rtable_exists(9) check is useless it was removed from if_createrdomain(). It still exists in route_output() path because the logic is more complicated here. ok mpi@
* RFC 8981 allows the configuration of only temporary IPv6 addresses.florian2021-03-201-4/+8
| | | | | Make the interface come up when the IFXF_AUTOCONF6TEMP is set. OK kn
* When changing the link local address send a RTM_IFINFO message out.claudio2021-03-181-2/+4
| | | | | Also prefer if (error == 0) over if (!error). OK florian@ bluhm@
* Do not call rtm_ifchg() if IFF_UP changed. The code in if_up() and if_down()claudio2021-03-181-3/+6
| | | | | already call rtm_ifchg() and so this would just result in a duplicate message. Noticed by deraadt@. OK florian@ bluhm@
* There is no need to try to attach IPv6 to an interface when theflorian2021-03-111-2/+3
| | | | | | | | AUTOCONF6 flag is already set. This is likely a leftover from when we sent router solicitations from the kernel. This was a way to trigger sending a solicitation from userland. OK kn
* If the AUTOCONF4 or AUTOCONF6 flags get enabled, force the interface up.deraadt2021-03-111-23/+34
| | | | ok florian claudio
* spellingjsg2021-03-101-2/+2
| | | | ok gnezdo@ semarie@ mpi@
* Shorten the if_cloners_lock name preventing it from being truncated inanton2021-03-091-2/+2
| | | | | | the top(1) wait column. ok mvs@
* add p2p_input, like ether_input but for l3 tunnel interfaces.dlg2021-02-201-1/+41
| | | | | | | | the l3 protocol input to push the packet is based on a value in m->m_pkthdr.ph_family, which tunnel drivers should set before calling if_vinput. add p2p_bpf_mtap to call bpf_mtap_af also using m->m_pkthdr.ph_family.
* default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.dlg2021-02-201-2/+6
| | | | | call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
* add a MONITOR flag to ifaces to say they're only used for watching packets.dlg2021-02-201-2/+3
| | | | | | | | | | | an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack. ok benno@
* We link `ifp' to `if_list' before we perform if_attachsetup(). It is notmvs2021-02-111-3/+2
| | | | | | | | | fully initialized because we initialize `if_groups' after linking. It's not triggered because if_attach() and if_unit(9) are serialized by kernel lock and `ifp' is often filled by nulls. Move `if_groups' initialization to if_attach_common() to prevent this. ok bluhm@ claudio@ deraadt@
* Interface group names must fit into IFNAMSIZ and be unique. Butbluhm2021-02-101-3/+5
| | | | | | | | | | | | the kernel made the unique check before trunkating with strlcpy(). So there could be two interface groups with the same name. The kif is created by a name lookup. The trunkated names are equal, so there was only one kif owned by both groups. When the groups got destroyed, the single kif was removed twice from the RB tree. Check length of group name before doing the unique check. The empty group name was allowed and is now invalid. Reported-by: syzbot+f47e8296ebd559f9bbff@syzkaller.appspotmail.com OK deraadt@ gnezdo@ anton@ mvs@ claudio@
* Start refcounting interface groups with 1. if_creategroup() returnsbluhm2021-02-081-8/+13
| | | | | | | a new object that is already refcounted, so carp attach does not reach into internal structures. Add kasserts to detect counter overflow or underflow. OK mvs@
* ifunit() was fully replaced by if_unit(9) and should go away.mvs2021-02-011-18/+7
| | | | ok bluhm@ dlg@
* Introduce new function if_unit(9). This function returns a pointer themvs2021-01-181-18/+53
| | | | | | | | | | | interface descriptor corresponding to the unique name. This descriptor is guaranteed to be valid until if_put(9) is called on the returned pointer. if_unit(9) should replace already existent ifunit() which returns descriptor not safe for dereference when context was switched. This allow us to avoid some use-after-free issues in ioctl(2) path. Also this unifies interface descriptor usage. ok claudio@ sashan@
* Syzkaller has found a stack overflow in socket splicing. Broadcastbluhm2021-01-091-2/+5
| | | | | | | | | | | | packets were resent through simplex broadcast delivery and socket splicing. Although there is an M_LOOP check in somove(9), it did not take effect. if_input_local() cleared the M_BCAST and M_MCAST flags with m_resethdr(). As if_input_local() is used for broadcast and multicast delivery, it was a mistake to delete them. Keep the M_BCAST and M_MCAST mbuf flags when packets are reinjected into the network stack. Reported-by: syzbot+a43ace363f1b663238f8@syzkaller.appspotmail.com OK anton@; discussed with claudio@
* Process pppoe(4) packets directly, do not queue through netiskn2021-01-041-7/+1
| | | | | | | | | | Less scheduling, lock contention and queues. Previously, if_netisr() handled the net lock around those calls, now if_input_process() does it before calling ether_input(), so no need to add or remove NET_*LOCK() anywhere. OK mvs claudio
* Remove kernel lock from pppoe(4) input pathkn2021-01-041-3/+1
| | | | | | | | | | | | "struct pppoe_softc" documents no member being protected by the kernel lock (alone); further review of the code paths starting from pppoeintr() shows no sleeping points which must be avoided in the softnet thread. Everything is fine as is to run without the big lock, so remove it. Tests sthen Feedback mpi mvs OK mvs claudio
* clear M_TIMESTAMP in if_enqueue.dlg2020-12-151-1/+3
| | | | | | this is to avoid a timestamp being used on the way out of the stack (eg, in bpf), or if it reenters the stack (eg, if it goes between rdomains with pair(4)).
* Introduce `if_cloners_lock' rwlock and use it to serializemvs2020-10-031-5/+20
| | | | | | | | | | | | | | | | | | | | | if_clone_{create,destroy}(). This fixes the races described below. if_clone_{create,destroy}() are kernel locked, but since they touch various sleep points introduced by rwlocks and M_WAITOK allocations, without serialization they can intersect due to race condition. The avoided races are: 1. While performing if_clone_create(), concurrent thread which performing if_clone_create() can attach `ifp' with the same `if_xname' and made inconsistent `if_list' where all attached interfaces linked. 2. While performing if_clone_create(), concurrent thread which performing if_clone_destroy() can kill this incomplete `ifp'. 3. While performing if_clone_destroy(), concurrent thread which performing if_clone_destroy() can kill this dying `ifp'. ok claudio@ kn@ mpi@ sashan@
* Allow SIOCSWGDPID and SIOCSWGMAXFLOW ioctls for non-rootkn2020-08-191-3/+1
| | | | | | | | | | | | | | | | | ifconfig(8) detects switch(4) through its unique SIOCSWGDPID ioctl(2) and further does another switch specific ioctl for the default output regardless of configuration and/or members. But since these two ioctls are limited to root, running ifconfig as unprivileged user makes switch interfaces partially appear as bridge devices because the detection fails, e.g. STP parameters are shown instead of datapath id and flow parameters. ifioctl() limits a list of set/write ioctls to root, but these two read-only ioctls seem to have been listed by mistake, so remove them to omit the root check and fix "ifconfig switch" output for unprivileged users. Feedback from dlg
* Revert r1.614 of net/if.c. Modifications of `ifnet' list must still bemvs2020-08-051-6/+3
| | | | done under both the KERNEL_LOCK() and NET_LOCK().
* We have `pipexinq' and `pipexoutq' mbuf(9) queues to store pipex(4)mvs2020-08-041-8/+1
| | | | | | | | | | | | | related mbufs. Each mbuf(9) passed to these queues stores the pointer to corresponding pipex(4) session referenced as `m_pkthdr.ph_cookie'. When session was destroyed its reference can still be in these queues so we have use after free issue while pipexintr() dereference it. I removed `pipexinq', `pipexoutq' and pipexintr(). This not only allows us to avoid issue described above, but also removes unnecessary context switch in packet processing. Also it makes code simpler. ok mpi@ yasuoka@
* Use interface index instead of pointer to `ifnet' in carp(4).mvs2020-07-241-3/+4
| | | | ok sashan@
* deprecate interface input handler lists, just use one input function.dlg2020-07-221-108/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | the interface input handler lists were originally set up to help us during the intial mpsafe network stack work. at the time not all the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc) were mpsafe, so we wanted a way to avoid them by default, and only take the kernel lock hit when they were specifically enabled on the interface. since then, they have been fixed up to be mpsafe. i could leave the list in place, but it has some semantic problems. because virtual interfaces filter packets based on the order they were attached to the parent interface, you can get packets taken away in surprising ways, especially when you reboot and netstart does something different to what you did by hand. by hardcoding the order that things like vlan and bridge get to look at packets, we can document the behaviour and get consistency. it also means we can get rid of a use of SRPs which were difficult to replace with SMRs. the interface input handler list is an SRPL, which we would like to deprecate. it turns out that you can sleep during stack processing, which you're not supposed to do with SRPs or SMRs, but SRPs are a lot more forgiving and it worked. lastly, it turns out that this code is faster than the input list handling, so lots of winning all around. special thanks to hrvoje popovski and aaron bieber for testing. this has been in snaps as part of a larger diff for over a week.
* Move insertions to `if_list' out of NET_LOCK() because KERNEL_LOCK()mvs2020-07-201-3/+6
| | | | | | | | | | | | protects this list. Also corresponding assertion added to be sure the required lock was held. This is the step to clean locking mess around `if_list'. Also we are going to protect `if_list' by it's own lock and this will allow us to avoid lock order issues in future. ok dlg@
* Check destruction ability before search instance of clone interface.mvs2020-07-171-4/+4
| | | | ok mpi@
* Change users of IFQ_PURGE() to use the "new" API.patrick2020-07-101-2/+2
| | | | ok dlg@ tobhe@
* Add size to free(9) callkn2020-06-301-2/+2
| | | | | Size taken from if_creategroup(); OK mvs
* The interface if_ioctl routine must be called with the NET_LOCK() held.claudio2020-06-221-8/+1
| | | | | | | | | | | | For example the bridge_ioctl() function calls NET_UNLOCK() unconditionally and so calling if_ioctl() without netlock will trigger an assert because of not holding the netlock. Make sure the ioctl handlers are called with the netlock held and drop the lock for the wg(4) specific ioctls in the wg_ioctl handler. This fixes a panic in bridge_ioctl() triggered by ifconfig(8) issuing a SIOCGWG ioctl against bridge(4). This is just a workaround this needs more cleanup but at least this way the panic can not be triggered anymore. OK stsp@, tested by semarie@
* deprecate network livelock detection using the softclock.dlg2020-06-221-38/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | livelock detection used to rely on code running at softnet blocking the softclock handling at a lower interrupt priority level. if the hard clock interrupt count diverged from one kept by a timeout, we assumed the network stack was doing too much work and we should apply backpressure to the receptions of packets. the network stack doesnt really block timeouts from firing anymore though. this is especially true on MP systems, because timeouts fire on cpu0 and the nettq thread could be somewhere else entirely. this means network activity doesn't make the softclock lose ticks, which means we aren't scaling rx ring activity like we think we are. the alternative way to detect livelock is when a driver queues packets for the stack to process, if there's too many packets built up then the input routine return value tells the driver to slow down. this enables finer grained livelock detection too. the rx ring accounting is done per rx ring, and each rx ring is tied to a specific nettq. if one of them is going too fast it shouldn't affect the others. the tick based detection was done system wide and punished all the drivers. ive converted all the drivers to the new mechanism. let's see how we go with it. jmatthew@ confirms rings still shrink, so some backpressure is being applied.
* add wg(4), an in kernel driver for WireGuard vpn communication.dlg2020-06-211-1/+9
| | | | | | | | | | | thanks to Matt Dunwoodie and Jason A. Donenfeld for their effort. it's at least as functional as the go implementation, and maybe more so since this one works on more architectures. i'm sure there's further development that can be done, but you can say that about anything and everything that's in the tree. ok deraadt@
* make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.dlg2020-06-171-3/+3
| | | | | i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
* dev/rndvar.h no longer has statistical interfaces (removed during variousderaadt2020-05-291-3/+1
| | | | | | conversion steps). it only contains kernel prototypes for 4 interfaces, all of which legitimately belong in sys/systm.h, which are already included by all enqueue_randomness() users.
* Document the various flavors of NET_LOCK() and rename the reader version.mpi2020-05-271-12/+12
| | | | | | | | | | Since our last concurrency mistake only ioctl(2) ans sysctl(2) code path take the reader lock. This is mostly for documentation purpose as long as the softnet thread is converted back to use a read lock. dlg@ said that comments should be good enough. ok sashan@
* Extract more randomness from mbuf flowvisa2020-05-261-2/+3
| | | | | | | | | | Input bits of the mbuf list head to enqueue_randomness(). While the set of mbufs in circulation is relatively stable, the order in which they reach if_input_process() is unpredictable. Shuffling can happen in many subsystems, such as the network stack, device drivers, and memory management. OK deraadt@ mpi@
* make ifpromisc assert that the caller is holding the NET_LOCK.dlg2020-04-121-1/+3
| | | | | | it needs NET_LOCK because it modifies if_flags and if_pcount. ok visa@
* Grab the exclusive NET_LOCK() in the softnet thread.mpi2020-04-111-7/+7
| | | | | | | | | Prevent a data corruption on a UDP receive socket buffer reported by procter@ who triggered it with wireguard-go. The symptoms are underflow of sb_cc/sb_datacc/sb_mcnt. ok visa@
* Properly exit loop at end of hooks TAILQ.tobhe2020-03-101-5/+4
| | | | | Feedback from and ok dlg@ ok kn@ todd@
* cleanup unused headers generated by configjsg2020-01-241-2/+1
| | | | ok tedu@ krw@ deraadt@
* avoid a use after free in if_delgroup.dlg2019-11-141-5/+5
| | | | | | | | | | | if_detach passes the groupname from an ifg_list struct to if_delgroup, if_delgroup then uses the name to find the same ifg_list struct so it can free it, and then passes the name from the struct to pfi_group_change(). at worst this can cause a fault if malloc(9) actually unmaps the page the struct was on, and at best it causes pf interfaces with garbage names to be created. ok sashan@ bluhm@
* Non root user must not use ioctl(2) to mess around with the addressbluhm2019-11-131-5/+24
| | | | | of a network interface. OK deraadt@ claudio@
* unbreak ramdisksderaadt2019-11-131-1/+3
|
* check for privileged bridges ioctls next to the other privileged ioctls.dlg2019-11-121-1/+26
| | | | | | | | | there's now a bunch of drivers that implement the bridge ioctls, but they're inconsistent at checking privilege. doing it up front once means less code duplication, and more consistent application of the checks. ok bluhm@ deraadt@
* SIOCDVNETID mutates state, so should only be run by root.dlg2019-11-121-1/+2
| | | | | found by Ilja Van Sprundel ok deraadt@ mpi@ bluhm@
* add linkstate hooks at the head of the tailq.dlg2019-11-111-2/+2
| | | | | | | | | | | | | | | | | when vxlans parent interface has a link state change event, vxlan reconfigures the parent to cope with things not being as it expects when the interface comes back. it does this by removing its config and then adding it again. part of it's config removal is to take the link state hook away, and part of putting the config on is is adding the link state hook. if we're running an interfaces link state hooks from head to tail, and the vxlan hook adds itself back to the tail, we end up running the vxlan hook forever cos it always ends up at the tail. bluhm@ hit this infinite loop while running regress tests. if turns out we need to run link state hooks in the same order they were added, i have a way to avoid this situation, but this is simple.
* whitespace fixes, no functional change.dlg2019-11-091-6/+6
|
* convert interface address change hooks to tasks and a task_list.dlg2019-11-081-14/+33
| | | | | | | | | | | | | | | this follows what's been done for detach and link state hooks, and makes handling of hooks generally more robust. address hooks are a bit different to detach/link state hooks in that there's only a few things that register hooks (carp, pf, vxlan), but a lot of places to run the hooks (lots of ipv4 and ipv6 address configuration). an address hook cookie was in struct pfi_kif, which is part of the pf abi. rather than break pfctl -sI, this maintains the void * used for the cookie and uses it to store a task, which is then used as intended with the new api.
* serialise hook adds and dels with a mutex instead of an implicit NET_LOCK.dlg2019-11-071-13/+46
| | | | | | | | | | | | | | | | | | | | i had NET_ASSERT_LOCKED() in the hook add and remove operations, because that's what's held when the hooks are run. some callers do not hold the NET_LOCK when calling them though, eg, bridge(4). aggr and tpmr used to not hold NET_LOCK while being destroyed, which also caused the asserts to fire, so i moved the port destroys inside NET_LOCK, but now I have deadlocks with some barrier calls. the hooks having their own lock means callers don't have to hold NET_LOCK and the list will stay sane. the code that runs the hooks gives up the mutex when calling the hook, but keeps track of where it's up to bey putting a cursor in the list. there's a single global mutex for all the interface linkstate and detach hooks, but this stuff isn't a hot path by any stretch of the imagination. based on (a lot of) testing by hrvoje popovski. thank you.