path: root/sys/net/if.c

Commit log (newest first); each entry shows the commit summary, author,
date, number of files changed, and lines removed/added.

* The interface if_ioctl routine must be called with the NET_LOCK() held.
  [claudio 2020-06-22, 1 file, -8/+1]

  For example the bridge_ioctl() function calls NET_UNLOCK() unconditionally,
  so calling if_ioctl() without the netlock will trigger an assert for not
  holding the netlock. Make sure the ioctl handlers are called with the
  netlock held, and drop the lock for the wg(4) specific ioctls in the
  wg_ioctl handler. This fixes a panic in bridge_ioctl() triggered by
  ifconfig(8) issuing a SIOCGWG ioctl against bridge(4). This is just a
  workaround; it needs more cleanup, but at least this way the panic can no
  longer be triggered.

  OK stsp@, tested by semarie@

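  A minimal sketch of that locking pattern, assuming a hypothetical
  wg_ioctl_wg() helper standing in for the wg(4) internals (NET_LOCK(),
  NET_UNLOCK(), NET_ASSERT_LOCKED() and the SIOC*WG ioctls are real
  kernel interfaces):

      /* assumes <sys/param.h>, <sys/systm.h>, <sys/sockio.h>,
       * <net/if.h> and <net/if_var.h> */

      int wg_ioctl_wg(struct wg_softc *, u_long, caddr_t); /* hypothetical */

      int
      wg_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
      {
              struct wg_softc *sc = ifp->if_softc;
              int error = 0;

              NET_ASSERT_LOCKED(); /* if_ioctl now guarantees this */

              switch (cmd) {
              case SIOCSWG:
              case SIOCGWG:
                      /* long-running; does its own locking internally */
                      NET_UNLOCK();
                      error = wg_ioctl_wg(sc, cmd, data);
                      NET_LOCK();
                      break;
              default:
                      error = ENOTTY; /* let common code handle the rest */
                      break;
              }

              return (error);
      }
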
* deprecate network livelock detection using the softclock.
  [dlg 2020-06-22, 1 file, -38/+2]

  livelock detection used to rely on code running at softnet blocking the
  softclock handling at a lower interrupt priority level. if the hard clock
  interrupt count diverged from one kept by a timeout, we assumed the
  network stack was doing too much work and we should apply backpressure to
  the reception of packets.

  the network stack doesn't really block timeouts from firing anymore
  though. this is especially true on MP systems, because timeouts fire on
  cpu0 and the nettq thread could be somewhere else entirely. this means
  network activity doesn't make the softclock lose ticks, which means we
  aren't scaling rx ring activity like we think we are.

  the alternative way to detect livelock is when a driver queues packets
  for the stack to process: if too many packets have built up, the input
  routine's return value tells the driver to slow down. this enables finer
  grained livelock detection too. the rx ring accounting is done per rx
  ring, and each rx ring is tied to a specific nettq. if one of them is
  going too fast it shouldn't affect the others. the tick based detection
  was done system wide and punished all the drivers.

  i've converted all the drivers to the new mechanism. let's see how we go
  with it. jmatthew@ confirms rings still shrink, so some backpressure is
  being applied.

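  In a driver's rx interrupt path, the per-ring mechanism looks roughly
  like this sketch; the mydrv names and softc layout are assumptions,
  while ifiq_input(), if_rxr_livelocked() and the mbuf_list API are real:

      /* assumes <sys/param.h>, <sys/mbuf.h>, <net/if.h>,
       * <net/if_var.h> and <netinet/if_ether.h> */

      struct mydrv_softc {
              struct arpcom    sc_ac;     /* ethernet common data */
              struct if_rxring sc_rxring; /* rx slot accounting */
      };

      struct mbuf *mydrv_rx_dequeue(struct mydrv_softc *); /* hypothetical */
      void         mydrv_rx_refill(struct mydrv_softc *);  /* hypothetical */

      static int
      mydrv_rxeof(struct mydrv_softc *sc)
      {
              struct ifnet *ifp = &sc->sc_ac.ac_if;
              struct mbuf_list ml = MBUF_LIST_INITIALIZER();
              struct mbuf *m;

              while ((m = mydrv_rx_dequeue(sc)) != NULL)
                      ml_enqueue(&ml, m);

              if (ml_empty(&ml))
                      return (0);

              if (ifiq_input(&ifp->if_rcv, &ml)) {
                      /* the stack is backed up: shrink this ring so
                       * the hardware drops instead of the stack */
                      if_rxr_livelocked(&sc->sc_rxring);
              }

              mydrv_rx_refill(sc); /* refill within the shrunk ring */
              return (1);
      }
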
* add wg(4), an in kernel driver for WireGuard vpn communication.
  [dlg 2020-06-21, 1 file, -1/+9]

  thanks to Matt Dunwoodie and Jason A. Donenfeld for their effort. it's at
  least as functional as the go implementation, and maybe more so since
  this one works on more architectures. i'm sure there's further
  development that can be done, but you can say that about anything and
  everything that's in the tree.

  ok deraadt@

* make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
  [dlg 2020-06-17, 1 file, -3/+3]

  i've been wanting to do this for a while, and now that we've got
  stoeplitz and it gives us 16 bits, it seems like the right time.

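  The resulting convention, as a small sketch (the pkt_* helper names are
  illustrative; M_FLOWID, csum_flags and ph_flowid are the real mbuf
  fields):

      /* assumes <sys/param.h> and <sys/mbuf.h> */

      /* validity lives in csum_flags, the 16-bit id in ph_flowid */
      static inline void
      pkt_set_flowid(struct mbuf *m, uint16_t flowid)
      {
              SET(m->m_pkthdr.csum_flags, M_FLOWID);
              m->m_pkthdr.ph_flowid = flowid;
      }

      static inline int
      pkt_get_flowid(struct mbuf *m, uint16_t *flowid)
      {
              if (!ISSET(m->m_pkthdr.csum_flags, M_FLOWID))
                      return (0); /* no flowid assigned */
              *flowid = m->m_pkthdr.ph_flowid;
              return (1);
      }
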
* dev/rndvar.h no longer has statistical interfaces (removed during various
  conversion steps).
  [deraadt 2020-05-29, 1 file, -3/+1]

  it only contains kernel prototypes for 4 interfaces, all of which
  legitimately belong in sys/systm.h, which is already included by all
  enqueue_randomness() users.

* Document the various flavors of NET_LOCK() and rename the reader version.
  [mpi 2020-05-27, 1 file, -12/+12]

  Since our last concurrency mistake, only the ioctl(2) and sysctl(2) code
  paths take the reader lock. This is mostly for documentation purposes as
  long as the softnet thread is converted back to use a read lock. dlg@
  said that comments should be good enough.

  ok sashan@

* Extract more randomness from mbuf flow
  [visa 2020-05-26, 1 file, -2/+3]

  Input bits of the mbuf list head to enqueue_randomness(). While the set
  of mbufs in circulation is relatively stable, the order in which they
  reach if_input_process() is unpredictable. Shuffling can happen in many
  subsystems, such as the network stack, device drivers, and memory
  management.

  OK deraadt@ mpi@

* make ifpromisc assert that the caller is holding the NET_LOCK.
  [dlg 2020-04-12, 1 file, -1/+3]

  it needs NET_LOCK because it modifies if_flags and if_pcount.

  ok visa@

* Grab the exclusive NET_LOCK() in the softnet thread.
  [mpi 2020-04-11, 1 file, -7/+7]

  Prevent a data corruption on a UDP receive socket buffer reported by
  procter@ who triggered it with wireguard-go. The symptoms are underflows
  of sb_cc/sb_datacc/sb_mcnt.

  ok visa@

* Properly exit loop at end of hooks TAILQ.
  [tobhe 2020-03-10, 1 file, -5/+4]

  Feedback from and ok dlg@, ok kn@ todd@

* cleanup unused headers generated by config
  [jsg 2020-01-24, 1 file, -2/+1]

  ok tedu@ krw@ deraadt@

* avoid a use after free in if_delgroup.
  [dlg 2019-11-14, 1 file, -5/+5]

  if_detach passes the groupname from an ifg_list struct to if_delgroup,
  if_delgroup then uses the name to find the same ifg_list struct so it can
  free it, and then passes the name from the freed struct to
  pfi_group_change(). at worst this can cause a fault if malloc(9) actually
  unmaps the page the struct was on, and at best it causes pf interfaces
  with garbage names to be created.

  ok sashan@ bluhm@

* Non-root users must not use ioctl(2) to mess around with the address of a
  network interface.
  [bluhm 2019-11-13, 1 file, -5/+24]

  OK deraadt@ claudio@

* unbreak ramdisks
  [deraadt 2019-11-13, 1 file, -1/+3]

* check for privileged bridge ioctls next to the other privileged ioctls.
  [dlg 2019-11-12, 1 file, -1/+26]

  there's now a bunch of drivers that implement the bridge ioctls, but
  they're inconsistent at checking privilege. doing it up front once means
  less code duplication, and more consistent application of the checks.

  ok bluhm@ deraadt@

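  The shape of the up-front check, sketched with a hypothetical function
  name and an abridged case list (suser() and the SIOCBRDG* ioctls are
  real):

      /* assumes <sys/param.h>, <sys/systm.h> and <sys/sockio.h> */

      /* gate mutating bridge ioctls on root before driver dispatch */
      int
      ifioctl_bridge_priv_check(struct proc *p, u_long cmd)
      {
              switch (cmd) {
              case SIOCBRDGADD:       /* abridged list */
              case SIOCBRDGDEL:
              case SIOCBRDGSIFFLGS:
                      return (suser(p));
              default:
                      return (0);     /* gets stay unprivileged */
              }
      }
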
* SIOCDVNETID mutates state, so should only be run by root.
  [dlg 2019-11-12, 1 file, -1/+2]

  found by Ilja Van Sprundel

  ok deraadt@ mpi@ bluhm@

* add linkstate hooks at the head of the tailq.
  [dlg 2019-11-11, 1 file, -2/+2]

  when vxlan's parent interface has a link state change event, vxlan
  reconfigures the parent to cope with things not being as it expects when
  the interface comes back. it does this by removing its config and then
  adding it again. part of its config removal is to take the link state
  hook away, and part of putting the config back on is adding the link
  state hook.

  if we're running an interface's link state hooks from head to tail, and
  the vxlan hook adds itself back at the tail, we end up running the vxlan
  hook forever because it always ends up at the tail. bluhm@ hit this
  infinite loop while running regress tests.

  if it turns out we need to run link state hooks in the same order they
  were added, i have a way to avoid this situation, but this is simple.

* whitespace fixes, no functional change.
  [dlg 2019-11-09, 1 file, -6/+6]

* convert interface address change hooks to tasks and a task_list.
  [dlg 2019-11-08, 1 file, -14/+33]

  this follows what's been done for detach and link state hooks, and makes
  handling of hooks generally more robust.

  address hooks are a bit different to detach/link state hooks in that
  there's only a few things that register hooks (carp, pf, vxlan), but a
  lot of places to run the hooks (lots of ipv4 and ipv6 address
  configuration).

  an address hook cookie was in struct pfi_kif, which is part of the pf
  abi. rather than break pfctl -sI, this maintains the void * used for the
  cookie and uses it to store a task, which is then used as intended with
  the new api.

* serialise hook adds and dels with a mutex instead of an implicit NET_LOCK.
  [dlg 2019-11-07, 1 file, -13/+46]

  i had NET_ASSERT_LOCKED() in the hook add and remove operations, because
  that's what's held when the hooks are run. some callers do not hold the
  NET_LOCK when calling them though, eg, bridge(4). aggr and tpmr used to
  not hold NET_LOCK while being destroyed, which also caused the asserts to
  fire, so i moved the port destroys inside NET_LOCK, but now i have
  deadlocks with some barrier calls.

  the hooks having their own lock means callers don't have to hold NET_LOCK
  and the list will stay sane. the code that runs the hooks gives up the
  mutex when calling the hook, but keeps track of where it's up to by
  putting a cursor in the list. there's a single global mutex for all the
  interface linkstate and detach hooks, but this stuff isn't a hot path by
  any stretch of the imagination.

  based on (a lot of) testing by hrvoje popovski. thank you.

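  A sketch of the cursor technique (hooks_run and hooks_mtx are
  illustrative names; the real code is if_hooks_run() in if.c, with
  struct task and the TAILQ macros coming from sys/task.h and
  sys/queue.h):

      /* assumes <sys/param.h>, <sys/mutex.h> and <sys/task.h> */

      struct mutex hooks_mtx = MUTEX_INITIALIZER(IPL_NONE);

      /*
       * Run every hook, dropping the mutex around each call. A cursor
       * task with a NULL function is parked after the hook being run,
       * so iteration survives adds and removes that happen while the
       * mutex is released.
       */
      void
      hooks_run(struct task_list *hooks)
      {
              struct task cursor = { .t_func = NULL }; /* marker only */
              struct task *t;
              void (*func)(void *);
              void *arg;

              mtx_enter(&hooks_mtx);
              t = TAILQ_FIRST(hooks);
              while (t != NULL) {
                      if (t->t_func == NULL) { /* skip other cursors */
                              t = TAILQ_NEXT(t, t_entry);
                              continue;
                      }

                      TAILQ_INSERT_AFTER(hooks, t, &cursor, t_entry);
                      func = t->t_func;
                      arg = t->t_arg;

                      mtx_leave(&hooks_mtx);
                      (*func)(arg); /* run the hook unlocked */
                      mtx_enter(&hooks_mtx);

                      /* resume from wherever the cursor ended up */
                      t = TAILQ_NEXT(&cursor, t_entry);
                      TAILQ_REMOVE(hooks, &cursor, t_entry);
              }
              mtx_leave(&hooks_mtx);
      }
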
* turn the linkstate hooks into a task list, like the detach hooks.
  [dlg 2019-11-07, 1 file, -8/+23]

  this is largely mechanical, except for carp. this moves the addition of
  the carp link state hook after we're committed to using the new interface
  as a carpdev. because the add can't fail, we avoid a complicated unwind
  dance. also, this tweaks the carp linkstate hook so it only updates the
  relevant carp interface, not all of the carpdevs on the parent.

  hrvoje popovski has tested an early version of this diff and it's
  generally ok, but there are some splasserts that this diff fires that
  i'll fix in an upcoming diff.

  ok claudio@

* replace the hooks used with if_detachhooks with a task list.
  [dlg 2019-11-06, 1 file, -7/+23]

  the main semantic change is that things registering detach hooks have to
  allocate and set a task structure that then gets added to the list. this
  means if the task is allocated up front (eg, as part of carp's softc or
  bridge's port structure), it avoids the possibility that adding a hook
  can fail. a lot of drivers weren't checking for failure, and unwinding
  state in the event of failure in other parts was error prone.

  while doing this i discovered that the list operations have to be done in
  a particular order, but drivers weren't doing that consistently either.
  this diff wraps the list ops up so you have to seriously go out of your
  way to screw them up.

  i've also sprinkled some NET_ASSERT_LOCKED around the list operations so
  we can make sure there's no potential for the list to be corrupted,
  especially while it's being run.

  hrvoje popovski has tested this a bit, and some issues he discovered have
  been fixed.

  ok sashan@

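  The registration pattern this enables, sketched with illustrative
  driver names (task_set() and if_detachhook_add/del() are the real API):

      /* assumes <sys/param.h>, <sys/task.h>, <net/if.h>, <net/if_var.h> */

      /* the task lives in the softc, so adding the hook cannot fail */
      struct mydrv_softc {
              struct ifnet    *sc_ifp0;   /* parent interface */
              struct task      sc_dtask;  /* pre-allocated detach hook */
      };

      void mydrv_parent_detached(void *); /* hypothetical handler */

      void
      mydrv_set_parent(struct mydrv_softc *sc, struct ifnet *ifp0)
      {
              sc->sc_ifp0 = ifp0;
              task_set(&sc->sc_dtask, mydrv_parent_detached, sc);
              if_detachhook_add(ifp0, &sc->sc_dtask); /* no error to unwind */
      }

      void
      mydrv_clear_parent(struct mydrv_softc *sc)
      {
              if_detachhook_del(sc->sc_ifp0, &sc->sc_dtask);
              sc->sc_ifp0 = NULL;
      }
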
* Remove support for semantically opaque interface identifiers (RFC 7217)
  for IPv6 link local addresses.
  [florian 2019-08-21, 1 file, -7/+3]

  Some hosting and VM providers route customer IPv6 prefixes to link local
  addresses derived from ethernet MAC addresses (RFC 2464). This leads to
  hard to debug IPv6 connectivity problems and is probably not worth the
  effort.

  RFC 7721 lists 4 weaknesses:

  3.1. Correlation of Activities over Time & 3.2. Location Tracking
  These are still possible with RFC 7217 addresses for an adversary
  connected to the same layer 2 network (think conference wifi). Since the
  link local prefix stays the same (fe80::/64) the link local addresses do
  not change between different networks. An adversary on the same layer 2
  network can probably track ethernet MAC addresses via different means,
  too.

  3.3. Address Scanning & 3.4. Device-Specific Vulnerability Exploitation
  These now become possible; however, as noted above, a layer 2 adversary
  was probably able to do this via different means.

  People concerned with these weaknesses are advised to use
  ifconfig lladdr random.

  OK benno, input & OK kn

* When we needed the kernel lock for local IP packet delivery, mpi@
  introduced a queue to grab the lock for multiple packets.
  [bluhm 2019-08-06, 1 file, -7/+1]

  Now we have only the netlock for both IP and protocol input, so the queue
  is not necessary anymore; it just switches CPUs and decreases
  performance. Remove the inet and inet6 ip queues for local packets. To
  get TCP running on loopback, we have to queue once between TCP input and
  output of the two sockets, so use the loopback queue in looutput()
  unconditionally.

  OK visa@

* if_vinput should pass BPF_DIRECTION_IN to bpf_mtap, not OUT
  [dlg 2019-06-30, 1 file, -2/+2]

* Free sizes for hooks and `if_sadl'.
  [mpi 2019-06-15, 1 file, -7/+9]

  ok kn@

* if_netisr(): trade NET_LOCK() for NET_RLOCK()
  [sashan 2019-06-04, 1 file, -5/+5]

  OK mpi@

* pushing NET_LOCK() further down from if_clone_{create,destroy}()
  [sashan 2019-05-12, 1 file, -18/+10]

  OK mpi@

* Correct a check in if_isconnected().
  [mpi 2019-05-11, 1 file, -2/+2]

  Fix a regression introduced by the bridge(4) refactoring.

  Found by and ok bluhm@

* Removes the KERNEL_LOCK() from bridge(4)'s output fast-path.
  [mpi 2019-04-28, 1 file, -6/+4]

  This redefines the ifp <-> bridge relationship. No lock can currently be
  used across the multiple contexts where the bridge has tentacles; to
  protect a pointer, use an interface index instead.

  Tested by various, ok dlg@, visa@

* add if_vinput so pseudo (ethernet) interfaces can bypass ifiqs
  [dlg 2019-04-22, 1 file, -1/+27]

  if_vinput assumes that the interface it's called against uses per cpu
  counters so it can count input packets, but basically does all the things
  that if_input and ifiq_input do. the main difference is it assumes the
  network stack is already running and runs the interface input handlers
  directly, instead of queuing the packets for a nettq to run.

  ifiqs aren't free, especially when they only run per packet like they do
  on pseudo interfaces. this allows that overhead to be bypassed.

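  A sketch of the call site in a virtual driver's receive path (the mydrv
  names are illustrative; if_vinput() and if_counters_alloc() are the
  real API):

      /* assumes <sys/param.h>, <sys/mbuf.h>, <net/if.h>, <net/if_var.h> */

      struct mydrv_softc {
              struct ifnet sc_if; /* if_counters_alloc()ed at attach */
      };

      /*
       * Deliver a packet "received" on the virtual interface. We are
       * already running in the network stack, so instead of queueing
       * onto an ifiq for a nettq to pick up later, run the interface
       * input handlers directly. if_vinput() also bumps the per-cpu
       * input counters.
       */
      void
      mydrv_rx(struct mydrv_softc *sc, struct mbuf *m)
      {
              if_vinput(&sc->sc_if, m);
      }
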
* factor out calling the if input handlers for a packet.
  [dlg 2019-04-22, 1 file, -17/+23]

  this is a step toward letting interfaces like vlan bypass ifiqs

* provide factored out txhprio and rxhprio checks
  [dlg 2019-04-19, 1 file, -1/+65]

  l2 and l3 drivers do the same thing all the time, so reduce the chance of
  error by doing the checks once and making them available for drivers to
  call instead of rolling their own again.

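  How a driver ioctl handler uses the shared checks, as a sketch (mydrv
  and sc_txhprio are assumptions; SIOCSTXHPRIO, ifr_hdrprio and
  if_txhprio_l2_check() are real):

      /* assumes <sys/param.h>, <sys/sockio.h>, <net/if.h>, <net/if_var.h> */

      struct mydrv_softc {
              int sc_txhprio; /* hypothetical driver state */
      };

      int
      mydrv_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
      {
              struct mydrv_softc *sc = ifp->if_softc;
              struct ifreq *ifr = (struct ifreq *)data;
              int error = 0;

              switch (cmd) {
              case SIOCSTXHPRIO:
                      /* shared validation, not a hand-rolled check */
                      error = if_txhprio_l2_check(ifr->ifr_hdrprio);
                      if (error != 0)
                              break;
                      sc->sc_txhprio = ifr->ifr_hdrprio;
                      break;
              case SIOCGTXHPRIO:
                      ifr->ifr_hdrprio = sc->sc_txhprio;
                      break;
              default:
                      error = ENOTTY;
                      break;
              }

              return (error);
      }
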
* only root can change rxprio
  [dlg 2019-04-19, 1 file, -1/+2]

* have another go at tx mitigation
  [dlg 2019-04-16, 1 file, -1/+3]

  the idea is to call the hardware transmit routine less, since in a lot of
  cases posting a producer ring update to the chip is (very) expensive.
  it's better to do it for several packets instead of each packet, hence
  calling this tx mitigation. this diff defers the call to the transmit
  routine to a network taskq, or until a backlog of packets has built up.
  dragonflybsd uses 16 as the size of its backlog, so i'm copying them for
  now.

  i've tried this before, but previous versions caused deadlocks. i
  discovered that the deadlocks in the previous version were from
  ifq_barrier calling taskq_barrier against the nettq. interfaces generally
  hold NET_LOCK while calling ifq_barrier, but the tq might already be
  waiting for the lock we hold. this version just doesn't have ifq_barrier
  call taskq_barrier. it instead relies on the IFF_RUNNING flag and the
  normal ifq serialiser barrier to guarantee the start routine won't be
  called when an interface is going down. the taskq_barrier is only used
  during interface destruction to make sure the task struct won't get used
  in the future, which is already done without the NET_LOCK being held.

  tx mitigation provides a nice performance bump in some setups. up to 25%
  in some cases. tested by tb@ and hrvoje popovski (who's running this in
  production).

  ok visa@

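  The mitigation decision itself is small. Sketched here with an
  illustrative function name; ifq_len(), task_add()/task_del() and
  ifq_serialize() are real, but treating ifq_bundle and ifq_start this
  way is an assumption about the shape of the code:

      /* assumes <sys/param.h>, <sys/task.h>, <net/if_var.h>, <net/ifq.h> */

      #define TX_MIT_BACKLOG 16 /* backlog size copied from dragonflybsd */

      void
      ifq_start_mitigated(struct ifqueue *ifq)
      {
              if (ifq_len(ifq) >= TX_MIT_BACKLOG) {
                      /* enough backlog: cancel any deferred start and
                       * post the producer update to the chip now */
                      task_del(ifq->ifq_softnet, &ifq->ifq_bundle);
                      ifq_serialize(ifq, &ifq->ifq_start);
              } else {
                      /* let the nettq batch this with later packets */
                      task_add(ifq->ifq_softnet, &ifq->ifq_bundle);
              }
      }
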
* don't take splnet when running the network stack
  [dlg 2019-04-14, 1 file, -4/+1]

  the stack uses the NET_LOCK for most protection now, so it doesn't need
  to block actual hardware interrupts. blocking hw interrupts can cause
  huge latency spikes, which in turn works against the rx ring moderation.

  i'm putting this in early in the release cycle so it can get the most
  testing possible.

  ok mpi@ (a while back)

* do custom checks for SIOCGIFSFFPAGE.
  [dlg 2019-04-10, 1 file, -1/+32]

  this should only be used by root, and it should not take the NET_LOCK
  because a bunch of i2c reads can take a relatively long time, during
  which packets would be blocked. while here, make sure userland only
  requests pages from the eeprom and diag i2c addresses.

  ok deraadt@

* rework how ifiq_input decides the stack is busy and whether it should drop
  [dlg 2019-03-01, 1 file, -2/+2]

  previously ifiq_input used the traditional backpressure or defense
  mechanism and counted packets to decide when to shed load by dropping.
  currently it ends up waiting for 10240 packets to get queued on the stack
  before it decides to drop packets. this may be ok for some machines, but
  for a lot of them it was too much.

  this diff reworks how ifiqs measure how busy the stack is by introducing
  an ifiq_pressure counter that is incremented when ifiq_input is called,
  and cleared when ifiq_process calls the network stack to process the
  queue. if ifiq_input is called multiple times before ifiq_process runs in
  a net taskq, ifiq_pressure goes up, and ifiq_input uses a high value to
  decide the stack is busy and it should drop.

  i was hoping there would be no performance impact from this change, but
  hrvoje popovski notes a slight bump in forwarding performance. my own
  testing shows that the ifiq input list length grows to a fraction of the
  10240 it used to get to, which means the maximum burst of packets through
  the stack is smoothed out a bit. instead of big lists of packets followed
  by big periods of drops, we get relatively small bursts of packets with
  smaller gaps where we drop.

  the follow-on from this is to make drivers implementing rx ring
  moderation use the return value of ifiq_input to scale the ring
  allocation down, allowing the hardware to drop packets so software
  doesn't have to.

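  In outline, the pressure mechanism looks like the sketch below; the
  threshold value and the _sketch names are assumptions, while the ifiq
  fields and the mbuf_list calls follow the kernel's:

      /* assumes <sys/param.h>, <sys/mbuf.h>, <sys/task.h>, <net/if_var.h> */

      #define IFIQ_PRESSURE_DROP 8 /* illustrative threshold */

      int
      ifiq_input_sketch(struct ifiqueue *ifiq, struct mbuf_list *ml)
      {
              unsigned int pressure;

              mtx_enter(&ifiq->ifiq_mtx);
              /* another producer call since the stack last ran */
              pressure = ++ifiq->ifiq_pressure;
              if (pressure <= IFIQ_PRESSURE_DROP)
                      ml_enlist(&ifiq->ifiq_ml, ml); /* accept */
              mtx_leave(&ifiq->ifiq_mtx);

              if (pressure > IFIQ_PRESSURE_DROP) {
                      ml_purge(ml); /* stack is busy: shed load */
                      return (1);   /* tell the driver to slow down */
              }

              task_add(ifiq->ifiq_softnet, &ifiq->ifiq_task);
              return (0);
      }

      void
      ifiq_process_sketch(void *arg) /* runs in a nettq */
      {
              struct ifiqueue *ifiq = arg;
              struct mbuf_list ml;

              mtx_enter(&ifiq->ifiq_mtx);
              ifiq->ifiq_pressure = 0; /* keeping up again */
              ml = ifiq->ifiq_ml;
              ml_init(&ifiq->ifiq_ml);
              mtx_leave(&ifiq->ifiq_mtx);

              if_input_process(ifiq->ifiq_if, &ml);
      }
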
* check for root on mpls and pwe3 ioctls
  [dlg 2019-02-26, 1 file, -1/+8]

  part of a larger diff, ok mpi@

* split if_enqueue up so drivers can replace ifq handling if needed
  [dlg 2019-01-09, 1 file, -11/+25]

  if_enqueue() still makes sure packets get handled by pf on the way out,
  and seen by bridge if needed. however, instead of falling through to ifq
  mapping and output, it now calls a function pointer in the ifnet struct.
  that pointer defaults to the ifq handling, but drivers can override it to
  bypass ifq processing.

  the most obvious users of the function pointer will be virtual
  interfaces, eg, vlan(4). ifqs are good if you need to serialise access to
  the thing that transmits packets (like hardware rings on nics), or
  mitigate the number of times you do ring processing, but neither of those
  things are desirable on vlan interfaces.

  ideally vlan could transmit on any cpu without having packets serialised
  by its own ifq before being pushed down to an arbitrary number of rings
  on the parent interface. bypassing ifqs means the driver can push the
  vlan tag on concurrently and push down to the parent from any cpu.

  ok mpi@, no objection from claudio@

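  What a vlan-like driver can do with the pointer, as a sketch (the vdrv
  names are illustrative; if_enqueue(), ether_vtag and M_VLANTAG are
  real):

      /* assumes <sys/param.h>, <sys/mbuf.h>, <net/if.h>, <net/if_var.h> */

      struct vdrv_softc {
              struct ifnet  sc_if;
              struct ifnet *sc_ifp0;  /* parent, assumed valid here */
              uint16_t      sc_tag;   /* 802.1Q tag */
      };

      /* tag on the calling cpu and hand straight to the parent,
       * with no ifq serialisation on the virtual interface */
      int
      vdrv_enqueue(struct ifnet *ifp, struct mbuf *m)
      {
              struct vdrv_softc *sc = ifp->if_softc;

              m->m_pkthdr.ether_vtag = sc->sc_tag;
              m->m_flags |= M_VLANTAG;

              return (if_enqueue(sc->sc_ifp0, m));
      }

      void
      vdrv_attach(struct vdrv_softc *sc)
      {
              struct ifnet *ifp = &sc->sc_if;

              /* override before if_attach so the ifq default isn't used */
              ifp->if_enqueue = vdrv_enqueue;
              if_attach(ifp);
      }
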
* Make this not hz dependent by using timeout_add_sec(). Also rename the
  define to IFNET_SLOWTIMO since it is no longer a hz divisor.
  [claudio 2018-12-20, 1 file, -2/+2]

  OK visa@ bluhm@ kn@

* add optional per-cpu counters for interface stats.
  [dlg 2018-12-11, 1 file, -1/+41]

  these exist so interfaces that want to do mpsafe work outside the ifq
  machinery have a place to allocate and update stats in. the generic ioctl
  handling for getting stats to userland knows how to roll the new per cpu
  stats into the rest before export.

  ok visa@

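  A sketch of a driver opting in (the mydrv names are illustrative;
  if_counters_alloc() and counters_pkt() are the real API):

      /* assumes <sys/param.h>, <sys/mbuf.h>, <sys/percpu.h>,
       * <net/if.h> and <net/if_var.h> */

      struct mydrv_softc {
              struct ifnet sc_if;
      };

      void
      mydrv_attach(struct mydrv_softc *sc)
      {
              struct ifnet *ifp = &sc->sc_if;

              if_counters_alloc(ifp); /* opt in to per-cpu stats */
              if_attach(ifp);
      }

      void
      mydrv_count_rx(struct ifnet *ifp, struct mbuf *m)
      {
              /* one packet, m_pkthdr.len bytes, on this cpu's counters */
              counters_pkt(ifp->if_counters, ifc_ipackets, ifc_ibytes,
                  m->m_pkthdr.len);
      }
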
* restrict setting ecn to root
  [dlg 2018-11-29, 1 file, -1/+2]

  ok claudio@

* only let root configure the txprio setting on an interface
  [dlg 2018-11-12, 1 file, -1/+2]

  ok claudio@

* Put bridge_input & output back under the KERNEL_LOCK().
  [mpi 2018-10-01, 1 file, -3/+7]

  Wireless drivers call if_enqueue() out of the NET_LOCK() so it cannot be
  used to serialize bridge(4) states.

  Found by stsp@, ok visa@

* Move bridge input/output paths outside of the KERNEL_LOCK().
  [mpi 2018-09-26, 1 file, -11/+4]

  Tested by Hrvoje Popovski who measured a 30% improvement of forwarded
  packets in the best case.

  ok visa@

* Nuke some trailing spaces that wandered into the crosshairs.
  [krw 2018-09-11, 1 file, -2/+2]

* - if_cloners list populated at boot time only, then becomes immutable, so
  we can let go of if_cloners_lock.
  [sashan 2018-09-10, 1 file, -24/+10]

  OK tb@, claudio@, bluhm@, kn@, henning@

* if_setrdomain could potentially call if_clone_create recursively in the
  create-rdomain case, leading to locking issues and lots of
  headscratching.
  [henning 2018-09-10, 1 file, -32/+35]

  it turns out the only case where if_setrdomain could actually create an
  rdomain, and thus end up with that pattern, is the ioctl path. make
  if_setrdomain never create an rdomain and return an error if it doesn't
  exist already, introduce if_createrdomain, and adjust the ioctl path to
  use it.

  ok sashan bluhm claudio

* if_setrdomain: allow empty rtables to be turned into rdomains, not just
  nonexistent ones as before.
  [henning 2018-09-09, 1 file, -13/+14]

  nasty error handling with bluhm, feedback mpi as well.

  ok bluhm