path: root/sys/netinet (follow)
Commit message | Author | Age | Files | Lines
...
* Simplify igmp_sysctl to directly return error in default casegnezdo2020-08-172-15/+3
| | | | | | | This replaces a piece of observationally identical code which was much more complicated. ok mpi@
* No longer prevent TCP connections to IPv6 anycast addresses.florian2020-08-081-15/+1
| | | | | | | | | | | | | | | | | | | | RFC 4291 dropped this requirement from RFC 3513: o An anycast address must not be used as the source address of an IPv6 packet. And from that requirement draft-itojun-ipv6-tcp-to-anycast rightly concluded that TCP connections must be prevented. The draft also states: The proposed method MUST be removed when one of the following events happens in the future: o Restriction imposed on IPv6 anycast address is loosened, so that anycast address can be placed into source address field of the IPv6 header[...] OK jca
* Don't compare pointers against zero.mglocker2020-08-051-3/+3
| | | | | | Reported by Peter J. Philipp. ok mvs@ deraadt@
* Move range check inside sysctl_int_arrgnezdo2020-08-017-88/+56
| | | | | | | Range violations are now consistently reported as EOPNOTSUPP. Previously they were mixed with ENOPROTOOPT. OK kn@
* Don't treat it as an error if carppeer is a unicast address and the peer is down.yasuoka2020-07-281-2/+4
| | | | ok kn
* After the previous commit, src/regress/sys/netinet/carp triggeredbluhm2020-07-281-3/+3
| | | | | a uvm fault. Check that ifp0 is not NULL. OK sashan@ mvs@
* netinet: tcp_close(): delay reaper timeout by one tickcheloha2020-07-241-2/+2
| | | | | | | | | Zero-tick timeouts rely on implicit behavior in the timeout layer that inhibits optimizations in softclock(). bluhm@ says waiting a tick for the reaper shouldn't break anything. ok bluhm@
* Use interface index instead of pointer to `ifnet' in carp(4).mvs2020-07-242-58/+96
| | | | ok sashan@
* deprecate interface input handler lists, just use one input function.dlg2020-07-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | the interface input handler lists were originally set up to help us during the initial mpsafe network stack work. at the time not all the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc) were mpsafe, so we wanted a way to avoid them by default, and only take the kernel lock hit when they were specifically enabled on the interface. since then, they have been fixed up to be mpsafe. i could leave the list in place, but it has some semantic problems. because virtual interfaces filter packets based on the order they were attached to the parent interface, you can get packets taken away in surprising ways, especially when you reboot and netstart does something different to what you did by hand. by hardcoding the order that things like vlan and bridge get to look at packets, we can document the behaviour and get consistency. it also means we can get rid of a use of SRPs which were difficult to replace with SMRs. the interface input handler list is an SRPL, which we would like to deprecate. it turns out that you can sleep during stack processing, which you're not supposed to do with SRPs or SMRs, but SRPs are a lot more forgiving and it worked. lastly, it turns out that this code is faster than the input list handling, so lots of winning all around. special thanks to hrvoje popovski and aaron bieber for testing. this has been in snaps as part of a larger diff for over a week.
* move carp_input into ether_input, instead of via an input handler.dlg2020-07-222-23/+9
| | | | | | | | carp_input is only tried after vlan and bridge handling is done, and after the ethernet packet doesn't match the parent interface's mac address. this has been in snaps as part of a larger diff for over a week.
* add code to coordinate how bridges attach to ethernet interfaces.dlg2020-07-221-1/+14
| | | | | | | | | | | | | | | | | | | | | | | this is the first step in refactoring how ethernet frames are demuxed by virtual interfaces, and also in deprecating interface input list handling. we now have drivers for three types of virtual bridges, bridge(4), switch(4), and tpmr(4), and it doesn't make sense for any of them to be enabled on the same "port" interfaces at the same time. currently you can add a port interface to multiple types of bridge, but which one gets to steal the packets depends on the order in which they were attached. this creates an ether_brport structure that holds an input function for the bridge, and optionally some per port state that the bridge can use. arpcom has a single pointer to one of these structs that will be used during normal ether_input processing to see if a packet should be passed to a bridge, and will be used instead of an if input handler. because it is a single pointer, it will make sure only one bridge of any type is attached to a port at any one time. this has been in snaps as part of a larger diff for over a week.
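The single-pointer arrangement described above can be sketched as follows. This is a minimal illustration, not the committed code: the names brport_sketch, port_sketch, port_set_brport, port_clr_brport, port_input and steal_all are hypothetical, and returning EBUSY for a second attach is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* hypothetical stand-in for the per-port bridge hook described above */
struct brport_sketch {
	/* returns NULL if the bridge took the packet, else the packet */
	void *(*input)(void *pkt, void *arg);
	void *arg;		/* optional per-port state for the bridge */
};

/* hypothetical stand-in for the port side (arpcom in the commit) */
struct port_sketch {
	const struct brport_sketch *brport;	/* NULL: no bridge attached */
};

/* attach a bridge; a single pointer means a second attach must fail */
static int
port_set_brport(struct port_sketch *p, const struct brport_sketch *eb)
{
	if (p->brport != NULL)
		return EBUSY;	/* assumed error code */
	p->brport = eb;
	return 0;
}

static void
port_clr_brport(struct port_sketch *p)
{
	p->brport = NULL;
}

/* input path: offer the packet to the attached bridge, if any */
static void *
port_input(struct port_sketch *p, void *pkt)
{
	if (p->brport != NULL)
		pkt = p->brport->input(pkt, p->brport->arg);
	return pkt;
}

/* example bridge input function that steals every packet */
static void *
steal_all(void *pkt, void *arg)
{
	(void)pkt;
	(void)arg;
	return NULL;
}
```

Because attachment is a single pointer rather than a list, the order in which bridges were attached can no longer decide which one sees the packet.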
* kernel: use gettime(9)/getuptime(9) in lieu of time_second(9)/time_uptime(9)cheloha2020-06-246-29/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | time_second(9) and time_uptime(9) are widely used in the kernel to quickly get the system UTC or system uptime as a time_t. However, time_t is 64-bit everywhere, so it is not generally safe to use them on 32-bit platforms: you have a split-read problem if your hardware cannot perform atomic 64-bit reads. This patch replaces time_second(9) with gettime(9), a safer successor interface, throughout the kernel. Similarly, time_uptime(9) is replaced with getuptime(9). There is a performance cost on 32-bit platforms in exchange for eliminating the split-read problem: instead of two register reads you now have a lockless read loop to pull the values from the timehands. This is really not *too* bad in the grand scheme of things, but compared to what we were doing before it is several times slower. There is no performance cost on 64-bit (__LP64__) platforms. With input from visa@, dlg@, and tedu@. Several bugs squashed by visa@. ok kettenis@
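The "lockless read loop" mentioned above can be illustrated with a generation counter. This is a sketch of the general technique under stated assumptions, not the kernel's timehands code: the struct and function names are hypothetical, a single writer is assumed, and C11 sequentially consistent atomics stand in for the kernel's memory barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* hypothetical container: a 64-bit value stored as two 32-bit halves */
struct timehands_sketch {
	atomic_uint gen;		/* 0 means an update is in progress */
	_Atomic uint32_t sec_lo;	/* low 32 bits of the seconds value */
	_Atomic uint32_t sec_hi;	/* high 32 bits of the seconds value */
};

/* single writer: invalidate, update both halves, publish a new generation */
static void
th_store(struct timehands_sketch *th, int64_t sec)
{
	unsigned int gen = atomic_load(&th->gen);

	atomic_store(&th->gen, 0);
	atomic_store(&th->sec_lo, (uint32_t)((uint64_t)sec & 0xffffffffu));
	atomic_store(&th->sec_hi, (uint32_t)((uint64_t)sec >> 32));
	if (++gen == 0)
		gen = 1;		/* never publish the "in progress" value */
	atomic_store(&th->gen, gen);
}

/* reader: retry until both halves were read under one stable generation */
static int64_t
th_load(struct timehands_sketch *th)
{
	unsigned int gen;
	uint32_t lo, hi;

	do {
		gen = atomic_load(&th->gen);
		lo = atomic_load(&th->sec_lo);
		hi = atomic_load(&th->sec_hi);
	} while (gen == 0 || gen != atomic_load(&th->gen));
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

The extra loads and the possible retry are the cost the commit message refers to on 32-bit platforms; a platform with atomic 64-bit reads needs neither.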
* wrap a long line. no functional change.dlg2020-06-211-2/+3
|
* if an inp_upcall is set, let it look at and maybe steal the udp packet.dlg2020-06-211-3/+11
| | | | | i wrote the original version of this, but it was tweaked by Matt Dunwoodie and Jason A. Donenfeld for use with wireguard.
* knf: the inp_upcall line was too long.dlg2020-06-211-2/+3
|
* add an inp_upcall function pointer and inp_upcall_arg to struct in_pcb.dlg2020-06-211-1/+3
| | | | | | | | | this is so protocols (eg, udp) can let things (eg, kernel support for wireguard or vxlan or geneve) look at and possibly steal packets before they get added to a socket buffer. i wrote the original version of this, but it was tweaked by Matt Dunwoodie and Jason A. Donenfeld for use with wireguard.
* Break a glass ceiling on cwnd due to integer division during congestionprocter2020-06-191-2/+2
| | | | | | | | | avoidance. The problem and fix is noted in RFC5681 section 3.1, page 7. Report, diff and testing from Brian Brombacher, thanks! Testing and a cosmetic tweak by myself. ok claudio
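The ceiling comes from equation (3) of RFC 5681 section 3.1, cwnd += SMSS*SMSS/cwnd: with integer arithmetic the quotient is 0 once cwnd exceeds SMSS*SMSS, and the RFC says a result of 0 should be rounded up to 1 byte. A minimal sketch of that rule, using a hypothetical helper name rather than the code from the diff:

```c
#include <stdint.h>

/*
 * Per-ACK congestion avoidance increment, RFC 5681 section 3.1 eq. (3).
 * Hypothetical helper for illustration; not the OpenBSD implementation.
 */
static uint32_t
cwnd_ca_increment(uint32_t cwnd, uint32_t smss)
{
	uint64_t incr;

	if (cwnd == 0)
		return smss;	/* sketch-only guard against dividing by zero */
	incr = (uint64_t)smss * smss / cwnd;
	/* integer division yields 0 once cwnd > smss * smss: round up to 1 */
	if (incr == 0)
		incr = 1;
	return (uint32_t)incr;
}
```

With an SMSS of 1460 bytes the unrounded formula stops growing the window at 1460 * 1460 = 2131600 bytes, which is the glass ceiling the commit subject describes.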
* Refuse to set 0 or a negative value for net.inet.tcp.synbucketlimit.mpi2020-06-181-1/+14
| | | | | | | | Prevent a panic in syn_cache_insert() found by syzbot. Reported-by: syzbot+aee24ad9b7bf5665912d@syzkaller.appspotmail.com ok sashan@, anton@, millert@
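A range check of this kind can be sketched as below. The helper name and the choice of EINVAL are assumptions made for illustration; the commit message does not say which error is returned.

```c
#include <errno.h>

/*
 * Hypothetical setter: accept only strictly positive limits and leave
 * the current value untouched when the new one is rejected.
 */
static int
set_bucket_limit(int *limit, int newval)
{
	if (newval <= 0)
		return EINVAL;	/* assumed error code */
	*limit = newval;
	return 0;
}
```

Rejecting the value at the sysctl boundary keeps the consumer of the limit from ever seeing 0 or a negative number.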
* Connectionless sockets like UDP can be re-connected to a differentbluhm2020-05-271-1/+8
| | | | | | | | | address. In that case, the linking to the pf state must be dissolved as the latter still contains the old address. If it is a divert state, also remove the state as any divert state must be associated with a matching socket. Call pf_remove_divert_state() and pf_inp_unlink() from in_pcbconnect(). reported by Tim Kuijsten; OK sashan@ claudio@
* Document the various flavors of NET_LOCK() and rename the reader version.mpi2020-05-272-8/+8
| | | | | | | | | | Since our last concurrency mistake, only ioctl(2) and sysctl(2) code paths take the reader lock. This is mostly for documentation purposes until the softnet thread is converted back to use a read lock. dlg@ said that comments should be good enough. ok sashan@
* don't count packets in the carp protocol handling against an interface.dlg2020-05-211-7/+1
| | | | | | | these packets have generally already been counted on the interface because that's where they were sent or received from. the protocol handling side of things already counts things like packets, which you see with netstat -sp carp.
* implement a carp_transmit that bypasses the ifq on output.dlg2020-05-211-41/+65
| | | | | | | | | | | | this is modelled on vlan_transmit, and basically enqueues the packet directly on the parent interface. even though carp is generally not used to transmit packets, we run dhcp relays on it at work and hit a situation where we unnecessarily dropped packets because its ifq maxlen was 1. i've been running this for a month in production. ok jmatthew@
* remove some trailing whitespace. no functional change.dlg2020-04-291-5/+5
|
* Add support for automatically moving traffic between rdomains on ipsec(4)tobhe2020-04-234-47/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | encryption or decryption. This allows us to keep plaintext and encrypted network traffic separated and reduces the attack surface for network sidechannel attacks. The only way to reach the inner rdomain from outside is by successful decryption and integrity verification through the responsible Security Association (SA). The only way for internal traffic to get out is getting encrypted and moved through the outgoing SA. Multiple plaintext rdomains can share the same encrypted rdomain while the unencrypted packets are still kept separate. The encrypted and unencrypted rdomains can have different default routes. The rdomains can be configured with the new SADB_X_EXT_RDOMAIN pfkey extension. Each SA (tdb) gets a new attribute 'tdb_rdomain_post'. If this differs from 'tdb_rdomain' then the packet is moved to 'tdb_rdomain_post' after IPsec processing. Flows and outgoing IPsec SAs are installed in the plaintext rdomain, incoming IPsec SAs are installed in the encrypted rdomain. IPCOMP SAs are always installed in the plaintext rdomain. They can be viewed with 'route -T X exec ipsecctl -sa' where X is the rdomain ID. As the kernel does not create encX devices automatically when creating rdomains they have to be added by hand with ifconfig for IPsec to work in non-default rdomains. discussed with chris@ and kn@ ok markus@, patrick@
* Stop processing packets under non-exclusive (read) netlock.mpi2020-04-121-3/+3
| | | | | | | | | | | | Prevent concurrency in the socket layer which is not ready for that. Two recent data corruptions in pfsync(4) and the socket layer pointed out that, at least, tun(4) was incorrectly using NET_RUNLOCK(). Until we find a way in software to avoid future mistakes and to make sure that only the softnet thread and some ioctls are safe to use a read version of the lock, put everything back to the exclusive version. ok stsp@, visa@
* Guard SIOCDELMULTI if_ioctl calls with KERNEL_LOCK() where the call isvisa2020-03-152-2/+6
| | | | | | | | | | made from socket close path. Most device drivers are not MP-safe yet, and the closing of AF_INET and AF_INET6 sockets is no longer under the kernel lock. This fixes a panic seen by jcs@. OK mpi@
* Fix uninitialized use of variable 'len'.tobhe2020-03-061-6/+4
| | | | ok bluhm@
* add define for IPTOS_DSCP_LE; "low effort" DSCP codepoint standardiseddjm2020-01-261-1/+2
| | | | in RFC8622; ok job@
* rdr-to with loopback destination should work even thoughsashan2019-12-231-2/+3
| | | | | | IP forwarding is disabled. Issue reported by Daniel Jakots (danj@) OK bluhm@
* Make bundled IPcomp/ESP policies work with IPSEC_LEVEL_REQUIRE.tobhe2019-12-101-1/+19
| | | | | | | We only install flows for IPcomp. When processing an incoming ESP SA, look for a bundled IPcomp SA and use that in the policy check. ok bluhm@
* always pull in if_types.h, to unbreak ramdisksderaadt2019-12-091-2/+2
|
* Make sure packet destination address matches interface address,sashan2019-12-083-4/+44
| | | | | | | | | where such packet is bound to. This check is enforced if and only if IP forwarding is disabled. Change discussed with bluhm@, claudio@, deraadt@, markus@, tobhe@ OK bluhm@, claudio@, tobhe@
* Checking the IPsec policy is expensive. Check only when IPsec is used.tobhe2019-12-062-30/+34
| | | | ok bluhm@
* Don't require a valid sa_len for a bunch of IPv4 "get" ioctlsjca2019-12-011-3/+6
| | | | | Same fix as for the IPv6 case. Fixes a regression in ports/net/openvpn spotted by landry@, ok bluhm@
* Change the default security level for incoming IPsec flows fromtobhe2019-11-292-60/+63
| | | | | | isakmpd and iked to REQUIRE. Filter policy violations earlier. ok sashan@ bluhm@
* Although ifconfig(8) checks it already, enforce contiguous inetbluhm2019-11-281-4/+21
| | | | | netmask in the kernel. OK visa@
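One common way to test that a netmask is contiguous (all one bits first, then all zero bits) is a bit trick on the inverted mask. The sketch below assumes the mask has already been converted to host byte order (for example with ntohl()); the function name is hypothetical and this is not necessarily how the kernel check is written.

```c
#include <stdint.h>

/*
 * Return 1 if mask (host byte order) is a run of 1 bits followed by a
 * run of 0 bits, else 0.  For a contiguous mask, ~mask is 2^k - 1, so
 * ~mask + 1 is a power of two (or wraps to 0) and shares no bits with it.
 */
static int
netmask_is_contiguous(uint32_t mask)
{
	uint32_t inv = ~mask;

	return (inv & (inv + 1)) == 0;
}
```

For example, 255.255.255.0 (0xffffff00) passes, while 255.0.255.0 (0xff00ff00) is rejected.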
* Add DoT 853 to DEFBADDYNAMICPORTS_TCP. This port will be increasinglyderaadt2019-11-131-2/+2
| | | | | | unfiltered in the future, so this prevents rresvport_af(3) from randomly exposing a service intended for local visibility only. ok florian
* Prevent underflows in tp->snd_wnd if the remote side ACKs more thanbluhm2019-11-111-3/+9
| | | | | | tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains. from FreeBSD; via markus@; OK sashan@ tobhe@
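Since the send window is an unsigned quantity, subtracting more than it holds wraps around to a very large value. A minimal sketch of the clamping idea, with a hypothetical helper name and without claiming to match the actual diff:

```c
#include <stdint.h>

/*
 * Hypothetical helper: shrink the send window by the number of bytes
 * ACKed, clamping at zero instead of letting the subtraction wrap.
 */
static uint32_t
snd_wnd_after_ack(uint32_t snd_wnd, uint32_t acked)
{
	if (acked > snd_wnd)
		return 0;
	return snd_wnd - acked;
}
```

In the window probe case from the commit message, snd_wnd is 0 and one byte is ACKed; the unclamped result would be 0xffffffff.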
* avoid being too clever about setting/clearing ifpromisc on the parent.dlg2019-11-081-8/+6
| | | | | | ifpromisc() already refcounts, so carp doesn't have to do it implicitly with the carpdev list. there's no functional change, the code just gets a bit simpler.
* convert interface address change hooks to tasks and a task_list.dlg2019-11-082-11/+11
| | | | | | | | | | | | | | | this follows what's been done for detach and link state hooks, and makes handling of hooks generally more robust. address hooks are a bit different to detach/link state hooks in that there's only a few things that register hooks (carp, pf, vxlan), but a lot of places to run the hooks (lots of ipv4 and ipv6 address configuration). an address hook cookie was in struct pfi_kif, which is part of the pf abi. rather than break pfctl -sI, this maintains the void * used for the cookie and uses it to store a task, which is then used as intended with the new api.
* Do proper kernel input validation for in_control() ioctl(2)bluhm2019-11-071-40/+63
| | | | | | | | | | SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFDSTADDR, SIOCGIFBRDADDR, SIOCSIFADDR, SIOCSIFNETMASK, SIOCSIFDSTADDR, and SIOCSIFBRDADDR. Name in_ioctl_set_ifaddr() consistently. Use in_sa2sin() to validate inet address. Combine if_addrlist loops and add comment. Although netmask is not an inet address, length must be valid. Reported-by: syzbot+5fc6da002fc4e8d994be@syzkaller.appspotmail.com OK visa@
* Avoid NULL dereference in arpinvalidate() and nd6_invalidate() bykrw2019-11-071-1/+3
| | | | | | making RTM_INVALIDATE code path perform same check as RTM_DELETE does. ok mpi@
* turn the linkstate hooks into a task list, like the detach hooks.dlg2019-11-071-47/+27
| | | | | | | | | | | | | | | this is largely mechanical, except for carp. this moves the addition of the carp link state hook after we're committed to using the new interface as a carpdev. because the add can't fail, we avoid a complicated unwind dance. also, this tweaks the carp linkstate hook so it only updates the relevant carp interface, not all of the carpdevs on the parent. hrvoje popovski has tested an early version of this diff and it's generally ok, but there's some splasserts that this diff fires that i'll fix in an upcoming diff. ok claudio@
* replace the hooks used with if_detachhooks with a task list.dlg2019-11-061-14/+8
| | | | | | | | | | | | | | | | | | | | | | | | the main semantic change is that things registering detach hooks have to allocate and set a task structure that then gets added to the list. this means if the task is allocated up front (eg, as part of carp's softc or bridge's port structure), it avoids the possibility that adding a hook can fail. a lot of drivers weren't checking for failure, and unwinding state in the event of failure in other parts was error prone. while doing this i discovered that the list operations have to be in a particular order, but drivers weren't doing that consistently either. this diff wraps the list ops up so you have to seriously go out of your way to screw them up. i've also sprinkled some NET_ASSERT_LOCKED around the list operations so we can make sure there's no potential for the list to be corrupted, especially while it's being run. hrvoje popovski has tested this a bit, and some issues he discovered have been fixed. ok sashan@
* remove mobileip(4)dlg2019-11-043-34/+4
| | | | | | | no one seems to use it, and we should not encourage people to use it by having it available. it's been disabled for most of the last release and no one's asked for it in 6.6, so i'm taking that as an ok for this removal.
* make whitespace in the IPPROTO defines consistent. no functional change.dlg2019-10-251-13/+13
|
* +#define IPPROTO_UDPLITE 136, as per RFC 3828 and the IANA allocationdlg2019-10-251-1/+2
| | | | | please don't interpret this as an intention on my part to implement UDP-Lite.
* Kernel is missing proper input validation when configuring addresses.bluhm2019-10-232-34/+66
| | | | | | Fix the SIOCAIFADDR and SIOCDIFADDR ioctl(2) by implementing in_sa2sin() to validate inet address family and address length. OK visa@
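Validating the address family and length before treating a generic socket address as an inet address might look like the sketch below. The struct layouts, the sk_ names and the error codes are stand-ins chosen so the example is self-contained; they are assumptions, not the kernel's in_sa2sin().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_AF_INET 2		/* stand-in for AF_INET */

struct sk_sockaddr {		/* stand-in for struct sockaddr */
	uint8_t sa_len;
	uint8_t sa_family;
	char sa_data[14];
};

struct sk_sockaddr_in {		/* stand-in for struct sockaddr_in */
	uint8_t sin_len;
	uint8_t sin_family;
	uint16_t sin_port;
	uint32_t sin_addr;
	char sin_zero[8];
};

/*
 * Hypothetical validator: hand back a typed pointer only if the family
 * and length match.  The caller must pass suitably aligned storage.
 */
static int
sk_sa2sin(struct sk_sockaddr *sa, struct sk_sockaddr_in **sinp)
{
	if (sa->sa_family != SK_AF_INET)
		return EAFNOSUPPORT;	/* assumed error code */
	if (sa->sa_len != sizeof(struct sk_sockaddr_in))
		return EINVAL;		/* assumed error code */
	*sinp = (struct sk_sockaddr_in *)sa;
	return 0;
}
```

Doing the conversion through one checked helper means each ioctl handler gets the same family and length checks instead of repeating them by hand.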
* in6_setsockaddr and in6_setpeeraddr can't fail, so let them return void.dlg2019-10-171-3/+3
| | | | | | this also brings them in line with the AF_INET equivalents. ok visa@ bluhm@
* tsleep(9) -> tsleep_nsec(9)mpi2019-10-161-2/+3
| | | | ok cheloha@, visa@