path: root/sys/netinet/tcp_output.c
Commit message | Author | Age | Files | Lines
* Remove maxburst feature from tcp_output | jan | 2021-02-08 | 1 | -3/+2
  OK bluhm@, claudio@, deraadt@
* if stoeplitz is enabled, use it to provide a flowid for tcp packets. | dlg | 2021-01-25 | 1 | -1/+6
  drivers that implement rss and multiple rings depend on the symmetric toeplitz code, and use it to generate a key that decides which rx ring a packet lands on. if the toeplitz code is enabled, this diff has the pcb and tcp layer use the toeplitz code to generate a flowid for packets they send, which in turn is used to pick a tx ring. because the nic and the stack use the same key, the tx and rx sides end up with the same hash/flowid. at the very least this means that the same rx and tx queue pair on a particular nic are used for both sides of the connection. as the stack becomes more parallel, it will also help keep both sides of the tcp connection processing in the one place.
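  The symmetry this commit relies on can be illustrated in user-space C. This is a minimal sketch, not the kernel's stoeplitz code: the `sketch_` names, the 40-byte key length, and the IPv4 tuple layout (source address, destination address, source port, destination port) are all assumptions made for illustration. It fills a Toeplitz key with a repeating 16-bit pattern, so the 32-bit key window at any input bit depends only on that bit's position modulo 16. Swapping the two 32-bit addresses and the two 16-bit ports therefore leaves the hash unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KEY_LEN 40	/* assumed key size, common for RSS keys */

/* Fill the key with one 16-bit pattern repeated end to end. */
static void
sketch_fill_symmetric_key(uint8_t key[SKETCH_KEY_LEN], uint16_t pattern)
{
	size_t i;

	for (i = 0; i < SKETCH_KEY_LEN; i += 2) {
		key[i] = pattern >> 8;
		key[i + 1] = pattern & 0xff;
	}
}

/*
 * Plain Toeplitz hash: for every set bit in the input, XOR in the
 * 32-bit window of the key that starts at that bit position.
 */
static uint32_t
sketch_toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	/* The window slides 8 * len bits, so the key must be long enough. */
	assert(len + 4 <= SKETCH_KEY_LEN);

	window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
	    (uint32_t)key[2] << 8 | (uint32_t)key[3];

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window one bit further along the key. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return hash;
}
```

  Hashing the tuple of a packet and the tuple of its reply with the same key gives the same value, which is why one hash can select a matching rx and tx queue pair.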
* Do not translate the EACCES error from pf(4) to EHOSTUNREACH anymore. | bluhm | 2018-11-10 | 1 | -3/+1
  It also translated a documented send(2) EACCES case erroneously. This was too much magic and always prone to errors. from Jan Klemkow; man page jmc@; OK claudio@
* M_LEADINGSPACE() and M_TRAILINGSPACE() are just wrappers for | claudio | 2018-11-09 | 1 | -2/+2
  m_leadingspace() and m_trailingspace(). Convert all callers to call the functions directly and remove the defines. OK krw@, mpi@
* Add reference counting for inet pcb, this will be needed when we | bluhm | 2018-09-13 | 1 | -2/+6
  start locking the socket. An inp can be referenced by the PCB queue and hashes, by a pf mbuf header, or by a pf state key. OK visa@
* The output from tcp debug sockets was incomplete. After detach tp | bluhm | 2018-06-11 | 1 | -2/+2
  was NULL and nothing was traced. So save the old tcpcb and use that to retrieve some information. Note that otb may be freed and must not be dereferenced. Use a heuristic for cases where the address family is in the IP header but not provided in the PCB. OK visa@
* Historically there were slow and fast tcp timeouts. That is why | bluhm | 2018-05-08 | 1 | -5/+5
  the delack timer had a different implementation. Use the same mechanism for all TCP timers. OK mpi@ visa@
* Remove the TCP_FACK option and associated #if{,n}def code. | job | 2017-10-25 | 1 | -43/+1
  TCP_FACK was disabled by provos@ in June 1999. TCP_FACK is an algorithm that decides that when something is lost, all packets not SACKed up to the most forward SACK are lost. It may be a correct estimate if the network does not reorder packets. OK visa@ mpi@ mikeb@
* Unconditionally enable TCP selective acknowledgements (SACK) | mikeb | 2017-10-22 | 1 | -51/+18
  OK deraadt, mpi, visa, job
* Assert that the corresponding socket is locked when manipulating socket | mpi | 2017-06-26 | 1 | -2/+2
  buffers. This is one step towards unlocking TCP input path. Note that all the functions asserting for the socket lock are not necessarily MP-safe. All the fields of 'struct socket' aren't protected. Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to tell when a filter needs to lock the underlying data structures. Logic and name taken from NetBSD. Tested by Hrvoje Popovski. ok claudio@, bluhm@, mikeb@
* Merge the content of <netinet/tcpip.h> and <netinet6/tcpipv6.h> in | mpi | 2017-05-18 | 1 | -2/+1
  <netinet/tcp_debug.h>. The IPv6 variant was always included and the IPv4 version is not present on all systems. Most of the offending ports are already fixed, thanks to sthen@!
* percpu counters for TCP stats | jca | 2017-02-09 | 1 | -19/+17
  ok mpi@ bluhm@
* Plug an mbuf leak in the error path of tcp signature in tcp_output(). | bluhm | 2016-07-19 | 1 | -3/+7
  OK claudio@ henning@
* On localhost a user program may create a socket splicing loop. | bluhm | 2016-06-13 | 1 | -1/+4
  After writing data into this loop, it was spinning forever causing a kernel hang. Detect the loop by counting how often the same mbuf is spliced. If that happens 128 times, assume that there is a loop and abort the splicing with ELOOP. Bug found by tedu@; OK tedu@ millert@ benno@
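  The counting scheme described in this commit can be sketched as follows. This is an illustration only, not the kernel change: `struct sketch_pkt`, `sketch_splice()`, and the exact boundary (failing on the 128th splice rather than the 129th) are assumptions. Only the limit of 128 and the ELOOP error come from the commit message.

```c
#include <errno.h>

#define SKETCH_SPLICE_MAX 128	/* limit quoted in the commit message */

/* Hypothetical stand-in for the per-mbuf splice counter. */
struct sketch_pkt {
	unsigned int splice_count;
};

/*
 * Count one more splice of the same packet. Returns 0 while the
 * count is below the limit and ELOOP once the limit is reached,
 * at which point the caller is expected to abort the splicing.
 */
static int
sketch_splice(struct sketch_pkt *p)
{
	if (++p->splice_count >= SKETCH_SPLICE_MAX)
		return ELOOP;
	return 0;
}
```

  A bounded counter like this turns an endless spin into a detectable error without having to track which sockets form the cycle.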
* upgrade tcp/ip to use the latest in C89 technology: memcpy. | tedu | 2015-12-05 | 1 | -5/+5
  ok henning
* Ignore Router Advertisement's current hop limit. | mpi | 2015-10-24 | 1 | -2/+2
  Apart from the usual inet6 axe murdering exercise to keep you fit, this allows us to get rid of a lot of layer violation due to the use of per-ifp variables to store the current hop limit. Inputs from bluhm@, ok phessler@, florian@, bluhm@
* Kill yet another argument to functions in IPv6. This time ip6_output's | claudio | 2015-09-11 | 1 | -2/+2
  ifpp - XXX: just for statistics. ifpp is always NULL in all callers, so that statistic confirms ifpp is dying. OK mpi@
* Avoid a situation where we do not set the tcp persist timer after | bluhm | 2015-07-13 | 1 | -1/+27
  a zero window condition. If you send a 0-length packet, but there is data in the socket buffer, and neither the rexmt nor the persist timer is already set, then activate the persist timer. From FreeBSD revision 284941; OK deraadt@ markus@ mikeb@ claudio@
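  The condition in this commit message reduces to a single check. The sketch below is a hedged illustration: `struct sketch_conn`, its fields, and `sketch_maybe_arm_persist()` are hypothetical names, not the tcpcb or timer API used in tcp_output.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the connection state involved. */
struct sketch_conn {
	size_t sndbuf_bytes;	/* data still waiting in the send buffer */
	bool rexmt_armed;	/* retransmit timer is running */
	bool persist_armed;	/* persist timer is running */
};

/*
 * After sending a segment of sent_len payload bytes, arm the persist
 * timer if nothing was sent, data is still queued, and no timer would
 * otherwise wake the connection up. Returns true if it was armed.
 */
static bool
sketch_maybe_arm_persist(struct sketch_conn *c, size_t sent_len)
{
	if (sent_len == 0 && c->sndbuf_bytes > 0 &&
	    !c->rexmt_armed && !c->persist_armed) {
		c->persist_armed = true;
		return true;
	}
	return false;
}
```

  Without such a check, a connection that hits a zero window with queued data and no running timer has nothing left to trigger another send attempt.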
* Get rid of the undocumented & temporary* m_copy() macro added for | mpi | 2015-06-30 | 1 | -2/+3
  compatibility with 4.3BSD in September 1989. *Pick your own definition for "temporary". ok bluhm@, claudio@, dlg@
* Store a unique ID, an interface index, rather than a pointer to the | mpi | 2015-06-16 | 1 | -2/+2
  receiving interface in the packet header of every mbuf. The interface pointer should now be retrieved when necessary with if_get(). If a NULL pointer is returned by if_get(), the interface has probably been destroyed/removed and the mbuf should be freed. Such a mechanism will simplify garbage collection of mbufs and limit problems with dangling ifp pointers. Tested by jmatthew@ and krw@, discussed with many. ok mikeb@, bluhm@, dlg@
* Replace a bunch of == 0 with == NULL in pointer tests. Nuke some | krw | 2015-06-07 | 1 | -13/+13
  annoying trailing, leading and embedded whitespace. No change to .o files. ok deraadt@
* Remove some includes include-what-you-use claims don't | jsg | 2015-03-14 | 1 | -2/+1
  have any direct symbols used. Tested for indirect use by compiling amd64/i386/sparc64 kernels. ok tedu@ deraadt@
* unifdef INET in net code as a precursor to removing the pretend option. | tedu | 2014-12-19 | 1 | -7/+1
  long live the one true internet. ok henning mikeb
* Fewer <netinet/in_systm.h> ! | mpi | 2014-07-22 | 1 | -2/+1
* ip_output() using varargs always struck me as bizarre, esp since it's only | henning | 2014-04-21 | 1 | -2/+2
  ever used to pass on uint32 (for ipsec). stop that madness and just pass the uint32, 0 in all cases but the two that pass the ipsec flowinfo. ok deraadt reyk guenther
* "struct pkthdr" holds a routing table ID, not a routing domain one. | mpi | 2014-04-14 | 1 | -3/+3
  Avoid the confusion by using an appropriate name for the variable. Note that since routing domain IDs are a subset of the set of routing table IDs, the following idiom is correct:

      rtableid = rdomain

  But to get the routing domain ID corresponding to a given routing table ID, you must call rtable_l2(9). claudio@ likes it, ok mikeb@
* Retire kernel support for SO_DONTROUTE, this time without breaking | mpi | 2014-04-07 | 1 | -6/+3
  localhost connections. The plan is to always use the routing table for addresses and routes resolutions, so there is no future for an option that wants to bypass it. This option has never been implemented for IPv6 anyway, so let's just remove the IPv4 bits that you weren't aware of. Tested at least by lteo@, guenther@ and chrisz@, ok mikeb@, benno@
* revert "Retire kernel support for SO_DONTROUTE" diff, which does bad things | sthen | 2014-03-28 | 1 | -3/+6
  for localhost connections. discussed with deraadt@
* Retire kernel support for SO_DONTROUTE, since the plan is to always | mpi | 2014-03-27 | 1 | -6/+3
  use the routing table there's no future for an option that wants to bypass it. This option has never been implemented for IPv6 anyway, so let's just remove the IPv4 bits that you weren't aware of. Tested by florian@, man pages inputs from jmc@, ok benno@
* Remove the number of in6_var.h inclusions by moving some functions and | mpi | 2013-10-24 | 1 | -6/+1
  global variables to in6.h. ok deraadt@
* make in_proto_cksum_out not rely on the pseudo header checksum to be | henning | 2013-10-19 | 1 | -23/+3
  already there, just compute it - it's dirt cheap. since that happens very late in ip_output, the rest of the stack doesn't have to care about checksums at all any more, if something needs to be checksummed, just set the flag on the pkthdr mbuf to indicate so. stop pre-computing the pseudo header checksum and incrementally updating it in the tcp and udp stacks. ok lteo florian
* Add the TCP socket option TCP_NOPUSH to delay sending the stream. | bluhm | 2013-08-12 | 1 | -3/+4
  This is useful to aggregate data in the kernel from multiple sources like writes and socket splicing. It avoids sending small packets. From FreeBSD via David Hill; OK mikeb@ henning@
* Link pf states and socket inpcbs together more tightly. The linking | bluhm | 2013-06-03 | 1 | -1/+7
  was only done when a packet traveled up the stack from pf to tcp_input(). Now also link the state and inpcb when the packet is going down from tcp_output() to pf. As a consequence, divert-reply states where the initial SYN does not get an answer can be handled more correctly. This change is part of a larger diff that was backed out in 2011. Bring the feature back in small steps to see when bad things start to happen. OK henning deraadt
* spltdb() was really just #define'd to be splsoftnet(); replace the former | blambert | 2012-09-20 | 1 | -3/+1
  with the latter. no change in md5 checksum of generated files. ok claudio@ henning@
* Revert the pf->socket linking diff. | oga | 2011-05-13 | 1 | -7/+1
  at least krw@, pirofti@ and todd@ have been seeing panics (todd and krw with xxxterm, not sure about pirofti) involving pool corruption while using this commit. krw and todd confirm that this backout fixes the problem.
  ok blambert@ krw@, todd@ henning@ and kettenis@

  Double link between pf states and sockets. Henning has already implemented half of it. The additional part is:
  - The pf state lookup for outgoing packets is optimized by using mbuf->inp->state.
  - For incoming tcp, udp, raw, raw6 packets the socket lookup always is optimized by using mbuf->state->inp.
  - All protocols establish the link for incoming packets.
  - All protocols set the inp in the mbuf for outgoing packets. This allows the linkage beginning with the first packet for outgoing connections.
  - In case of divert states, delete the state when the socket closes. Otherwise new connections could match on old states instead of being diverted to the listen socket.
  ok henning@
* Double link between pf states and sockets. Henning has already | bluhm | 2011-04-24 | 1 | -1/+7
  implemented half of it. The additional part is:
  - The pf state lookup for outgoing packets is optimized by using mbuf->inp->state.
  - For incoming tcp, udp, raw, raw6 packets the socket lookup always is optimized by using mbuf->state->inp.
  - All protocols establish the link for incoming packets.
  - All protocols set the inp in the mbuf for outgoing packets. This allows the linkage beginning with the first packet for outgoing connections.
  - In case of divert states, delete the state when the socket closes. Otherwise new connections could match on old states instead of being diverted to the listen socket.
  ok henning@
* mechanic rename M_{TCP|UDP}V4_CSUM_OUT -> M_{TCP|UDP}_CSUM_OUT | henning | 2011-04-05 | 1 | -2/+2
  ok claudio krw
* Add socket option SO_SPLICE to splice together two TCP sockets. | bluhm | 2011-01-07 | 1 | -1/+7
  The data received on the source socket will automatically be sent on the drain socket. This allows to write relay daemons with zero data copy. ok markus@
* TCP send and recv buffer scaling. | claudio | 2010-09-24 | 1 | -1/+8
  Send buffer is scaled by not accounting unacknowledged on the wire data against the buffer limit. Receive buffer scaling is done similar to FreeBSD -- measure the delay * bandwidth product and base the buffer on that. The problem is that our RTT measurement is coarse so it overshoots on low delay links. This does not matter that much since the recvbuffer is almost always empty. Add a back pressure mechanism to control the amount of memory assigned to socketbuffers that kicks in when 80% of the cluster pool is used. Increases the download speed from 300kB/s to 4.4MB/s on ftp.eu.openbsd.org. Based on work by markus@ and djm@. OK dlg@, henning@, put it in deraadt@
* Return EACCES when pf_test() blocks a packet in ip_output(). This allows | claudio | 2010-09-08 | 1 | -1/+3
  ip_forward() to know the difference between blocked packets and those that can't be forwarded (EHOSTUNREACH). Only in the latter case an ICMP should be sent. In the other callers of ip_output() change the error back to EHOSTUNREACH since userland may not expect EACCES on a sendto(). OK henning@, markus@
* Add support for using IPsec in multiple rdomains. | reyk | 2010-07-09 | 1 | -2/+3
  This allows to run isakmpd/iked/ipsecctl in multiple rdomains independently (with "route exec"); the kernel will pickup the rdomain from the process context of the pfkey socket and load the flows and SAs into the matching rdomain encap routing table. The network stack also needs to pass the rdomain to the ipsec stack to lookup the correct rdomain that belongs to an interface/mbuf/... You can now run individual IPsec configs per rdomain or create IPsec VPNs between multiple rdomains on the same machine ;). Note that a primary enc(4) in addition to enc0 interface is required per rdomain, eg. enc1 rdomain 1. Test by some people, mostly on existing "rdomain 0" setups. Was in snaps for some days and people didn't complain. ok claudio@ naddy@
* Fix the naming of interfaces and variables for rdomains and rtables | guenther | 2010-07-03 | 1 | -2/+2
  and make it possible to bind sockets (including listening sockets!) to rtables and not just rdomains. This changes the name of the system calls, socket option, and ioctl. After building with this you should remove the files /usr/share/man/cat2/[gs]etrdomain.0. Since this removes the existing [gs]etrdomain() system calls, the libc major is bumped. Written by claudio@, criticized^Wcritiqued by me
* Make sure the temporary buffer used to generate tcp options is properly | kettenis | 2010-05-28 | 1 | -2/+3
  aligned, otherwise we lose on strict alignment architecture. Should fix problems with gcc4 compiled bsd.rd's that people see on sparc64. ok millert@, beck@, jsing@
* Initial support for routing domains. This allows to bind interfaces to | claudio | 2009-06-05 | 1 | -1/+4
  alternate routing table and separate them from other interfaces in distinct routing tables. The same network can now be used in any domain at the same time without causing conflicts. This diff is mostly mechanical and adds the necessary rdomain checks across net and netinet. L2 and IPv4 are mostly covered, still missing pf and IPv6. input and tested by jsg@, phessler@ and reyk@. "put it in" deraadt@
* do not set the pkthdr mbuf state key pointer to the state key saved in the | henning | 2008-09-03 | 1 | -2/+1
  pcb. the state key ptr in the pcb is the one that had to be used by pf outbound. but by convention the state key pointer in the pkthdr is the one used INbound, so pf follows its reverse pointer to find the sk to use, and since a reverse doesn't exist for locally terminated connections the reverse pointer is null and thus the whole game a noop. note that this only affects packets FROM local udp/tcp sockets, for the other direction everything works as expected.
* link pf state keys to tcp pcbs and vice versa. | henning | 2008-07-03 | 1 | -1/+2
  when we first do a pcb lookup and we have a pointer to a pf state key in the mbuf header, store the state key pointer in the pcb and a pointer to the pcb we just found in the state key. when either the state key or the pcb is removed, clear the pointers. on subsequent packets inbound we can skip the pcb lookup and just use the pointer from the state key. on subsequent packets outbound we can skip the state key lookup and use the pointer from the pcb. about 8% speedup with 100 concurrent tcp sessions, should help much more with more tcp sessions. ok markus ryan
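  The two-way link this commit describes amounts to a pair of back pointers that are cleared together when either side goes away. The sketch below uses hypothetical `sketch_` types and function names. It is not the pf or inpcb code, only an illustration of the pointer discipline.

```c
#include <stddef.h>

struct sketch_pcb;

/* Hypothetical stand-ins for a pf state key and a protocol control block. */
struct sketch_state_key {
	struct sketch_pcb *inp;		/* pcb found for this state, or NULL */
};

struct sketch_pcb {
	struct sketch_state_key *sk;	/* state key seen for this pcb, or NULL */
};

/* Record the association in both directions after the first lookup. */
static void
sketch_link(struct sketch_pcb *pcb, struct sketch_state_key *sk)
{
	pcb->sk = sk;
	sk->inp = pcb;
}

/* Called when the pcb is removed: clear both pointers. */
static void
sketch_unlink_pcb(struct sketch_pcb *pcb)
{
	if (pcb->sk != NULL) {
		pcb->sk->inp = NULL;
		pcb->sk = NULL;
	}
}

/* Called when the state key is removed: clear both pointers. */
static void
sketch_unlink_state_key(struct sketch_state_key *sk)
{
	if (sk->inp != NULL) {
		sk->inp->sk = NULL;
		sk->inp = NULL;
	}
}
```

  Clearing both ends on removal is what keeps the shortcut safe: a later packet either finds a valid cached pointer or finds NULL and falls back to the normal lookup.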
* no EOL between tcpsig and sack headers; ok jsing, frantzen | markus | 2008-06-28 | 1 | -2/+2
* Remove some crazy #if mess. | jsing | 2008-06-12 | 1 | -5/+1
  ok markus@ henning@
* ANSIfy function definitions. | jsing | 2008-06-12 | 1 | -3/+2
  ok markus@ mcbride@ henning@ deraadt@
* some spelling fixes from Martynas Venckus | jmc | 2007-11-24 | 1 | -2/+2