aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2012-07-27tcp: perform DMA to userspace only if there is a task waiting for itJiri Kosina1-1/+3
Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT") added support for receive offloading to IOAT dma engine if available. The code in tcp_rcv_established() tries to perform early DMA copy if applicable. It however does so without checking whether the userspace task is actually expecting the data in the buffer. This is not a problem under normal circumstances, but there is a corner case where this doesn't work -- and that's when MSG_TRUNC flag to recvmsg() is used. If the IOAT dma engine is not used, the code properly checks whether there is a valid ucopy.task and the socket is owned by userspace, but misses the check in the dmaengine case. This problem can be observed in real trivially -- for example 'tbench' is a good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they have been already early-copied in tcp_rcv_established() using dma engine. This patch introduces the same check we are performing in the simple iovec copy case to the IOAT case as well. It fixes the indefinite recvmsg(MSG_TRUNC) hangs. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27ipv4: fix TCP early demuxEric Dumazet3-8/+8
commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.) invalidated TCP early demux, because rx_dst_ifindex is not properly initialized and checked. Also remove the use of inet_iif(skb) in favor or skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27tcp: Add TCP_USER_TIMEOUT negative value checkHangbin Liu1-1/+4
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75) didn't check the negative values. If a user assign -1 to it, the socket will set successfully and wait for 4294967295 miliseconds. This patch add a negative value check to avoid this issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26ipv6: Early TCP socket demuxEric Dumazet1-0/+1
This is the IPv6 missing bits for infrastructure added in commit 41063e9dd1195 (ipv4: Early TCP socket demux.) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26ipv4: Fix input route performance regression.David S. Miller6-35/+31
With the routing cache removal we lost the "noref" code paths on input, and this can kill some routing workloads. Reinstate the noref path when we hit a cached route in the FIB nexthops. With help from Eric Dumazet. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-25ipv4: rt_cache_valid must check expired routesEric Dumazet1-3/+5
commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) introduced rt_cache_valid() helper. It unfortunately doesn't check if route is expired before caching it. I noticed sk_setup_caps() was constantly called on a tcp workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-24tcp: early_demux fixesEric Dumazet2-8/+6
1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux() and fix a potential bug if skb->head was reallocated (iph & th pointers were not reloaded) TCP stack will pull/check headers anyway. 2) must reload iph in ip_rcv_finish() after early_demux() call since skb->head might have changed. 3) skb->dev->ifindex can be now replaced by skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Change rt->rt_iif encoding.David S. Miller1-4/+4
On input packet processing, rt->rt_iif will be zero if we should use skb->dev->ifindex. Since we access rt->rt_iif consistently via inet_iif(), that is the only spot whose interpretation have to adjust. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Prepare for change of rt->rt_iif encoding.David S. Miller5-23/+22
Use inet_iif() consistently, and for TCP record the input interface of cached RX dst in inet sock. rt->rt_iif is going to be encoded differently, so that we can legitimately cache input routes in the FIB info more aggressively. When the input interface is "use SKB device index" the rt->rt_iif will be set to zero. This forces us to move the TCP RX dst cache installation into the ipv4 specific code, and as well it should since doing the route caching for ipv6 is pointless at the moment since it is not inspected in the ipv6 input paths yet. Also, remove the unlikely on dst->obsolete, all ipv4 dsts have obsolete set to a non-zero value to force invocation of the check callback. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Remove all RTCF_DIRECTSRC handliing.David S. Miller1-9/+2
The last and final kernel user, ICMP address replies, has been removed. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Really ignore ICMP address requests/replies.David S. Miller1-82/+2
Alexey removed kernel side support for requests, and the only thing we do for replies is log a message if something doesn't look right. As Alexey's comment indicates, this belongs in userspace (if anywhere), and thus we can safely just get rid of this code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.Saurabh1-5/+5
With CONFIG_SPARSE_RCU_POINTER=y sparse identified references which did not specificy __rcu in ip_vti.c Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Remove redundant assignmentLin Ming1-1/+0
It is redundant to set no_addr and accept_local to 0 and then set them with other values just after that. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23tcp: dont drop MTU reduction indicationsEric Dumazet2-5/+20
ICMP messages generated in output path if frame length is bigger than mtu are actually lost because socket is owned by user (doing the xmit) One example is the ipgre_tunnel_xmit() calling icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on sockets hitting dst_allfrag). Problem of such fix is that it relied on retransmit timers, so short tcp sessions paid a too big latency increase price. This patch uses the tcp_release_cb() infrastructure so that MTU reduction messages (ICMP messages) are not lost, and no extra delay is added in TCP transmits. Reported-by: Maciej Żenczykowski <maze@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23tcp: avoid oops in tcp_metrics and reset tcpm_stampJulian Anastasov1-2/+3
In tcp_tw_remember_stamp we incorrectly checked tw instead of tm, it can lead to oops if the cached entry is not found. tcpm_stamp was not updated in tcpm_check_stamp when tcpm_suck_dst was called, move the update into tcpm_suck_dst, so that we do not call it infinitely on every next cache hit after TCP_METRICS_TIMEOUT. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22Merge branch 'kill_rtcache'David S. Miller15-1155/+242
The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. The general flow of this patch series is that first the routing cache is removed. We build a completely new rtable entry every lookup request. Next we make some simplifications due to the fact that removing the routing cache causes several members of struct rtable to become no longer necessary. Then we need to make some amends such that we can legally cache pre-constructed routes in the FIB nexthops. Firstly, we need to invalidate routes which are hit with nexthop exceptions. Secondly we have to change the semantics of rt->rt_gateway such that zero means that the destination is on-link and non-zero otherwise. Now that the preparations are ready, we start caching precomputed routes in the FIB nexthops. Output and input routes need different kinds of care when determining if we can legally do such caching or not. The details are in the commit log messages for those changes. The patch series then winds down with some more struct rtable simplifications and other tidy ups that remove unnecessary overhead. On a SPARC-T3 output route lookups are ~876 cycles. Input route lookups are ~1169 cycles with rpfilter disabled, and about ~1468 cycles with rpfilter enabled. These measurements were taken with the kbench_mod test module in the net_test_tools GIT tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git That GIT tree also includes a udpflood tester tool and stresses route lookups on packet output. For example, on the same SPARC-T3 system we can run: time ./udpflood -l 10000000 10.2.2.11 with routing cache: real 1m21.955s user 0m6.530s sys 1m15.390s without routing cache: real 1m31.678s user 0m6.520s sys 1m25.140s Performance undoubtedly can easily be improved further. For example fib_table_lookup() performs a lot of excessive computations with all the masking and shifting, some of it conditionalized to deal with edge cases. Also, Eric's no-ref optimization for input route lookups can be re-instated for the FIB nexthop caching code path. I would be really pleased if someone would work on that. In fact anyone suitable motivated can just fire up perf on the loading of the test net_test_tools benchmark kernel module. I spend much of my time going: bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2 bash# perf report Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben Hutchings, and others. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22ipv4: tcp: set unicast_sock uc_ttl to -1Eric Dumazet1-1/+2
Set unicast_sock uc_ttl to -1 so that we select the right ttl, instead of sending packets with a 0 ttl. Bug added in commit be9f4a44e7d4 (ipv4: tcp: remove per net tcp_sock) Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill rt->fiDavid S. Miller1-31/+1
It's not really needed. We only grabbed a reference to the fib_info for the sake of fib_info local metrics. However, fib_info objects are freed using RCU, as are therefore their private metrics (if any). We would have triggered a route cache flush if we eliminated a reference to a fib_info object in the routing tables. Therefore, any existing cached routes will first check and see that they have been invalidated before an errant reference to these metric values would occur. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Turn rt->rt_route_iif into rt->rt_is_input.David S. Miller2-6/+6
That is this value's only use, as a boolean to indicate whether a route is an input route or not. So implement it that way, using a u16 gap present in the struct already. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill rt->rt_oifDavid S. Miller3-8/+5
Never actually used. It was being set on output routes to the original OIF specified in the flow key used for the lookup. Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of the flowi4_oif and flowi4_iif values, thanks to feedback from Julian Anastasov. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Dirty less cache lines in route caching paths.David S. Miller1-3/+3
Don't bother incrementing dst->__use and setting dst->lastuse, they are completely pointless and just slow things down. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.David S. Miller3-9/+3
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Cache input routes in fib_info nexthops.David S. Miller2-12/+45
Caching input routes is slightly simpler than output routes, since we don't need to be concerned with nexthop exceptions. (locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us). However, we have to elide caching for the DIRECTSRC and non-zero itag cases. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Cache output routes in fib_info nexthops.David S. Miller2-43/+99
If we have an output route that lacks nexthop exceptions, we can cache it in the FIB info nexthop. Such routes will have DST_HOST cleared because such routes refer to a family of destinations, rather than just one. The sequence of the handling of exceptions during route lookup is adjusted to make the logic work properly. Before we allocate the route, we lookup the exception. Then we know if we will cache this route or not, and therefore whether DST_HOST should be set on the allocated route. Then we use DST_HOST to key off whether we should store the resulting route, during rt_set_nexthop(), in the FIB nexthop cache. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill routes during PMTU/redirect updates.David S. Miller1-12/+29
Mark them obsolete so there will be a re-lookup to fetch the FIB nexthop exception info. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20net: Document dst->obsolete better.David S. Miller1-2/+3
Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Adjust semantics of rt->rt_gateway.David S. Miller7-17/+18
In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
2012-07-20ipv4: Remove 'rt_dst' from 'struct rtable'David S. Miller2-37/+9
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Remove 'rt_mark' from 'struct rtable'David Miller3-9/+3
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill 'rt_src' from 'struct rtable'David Miller2-20/+15
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Remove rt_key_{src,dst,tos} from struct rtable.David Miller2-33/+9
They are always used in contexts where they can be reconstituted, or where the finally resolved rt->rt_{src,dst} is semantically equivalent. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill ip_route_input_noref().David Miller5-10/+10
The "noref" argument to ip_route_input_common() is now always ignored because we do not cache routes, and in that case we must always grab a reference to the resulting 'dst'. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Delete routing cache.David S. Miller2-932/+13
The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: show pmtu in route listJulian Anastasov1-1/+5
Override the metrics with rt_pmtu Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20tcp: improve latencies of timer triggered eventsEric Dumazet2-50/+66
Modern TCP stack highly depends on tcp_write_timer() having a small latency, but current implementation doesn't exactly meet the expectations. When a timer fires but finds the socket is owned by the user, it rearms itself for an additional delay hoping next run will be more successful. tcp_write_timer() for example uses a 50ms delay for next try, and it defeats many attempts to get predictable TCP behavior in term of latencies. Use the recently introduced tcp_release_cb(), so that the user owning the socket will call various handlers right before socket release. This will permit us to post a followup patch to address the tcp_tso_should_defer() syndrome (some deferred packets have to wait RTO timer to be transmitted, while cwnd should allow us to send them sooner) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20tcp: fix ABC in tcp_slow_start()Eric Dumazet1-2/+3
When/if sysctl_tcp_abc > 1, we expect to increase cwnd by 2 if the received ACK acknowledges more than 2*MSS bytes, in tcp_slow_start() Problem is this RFC 3465 statement is not correctly coded, as the while () loop increases snd_cwnd one by one. Add a new variable to avoid this off-by one error. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: John Heffner <johnwheffner@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20tcp: use hash_32() in tcp_metricsEric Dumazet1-15/+10
Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not guaranteed to be a power of two. Uses hash_32() instead of custom hash. tcpmhash_entries should be an unsigned int. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20tcp: Return bool instead of int where appropriateVijay Subramanian1-8/+8
Applied to a set of static inline functions in tcp_input.c Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19ipv4: Fix again the time difference calculationJulian Anastasov1-1/+1
Fix again the diff value in rt_bind_exception after collision of two latest patches, my original commit actually fixed the same problem. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+4
Conflicts: drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19net-tcp: Fast Open client - cookie-less modeYuchung Cheng2-3/+11
In trusted networks, e.g., intranet, data-center, the client does not need to use Fast Open cookie to mitigate DoS attacks. In cookie-less mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless of cookie availability. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - detecting SYN-data dropsYuchung Cheng3-6/+33
On paths with firewalls dropping SYN with data or experimental TCP options, Fast Open connections will have experience SYN timeout and bad performance. The solution is to track such incidents in the cookie cache and disables Fast Open temporarily. Since only the original SYN includes data and/or Fast Open option, the SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect such drops. If a path has recurring Fast Open SYN drops, Fast Open is disabled for 2^(recurring_losses) minutes starting from four minutes up to roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but it behaves as connect() then write(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)Yuchung Cheng3-10/+73
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2) and write(2). The application should replace connect() with it to send data in the opening SYN packet. For blocking socket, sendmsg() blocks until all the data are buffered locally and the handshake is completed like connect() call. It returns similar errno like connect() if the TCP handshake fails. For non-blocking socket, it returns the number of bytes queued (and transmitted in the SYN-data packet) if cookie is available. If cookie is not available, it transmits a data-less SYN packet with Fast Open cookie request option and returns -EINPROGRESS like connect(). Using MSG_FASTOPEN on connecting or connected socket will result in simlar errno like repeating connect() calls. Therefore the application should only use this flag on new sockets. The buffer size of sendmsg() is independent of the MSS of the connection. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - receiving SYN-ACKYuchung Cheng1-5/+35
On receiving the SYN-ACK after SYN-data, the client needs to a) update the cached MSS and cookie (if included in SYN-ACK) b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of the handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - sending SYN-dataYuchung Cheng3-11/+115
This patch implements sending SYN-data in tcp_connect(). The data is from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch). The length of the cookie in tcp_fastopen_req, init'd to 0, controls the type of the SYN. If the cookie is not cached (len==0), the host sends data-less SYN with Fast Open cookie request option to solicit a cookie from the remote. If cookie is not available (len > 0), the host sends a SYN-data with Fast Open cookie option. If cookie length is negative, the SYN will not include any Fast Open option (for fall back operations). To deal with middleboxes that may drop SYN with data or experimental TCP option, the SYN-data is only sent once. SYN retransmits do not include data or Fast Open options. The connection will fall back to regular TCP handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - cookie cacheYuchung Cheng1-0/+51
With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache. The basic ones are MSS and the cookies. Later patch will cache more to handle unfriendly middleboxes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open baseYuchung Cheng8-13/+66
This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19ipv4: tcp: remove per net tcp_sockEric Dumazet2-23/+35
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19ipv4: use seqlock for nh_exceptionsJulian Anastasov1-50/+68
Use global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19ipv4: Fix time difference calculation in rt_bind_exception().David S. Miller1-1/+1
Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>