aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4/tcp_minisocks.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-05-02tcp: do not assume TCP code is non preemptibleEric Dumazet1-1/+1
We want to to make TCP stack preemptible, as draining prequeue and backlog queues can take lot of time. Many SNMP updates were assuming that BH (and preemption) was disabled. Need to convert some __NET_INC_STATS() calls to NET_INC_STATS() and some __TCP_INC_STATS() to TCP_INC_STATS() Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset() and tcp_v4_send_ack(), we add an explicit preempt disabled section. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: rename NET_{ADD|INC}_STATS_BH()Eric Dumazet1-5/+5
Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: tcp: rename TCP_INC_STATS_BHEric Dumazet1-2/+2
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp: rate limit ACK sent by SYN_RECV request socketsEric Dumazet1-1/+4
Attackers like to use SYNFLOOD targeting one 5-tuple, as they hit a single RX queue (and cpu) on the victim. If they use random sequence numbers in their SYN, we detect they do not match the expected window and send back an ACK. This patch adds a rate limitation, so that the effect of such attacks is limited to ingress only. We roughly double our ability to absorb such attacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Maciej Żenczykowski <maze@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/InMartin KaFai Lau1-1/+1
Per RFC4898, they count segments sent/received containing a positive length data segment (that includes retransmission segments carrying data). Unlike tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments carrying no data (e.g. pure ack). The patch also updates the segs_in in tcp_fastopen_add_skb() so that segs_in >= data_segs_in property is kept. Together with retransmission data, tcpi_data_segs_out gives a better signal on the rxmit rate. v6: Rebase on the latest net-next v5: Eric pointed out that checking skb->len is still needed in tcp_fastopen_add_skb() because skb can carry a FIN without data. Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in() helper is used. Comment is added to the fastopen case to explain why segs_in has to be reset and tcp_segs_in() has to be called before __skb_pull(). v4: Add comment to the changes in tcp_fastopen_add_skb() and also add remark on this case in the commit message. v3: Add const modifier to the skb parameter in tcp_segs_in() v2: Rework based on recent fix by Eric: commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Cc: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+2
Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07tcp: fix tcpi_segs_in after connection establishmentEric Dumazet1-1/+2
If final packet (ACK) of 3WHS is lost, it appears we do not properly account the following incoming segment into tcpi_segs_in While we are at it, starts segs_in with one, to count the SYN packet. We do not yet count number of SYN we received for a request sock, we might add this someday. packetdrill script showing proper behavior after fix : // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.020 < P. 1:1001(1000) ack 1 win 32792 +0 accept(3, ..., ...) = 4 +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }% Fixes: 2efd055c53c06 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp syncookies sysctl knobNikolay Borisov1-3/+0
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22tcp: honour SO_BINDTODEVICE for TW_RST case tooFlorian Westphal1-5/+2
Hannes points out that when we generate tcp reset for timewait sockets we pretend we found no socket and pass NULL sk to tcp_vX_send_reset(). Make it cope with inet tw sockets and then provide tw sk. This makes RSTs appear on correct interface when SO_BINDTODEVICE is used. Packetdrill test case: // want default route to be used, we rely on BINDTODEVICE `ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 // test case still works due to BINDTODEVICE 0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0 0.100...0.200 connect(3, ..., ...) = 0 0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop> 0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop> 0.200 > . 1:1(0) ack 1 0.210 close(3) = 0 0.210 > F. 1:1(0) ack 1 win 29200 0.300 < . 1:1(0) ack 2 win 46 // more data while in FIN_WAIT2, expect RST 1.300 < P. 1:1001(1000) ack 1 win 46 // fails without this change -- default route is used 1.301 > R 1:1(0) win 0 Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05tcp: fix req->saved_syn raceEric Dumazet1-3/+0
For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix ireq->pktopts race"), we need to make sure we do not access req->saved_syn unless we own the request sock. This fixes races for listeners using TCP_SAVE_SYN option. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23tcp/dccp: fix hashdance race for passive sessionsEric Dumazet1-7/+4
Multiple cpus can process duplicates of incoming ACK messages matching a SYN_RECV request socket. This is a rare event under normal operations, but definitely can happen. Only one must win the race, otherwise corruption would occur. To fix this without adding new atomic ops, we use logic in inet_ehash_nolisten() to detect the request was present in the same ehash bucket where we try to insert the new child. If request socket was not found, we have to undo the child creation. This actually removes a spin_lock()/spin_unlock() pair in reqsk_queue_unlink() for the fast path. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track the packet timings in RACKYuchung Cheng1-0/+2
This patch is the first half of the RACK loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the previous FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked, then current retransmitted sequence below the newly sacked sequence must been lost, since at least one round trip time has elapsed. But it has several limitations: 1) can't detect tail drops since it depends on limited transmit 2) is disabled upon reordering (assumes no reordering) 3) only enabled in fast recovery ut not timeout recovery RACK (Recently ACK) addresses these limitations with the notion of time instead: a packet P1 is lost if a later packet P2 is s/acked, as at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when later retransmission is s/acked while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering. This patch implements tcp_advanced_rack() which tracks the most recent transmission time among the packets that have been delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp is the key to determine which packet has been lost. Consider an example that the sender sends six packets: T1: P1 (lost) T2: P2 T3: P3 T4: P4 T100: sack of P2. rack.mstamp = T2 T101: retransmit P1 T102: sack of P2,P3,P4. rack.mstamp = T4 T205: ACK of P4 since the hole is repaired. rack.mstamp = T101 We need to be careful about spurious retransmission because it may falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK to falsely mark all packets lost, just like a spurious timeout. We identify spurious retransmission by the ACK's TS echo value. If TS option is not applicable but the retransmission is acknowledged less than min-RTT ago, it is likely to be spurious. We refrain from using the transmission time of these spurious retransmissions. The second half is implemented in the next patch that marks packet lost using RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track min RTT using windowed min-filterYuchung Cheng1-0/+1
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12net: shrink struct sock and request_sock by 8 bytesEric Dumazet1-9/+9
One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11tcp: fix RFS vs lockless listenersEric Dumazet1-0/+1
Before recent TCP listener patches, we were updating listener sk->sk_rxhash before the cloning of master socket. children sk_rxhash was therefore correct after the normal 3WHS. But with lockless listener, we no longer dirty/change listener sk_rxhash as it would be racy. We need to correctly update the child sk_rxhash, otherwise first data packet wont hit correct cpu if RFS is used. Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp: remove BUG_ON() in tcp_check_req()Eric Dumazet1-2/+0
Once listener is lockless, its sk_state can change anytime. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: constify tcp_create_openreq_child() socket argumentEric Dumazet1-1/+3
This method does not touch the listener socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: remove tcp_rcv_state_process() tcp_hdr argumentEric Dumazet1-1/+1
Factorize code to get tcp header from skb. It makes no sense to duplicate code in callers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: remove unused len argument from tcp_rcv_state_process()Eric Dumazet1-2/+1
Once we realize tcp_rcv_synsent_state_process() does not use its 'len' argument and we get rid of it, then it becomes clear this argument is no longer used in tcp_rcv_state_process() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-7/+6
Conflicts: net/ipv4/arp.c The net/ipv4/arp.c conflict was one commit adding a new local variable while another commit was deleting one. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25tcp: constify tcp_openreq_init_rwin()Eric Dumazet1-10/+18
Soon, listener socket wont be locked when tcp_openreq_init_rwin() is called. We need to read socket fields once, as their value could change under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24tcp: factorize sk_txhash initEric Dumazet1-0/+1
Neal suggested to move sk_txhash init into tcp_create_openreq_child(), called both from IPv4 and IPv6. This opportunity was missed in commit 58d607d3e52f ("tcp: provide skb->hash to synack packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21tcp/dccp: fix timewait races in timer handlingEric Dumazet1-7/+6
When creating a timewait socket, we need to arm the timer before allowing other cpus to find it. The signal allowing cpus to find the socket is setting tw_refcnt to non zero value. As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to call inet_twsk_schedule() first. This also means we need to remove tw_refcnt changes from inet_twsk_schedule() and let the caller handle it. Note that because we use mod_timer_pinned(), we have the guarantee the timer wont expire before we set tw_refcnt as we run in BH context. To make things more readable I introduced inet_twsk_reschedule() helper. When rearming the timer, we can use mod_timer_pending() to make sure we do not rearm a canceled timer. Note: This bug can possibly trigger if packets of a flow can hit multiple cpus. This does not normally happen, unless flow steering is broken somehow. This explains this bug was spotted ~5 months after its introduction. A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(), but will be provided in a separate patch for proper tracking. Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21tcp: usec resolution SYN/ACK RTTYuchung Cheng1-1/+2
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK RTT is often measured as 0ms or sometimes 1ms, which would affect RTT estimation and min RTT samping used by some congestion control. This patch improves SYN/ACK RTT to be usec resolution if platform supports it. While the timestamping of SYN/ACK is done in request sock, the RTT measurement is carefully arranged to avoid storing another u64 timestamp in tcp_sock. For regular handshake w/o SYNACK retransmission, the RTT is sampled right after the child socket is created and right before the request sock is released (tcp_check_req() in tcp_minisocks.c) For Fast Open the child socket is already created when SYN/ACK was sent, the RTT is sampled in tcp_rcv_state_process() after processing the final ACK an right before the request socket is released. If the SYN/ACK was retransmistted or SYN-cookie was used, we rely on TCP timestamps to measure the RTT. The sample is taken at the same place in tcp_rcv_state_process() after the timestamp values are validated in tcp_validate_incoming(). Note that we do not store TS echo value in request_sock for SYN-cookies, because the value is already stored in tp->rx_opt used by tcp_ack_update_rtt(). One side benefit is that the RTT measurement now happens before initializing congestion control (of the passive side). Therefore the congestion control can use the SYN/ACK RTT. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09inet: inet_twsk_deschedule factorizationEric Dumazet1-4/+2
inet_twsk_deschedule() calls are followed by inet_twsk_put(). Only particular case is in inet_twsk_purge() but there is no point to defer the inet_twsk_put() after re-enabling BH. Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put() and move the inet_twsk_put() inside. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+4
Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31tcp: fix child sockets to use system default congestion control if not setNeal Cardwell1-1/+4
Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21tcp: add tcpi_segs_in and tcpi_segs_out to tcp_infoMarcelo Ricardo Leitner1-0/+1
This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use this number to have an idea on connection quality when compared against the retransmissions. RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut These are a 32bit field each and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Join work with Eric Dumazet. Note that received SYN are accounted on the listener, but sent SYNACK are not accounted. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp/ipv6: fix flow label setting in TIME_WAIT stateFlorent Fourcot1-1/+1
commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") added the flow label in the last TCP packets. Unfortunately, it was not casted properly. This patch replace the buggy shift with be32_to_cpu/cpu_to_be32. Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05tcp: provide SYN headers for passive connectionsEric Dumazet1-0/+3
This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such patch for about 3 years at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24inet: fix possible panic in reqsk_queue_unlink()Eric Dumazet1-3/+4
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243 There is a race when reqsk_timer_handler() and tcp_check_req() call inet_csk_reqsk_queue_unlink() on the same req at the same time. Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer"), listener spinlock was held and race could not happen. To solve this bug, we change reqsk_queue_unlink() to not assume req must be found, and we return a status, to conditionally release a refcount on the request sock. This also means tcp_check_req() in non fastopen case might or not consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have to properly handle this. (Same remark for dccp_check_req() and its callers) inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is called 4 times in tcp and 3 times in dccp. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13tcp/dccp: get rid of central timewait timerEric Dumazet1-26/+9
Using a timer wheel for timewait sockets was nice ~15 years ago when memory was expensive and machines had a single processor. This does not scale, code is ugly and source of huge latencies (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.) We can afford to use an extra 64 bytes per timewait sock and spread timewait load to all cpus to have better behavior. Tested: On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1 on the target (lpaa24) Before patch : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 419594 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 437171 While test is running, we can observe 25 or even 33 ms latencies. lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2 lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2 After patch : About 90% increase of throughput : lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 810442 lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0 800992 And latencies are kept to minimal values during this load, even if network utilization is 90% higher : lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23 ... 1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08tcp: do not rearm rsk_timer on FastOpen requestsEric Dumazet1-4/+10
FastOpen requests are not like other regular request sockets. They do not yet use rsk_timer : tcp_fastopen_queue_check() simply manually removes one expired request from fastopenq->rskq_rst list. Therefore, tcp_check_req() must not call mod_timer_pending(), otherwise we crash because rsk_timer was not initialized. Fixes: fa76ce7328b ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03ipv4: coding style: comparison for inequality with NULLIan Morris1-3/+3
The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03ipv4: coding style: comparison for equality with NULLIan Morris1-1/+1
The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20inet: get rid of central tcp/dccp listener timerEric Dumazet1-2/+3
One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20inet: drop prev pointer handling in request sockEric Dumazet1-3/+2
When request sock are put in ehash table, the whole notion of having a previous request to update dl_next is pointless. Also, following patch will get rid of big purge timer, so we want to delete a request sock without holding listener lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_timewait_sockNeal Cardwell1-5/+24
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is represented by a tcp_timewait_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_sockNeal Cardwell1-0/+1
Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challence ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_request_sockNeal Cardwell1-1/+5
In the SYN_RECV state, where the TCP connection is represented by tcp_request_sock, we now rate-limit SYNACKs in response to a client's retransmitted SYNs: we do not send a SYNACK in response to client SYN if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a SYNACK in response to a client's retransmitted SYN. This allows the vast majority of legitimate client connections to proceed unimpeded, even for the most aggressive platforms, iOS and MacOS, which actually retransmit SYNs 1-second intervals for several times in a row. They use SYN RTO timeouts following the progression: 1,1,1,1,1,2,4,8,16,32. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05net: tcp: add per route congestion controlDaniel Borkmann1-4/+26
This work adds the possibility to define a per route/destination congestion control algorithm. Generally, this opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics, even transparently for a single application listening on all links. For our specific use case, this additionally facilitates deployment of DCTCP, for example, applications can easily serve internal traffic/dsts in DCTCP and external one with CUBIC. Other scenarios would also allow for utilizing e.g. long living, low priority background flows for certain destinations/routes while still being able for normal traffic to utilize the default congestion control algorithm. We also thought about a per netns setting (where different defaults are possible), but given its actually a link specific property, we argue that a per route/destination setting is the most natural and flexible. The administrator can utilize this through ip-route(8) by appending "congctl [lock] <name>", where <name> denotes the name of a congestion control algorithm and the optional lock parameter allows to enforce the given algorithm so that applications in user space would not be allowed to overwrite that algorithm for that destination. The dst metric lookups are being done when a dst entry is already available in order to avoid a costly lookup and still before the algorithms are being initialized, thus overhead is very low when the feature is not being used. While the client side would need to drop the current reference on the module, on server side this can actually even be avoided as we just got a flat-copied socket clone. Joint work with Florian Westphal. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29tcp: change TCP_ECN prefixes to lower caseFlorian Westphal1-3/+3
Suggested by Stephen. Also drop inline keyword and let compiler decide. gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up. The actual evaluation is not inlined anymore while the ECN_OK test is. Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: tcp: assign tcp cong_ops when tcp sk is createdFlorian Westphal1-3/+2
Split assignment and initialization from one into two functions. This is required by followup patches that add Datacenter TCP (DCTCP) congestion control algorithm - we need to be able to determine if the connection is moderated by DCTCP before the 3WHS has finished. As we walk the available congestion control list during the assignment, we are always guaranteed to have Reno present as it's fixed compiled-in. Therefore, since we're doing the early assignment, we don't have a real use for the Reno alias tcp_init_congestion_ops anymore and can thus remove it. Actual usage of the congestion control operations are being made after the 3WHS has finished, in some cases however we can access get_info() via diag if implemented, therefore we need to zero out the private area for those modules. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05tcp: introduce TCP_SKB_CB(skb)->tcp_tw_isnEric Dumazet1-1/+1
TCP_SKB_CB(skb)->when has different meaning in output and input paths. In output path, it contains a timestamp. In input path, it contains an ISN, chosen by tcp_timewait_state_process() Lets add a different name to ease code comprehension. Note that 'when' field will disappear in following patch, as skb_mstamp already contains timestamp, the anonymous union will promptly disappear as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01inet: move ipv6only in sock_commonEric Dumazet1-1/+1
When an UDP application switches from AF_INET to AF_INET6 sockets, we have a small performance degradation for IPv4 communications because of extra cache line misses to access ipv6only information. This can also be noticed for TCP listeners, as ipv6_only_sock() is also used from __inet_lookup_listener()->compute_score() This is magnified when SO_REUSEPORT is used. Move ipv6only into struct sock_common so that it is available at no extra cost in lookups. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13tcp: use tcp_v4_send_synack on first SYN-ACKYuchung Cheng1-0/+31
To avoid large code duplication in IPv6, we need to first simplify the complicate SYN-ACK sending code in tcp_v4_conn_request(). To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to initialize the mini socket's receive window before trying to create the child socket and/or building the SYN-ACK packet. So we move that initialization from tcp_make_synack() to tcp_v4_conn_request() as a new function tcp_openreq_init_req_rwin(). After this refactoring the SYN-ACK sending code is simpler and easier to implement Fast Open for IPv6. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Daniel Lee <longinus00@gmail.com> Signed-off-by: Jerry Chu <hkchu@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11net: Fix use after free by removing length arg from sk_data_ready callbacks.David S. Miller1-1/+1
Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26tcp: switch rtt estimations to usec resolutionEric Dumazet1-2/+2
Upcoming congestion controls for TCP require usec resolution for RTT estimations. Millisecond resolution is simply not enough these days. FQ/pacing in DC environments also require this change for finer control and removal of bimodal behavior due to the current hack in tcp_update_pacing_rate() for 'small rtt' TCP_CONG_RTT_STAMP is no longer needed. As Julian Anastasov pointed out, we need to keep user compatibility : tcp_metrics used to export RTT and RTTVAR in msec resolution, so we added RTT_US and RTTVAR_US. An iproute2 patch is needed to use the new attributes if provided by the kernel. In this example ss command displays a srtt of 32 usecs (10Gbit link) lpk51:~# ./ss -i dst lpk52 Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port tcp ESTAB 0 1 10.246.11.51:42959 10.246.11.52:64614 cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559 Updated iproute2 ip command displays : lpk51:~# ./ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51 Old binary displays : lpk51:~# ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51 With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Larry Brakmo <brakmo@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAITFlorent Fourcot1-0/+1
This patch is following the commit b903d324bee262 (ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT). For the same reason than tclass, we have to store the flow label in the inet_timewait_sock to provide consistency of flow label on the last ACK. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: David S. Miller <davem@davemloft.net>