path: root/net/ipv4/tcp_ipv4.c
Each entry below lists the commit date, subject, author, and diffstat (files changed, lines -removed/+added).
2017-04-24  net/tcp_fastopen: Disable active side TFO in certain scenarios  (Wei Wang, 1 file, -0/+3)
Middlebox firewall issues can potentially cause a server's data to be blackholed after a successful 3WHS using TFO. Following are the related reports from Apple:

https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X     S   [retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation, where the server's data gets dropped after the 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S   [retry and timeout]

This is the worst failure, because the client cannot detect such behavior in order to mitigate the situation (such as by disabling TFO). Failing to proceed, the application (e.g., an SSL library) may simply time out and retry with TFO again, and the process repeats indefinitely.

The proposed solution is to disable active TFO globally under the following circumstances:
1. a client-side TFO socket detects an out-of-order FIN
2. a client-side TFO socket receives an out-of-order RST

We disable active-side TFO globally for 1 hour at first. If it happens again, we disable it for 2 hours, then 4 hours, 8 hours, and so on. We reset the timeout to 1 hour if a client-side TFO socket not opened on loopback has successfully received data segments from a server, and we check this condition during close().

The rationale behind it is that when such a firewall issue happens, the application running on the client should eventually close the socket, as it is not able to get the data it is expecting, or the application running on the server should close the socket, as it is not able to receive any response from the client. In both cases, an out-of-order FIN or RST will be received on the client, given that the firewall will not block them, as no data is carried in those frames. We want to disable active TFO globally because it helps if the middlebox is very close to the client and most of the connections are likely to fail.

Also, add a debug sysctl, tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when the firewall blackhole issue happens. It can be set and read; setting it to 0 disables the active-disable logic.

Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
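For illustration only, a minimal C sketch of the exponential back-off policy described above; all identifiers (blackhole_timeout_sec, disable_times, and so on) are hypothetical and this is not the kernel's actual implementation.

/* Hypothetical sketch: each blackhole event doubles the period during
 * which active TFO is disabled (1h, 2h, 4h, ...); a successful TFO data
 * exchange observed at close() resets the state.
 */
#include <stdbool.h>
#include <time.h>

static unsigned int blackhole_timeout_sec = 3600; /* the new sysctl knob; 0 turns detection off */
static unsigned int disable_times;                /* consecutive blackhole events seen */
static time_t disable_stamp;                      /* when active TFO was last disabled */

static void tfo_active_disable(void)              /* out-of-order FIN/RST seen on a TFO socket */
{
	if (!blackhole_timeout_sec)
		return;
	disable_times++;
	disable_stamp = time(NULL);
}

static bool tfo_active_disabled(void)             /* checked before attempting an active TFO connect */
{
	time_t period;

	if (!disable_times)
		return false;
	period = (time_t)blackhole_timeout_sec << (disable_times - 1); /* 1h, 2h, 4h, ... */
	return (time(NULL) - disable_stamp) < period;
}

static void tfo_active_disable_reset(void)        /* non-loopback TFO socket received data from server */
{
	disable_times = 0;
}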
2017-04-23  Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  (Ingo Molnar, 1 file, -1/+1)
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Parallelize SRCU callback handling (plus overlapping patches).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-18  mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU  (Paul E. McKenney, 1 file, -1/+1)
A group of Linux kernel hackers reported chasing a bug that resulted from their assumption that SLAB_DESTROY_BY_RCU provided an existence guarantee, that is, that no block from such a slab would be reallocated during an RCU read-side critical section. Of course, that is not the case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire slab of blocks. However, there is a phrase for this, namely "type safety". This commit therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order to avoid future instances of this sort of confusion. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> [ paulmck: Add comments mentioning the old name, as requested by Eric Dumazet, in order to help people familiar with the old name find the new one. ] Acked-by: David Rientjes <rientjes@google.com>
2017-03-24  tcp: Record Rx hash and NAPI ID in tcp_child_process  (Alexander Duyck, 1 file, -2/+0)
While working on some recent busy poll changes we found that child sockets were being instantiated without NAPI ID being set. In our first attempt to fix it, it was suggested that we should just pull programming the NAPI ID into the function itself since all callers will need to have it set. In addition to the NAPI ID change I have dropped the code that was populating the Rx hash since it was actually being populated in tcp_get_cookie_sock. Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16  tcp: remove tcp_tw_recycle  (Soheil Hassas Yeganeh, 1 file, -13/+2)
tcp_tw_recycle was already broken for connections behind NAT, since the per-destination timestamp is not monotonically increasing for multiple machines behind a single destination address. After the randomization of TCP timestamp offsets in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection), tcp_tw_recycle is broken for all types of connections for the same reason: the timestamps received from a single machine are no longer monotonically increasing.

Remove tcp_tw_recycle, since it is not functional. Also, remove the PAWSPassive SNMP counter, since it is only used for tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req, since the strict argument is only set when tcp_tw_recycle is enabled.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16  tcp: remove per-destination timestamp cache  (Soheil Hassas Yeganeh, 1 file, -4/+0)
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection) randomizes TCP timestamps per connection. After this commit, there is no guarantee that the timestamps received from the same destination are monotonically increasing. As a result, the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts in struct tcp_metrics_block) is broken and cannot be relied upon. Remove the per-destination timestamp cache and all related code paths. Note that this cache was already broken for caching timestamps of multiple machines behind a NAT sharing the same address. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -3/+7)
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c
Conflicts were overlapping changes in bcmgenet and the lockdep handling of sockets.
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13  dccp/tcp: fix routing redirect race  (Jon Maxwell, 1 file, -1/+2)
As Eric Dumazet pointed out, this also needs to be fixed in IPv6. v2: contains the IPv6 tcp/dccp patches as well.

We have seen a few incidents lately where a dst_entry has been freed with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that dst_entry. If the conditions/timings are right, a crash then ensues when the freed dst_entry is referenced later on. A common crashing backtrace is:

#8  [] page_fault at ffffffff8163e648 [exception RIP: __tcp_ack_snd_check+74]
    ...
#9  [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well. The freed dst_entry is found here:

static bool tcp_in_quickack_mode(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct dst_entry *dst = __sk_dst_get(sk);

	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
		(icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
}

But there are other backtraces attributed to the same freed dst_entry in netfilter code as well.

All the vmcores showed two significant clues:
- Remote hosts behind the default gateway had always been redirected to a different gateway. A rtable/dst_entry will be added for each such host, creating more dst_entries with lower reference counts and making the race more probable.
- All vmcores showed a positive LockDroppedIcmps value, e.g.: LockDroppedIcmps 267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run regardless of whether user space has the socket locked. This can result in a race condition where the same dst_entry cached in sk->sk_dst_entry can be decremented twice for the same socket via do_redirect()->__sk_dst_check()->dst_release(), which leads to the dst_entry being prematurely freed with another socket pointing to it via sk->sk_dst_cache, and a subsequent crash.

To fix this, skip do_redirect() if user space has the socket locked. Instead, let the redirect take place later, when user space does not have the socket locked. The dccp/IPv6 code is very similar in this respect, so it is fixed there too.

As Eric Garver pointed out, the following commit now invalidates routes, which can set the dst->obsolete flag so that ipv4_dst_check() returns null and triggers the dst_release().

Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.") Cc: Eric Garver <egarver@redhat.com> Cc: Hannes Sowa <hsowa@redhat.com> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
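A simplified sketch of the fix described above; the helper shape is illustrative rather than the exact upstream hunk, but the key test is sock_owned_by_user().

/* Sketch: in the ICMP error handler (tcp_v4_err() and the dccp/IPv6
 * equivalents), only act on a redirect when user space does not hold
 * the socket lock; otherwise the redirect is handled later, once the
 * socket is released, so the cached dst cannot be released twice.
 */
#include <net/sock.h>
#include <net/dst.h>

static void handle_redirect(struct sock *sk, struct sk_buff *icmp_skb)
{
	if (sock_owned_by_user(sk))
		return;	/* skip: avoid racing __sk_dst_check()/dst_release() */

	do_redirect(icmp_skb, sk);	/* local helper in tcp_ipv4.c */
}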
2017-03-09  tcp: rename *_sequence_number() to *_seq_and_tsoff()  (Alexey Kodanev, 1 file, -11/+11)
The functions that return the tcp sequence number also set up the TS offset value, so rename them to better describe their purpose. No functional changes in this patch. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07  tcp: fix various issues for sockets morphing to listen state  (Eric Dumazet, 1 file, -2/+5)
Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting the tcp_disconnect() path that was never really considered and/or used before syzkaller ;)

I was not able to reproduce the bug, but it seems the issues here are the three possible actions that assumed they would never trigger on a listener:
1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only the IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN states from tcp_v6_mtu_reduced().

Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22  tcp: setup timestamp offset when write_seq already set  (Alexey Kodanev, 1 file, -6/+10)
Found that when randomized tcp offsets are enabled (the default), a TCP client can still start new connections without them. Later, if the server does an active close and re-uses sockets in TIME-WAIT state, a new SYN from the client can be rejected by the PAWS check inside tcp_timewait_state_process(), because neither tw_ts_recent nor rcv_tsval really has an offset set.

Here is how to reproduce it with the LTP netstress tool:
netstress -R 1 &
netstress -H 127.0.0.1 -lr 1000000 -a1
[...]
< S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
> .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
< R  seq 1956911535 win 0 length 0
+1.
< S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
> S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640

Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03  tcp: add tcp_mss_clamp() helper  (Eric Dumazet, 1 file, -4/+1)
Small cleanup factorizing code doing the TCP_MAXSEG clamping. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25  net/tcp-fastopen: Add new API support  (Wei Wang, 1 file, -1/+6)
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an alternative way to perform Fast Open on the active side (client). Prior to this patch, a client needs to replace the connect() call with sendto(MSG_FASTOPEN). This can be cumbersome for applications that want to use Fast Open: these socket operations are often done in lower-layer libraries used by many other applications, and changing these libraries and/or the socket call sequences is not trivial.

A more convenient approach is to perform Fast Open by simply enabling a socket option when the socket is created, without changing the rest of the socket call sequence:

s = socket()
    create a new socket
setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt. If set, the new functionality described below will be used. Returns ENOTSUPP if TFO is not supported or not enabled in the kernel.
connect()
    With a cookie present, return 0 immediately. With no cookie, initiate the 3WHS with the TFO cookie-request option and return -1 with errno = EINPROGRESS.
write()/sendmsg()
    With a cookie present, send out the SYN with data and return the number of bytes buffered. With no cookie, and the 3WHS not yet completed, return -1 with errno = EINPROGRESS. No MSG_FASTOPEN flag is needed.
read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() was called but write() was not called yet. Return -1 with errno = EWOULDBLOCK/EAGAIN if the connection is established but no msg has been received yet. Return the number of bytes read if the socket is established and a msg has been received.

The new API simplifies life for applications that always perform a write() immediately after a successful connect(). Such applications can now take advantage of Fast Open by merely making one new setsockopt() call at the time of creating the socket. Nothing else about the application's socket call sequence needs to change.

Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
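Since this entry describes a user-space API, here is a hedged usage sketch (assumes headers new enough to define TCP_FASTOPEN_CONNECT; error handling omitted; the destination address is a placeholder):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30	/* Linux sockopt value; define if headers are older */
#endif

int tfo_client(const struct sockaddr_in *server)
{
	int one = 1;
	int s = socket(AF_INET, SOCK_STREAM, 0);

	/* Opt in once at socket creation; no sendto(MSG_FASTOPEN) needed later. */
	setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, &one, sizeof(one));

	/* Returns 0 right away if a TFO cookie is cached; otherwise starts a
	 * regular 3WHS with a cookie request (non-blocking case: EINPROGRESS). */
	connect(s, (const struct sockaddr *)server, sizeof(*server));

	/* With a cookie present, this data rides in the SYN. */
	write(s, "ping", 4);
	return s;
}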
2017-01-25  tcp: reduce skb overhead in selected places  (Eric Dumazet, 1 file, -2/+1)
tcp_add_backlog() can use the skb_condense() helper to get better gains and less SKB_TRUESIZE() magic. This only happens when the socket backlog has to be used.

Some attacks involve specially crafted out-of-order tiny TCP packets, clogging the ofo queue of (many) sockets. Then later, an expensive collapse happens, trying to copy all these skbs into single ones. This unfortunately does not work if each skb has no neighbor in TCP sequence order.

By using skb_condense() when the skb could not be coalesced to a prior one, we defeat this kind of threat, potentially saving 4K per skb (or more, since this is one page fragment). A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes in skb->head, meaning the copy done by skb_condense() is limited to about 200 bytes.

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20  net: remove bh disabling around percpu_counter accesses  (Eric Dumazet, 1 file, -2/+0)
Shaohua Li made percpu_counter irq safe in commit 098faf5805c8 ("percpu_counter: make APIs irq safe"). We can safely remove the BH disable/enable sections around various percpu_counter manipulations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18  inet: drop ->bind_conflict  (Josef Bacik, 1 file, -1/+0)
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict is how they check the rcv_saddr, so delete this callback and simply change inet_csk_bind_conflict to call inet_rcv_saddr_equal. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13  tcp: remove early retransmit  (Yuchung Cheng, 1 file, -1/+0)
This patch removes the support of RFC5827 early retransmit (i.e., fast recovery on small inflight with <3 dupacks) because it is subsumed by the new RACK loss detection. More specifically when RACK receives DUPACKs, it'll arm a reordering timer to start fast recovery after a quarter of (min)RTT, hence it covers the early retransmit except RACK does not limit itself to specific inflight or dupack numbers. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13  tcp: add reordering timer in RACK loss detection  (Yuchung Cheng, 1 file, -0/+1)
This patch makes RACK install a reordering timer when it suspects some packets might be lost, but wants to delay the decision a little bit to accommodate reordering.

It does not create a new timer but instead repurposes the existing RTO timer, because both are meant to retransmit packets. Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when the RACK timing check fails. The wait time is set to

    RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

This translates to expecting that a packet (Packet) should take (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent. When there are multiple packets that need a timer, we use one timer with the maximum timeout. Therefore the timer conservatively uses the maximum window to expire N packets by one timeout, instead of N timeouts to expire N packets sent at different times.

The fudge factor is 2 jiffies, to ensure that when the timer fires, all the suspected packets will have exceeded the deadline and be marked lost by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the clock may tick between calling icsk_reset_xmit_timer(timeout) and the timer actually being armed. The next jiffy is to lower-bound the timeout to 2 jiffies when reo_wnd is < 1ms.

When the reordering timer fires (tcp_rack_reo_timeout): if we aren't in Recovery, we enter fast recovery and force a fast retransmit. This is very similar to early retransmit (RFC5827), except RACK is not constrained to only enter recovery for small outstanding flights.

Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
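A small C sketch of the wait-time formula quoted above; the function name and parameters are illustrative, and times are in jiffies as in the text.

/* timeout = RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge */
static long rack_reo_timeout(unsigned long rack_rtt, unsigned long reo_wnd,
			     unsigned long xmit_time, unsigned long now)
{
	const long fudge = 2;	/* jiffies; lower-bounds the timer as described */
	long timeout = (long)(rack_rtt + reo_wnd) - (long)(now - xmit_time) + fudge;

	/* A non-positive value means the packet is already past its deadline
	 * and can be marked lost right away instead of arming the
	 * ICSK_TIME_REO_TIMEOUT timer. */
	return timeout;
}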
2017-01-09  net: introduce keepalive function in struct proto  (Ursula Braun, 1 file, -0/+1)
Calling the tcp_set_keepalive() function directly from the protocol-agnostic sock_setsockopt() function in net/core/sock.c violates network layering, and the newly introduced SMC-R protocol will need its own keepalive function. Therefore, add a "keepalive" function pointer to "struct proto", and call it from sock_setsockopt() via this pointer. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
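Roughly, the layering change amounts to the sketch below (simplified helper, not the exact upstream hunk); tcp_prot fills the new pointer with tcp_set_keepalive.

#include <net/sock.h>

/* Sketch: sock_setsockopt()'s SO_KEEPALIVE handling, routed through the
 * new per-protocol callback instead of calling the TCP function directly.
 */
static void sock_do_keepalive(struct sock *sk, int valbool)
{
	if (sk->sk_prot->keepalive)
		sk->sk_prot->keepalive(sk, valbool);	/* TCP: tcp_set_keepalive */
	sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
}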
2016-12-29  ipv4: Namespaceify tcp_max_syn_backlog knob  (Haishuang Yan, 1 file, -2/+5)
Applications in different namespaces might require a different maximal number of remembered connection requests. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29  ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob  (Haishuang Yan, 1 file, -4/+8)
Applications in different namespaces might require fast recycling of TIME-WAIT sockets independently of the host. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27  ipv4: Namespaceify tcp_tw_reuse knob  (Haishuang Yan, 1 file, -2/+2)
Different namespaces might have different requirements for reusing TIME-WAIT sockets for new connections. This might be required where applications in different namespaces need the number of TIME-WAIT sockets to be reduced independently of the host. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05  tcp: tsq: move tsq_flags close to sk_wmem_alloc  (Eric Dumazet, 1 file, -1/+1)
tsq_flags being in the same cache line as sk_wmem_alloc makes a lot of sense: both fields are changed from tcp_wfree() and more generally by various TSQ related functions. A prior patch made room in struct sock and added sk_tsq_flags; this patch deletes tsq_flags from struct tcp_sock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02  tcp: randomize tcp timestamp offsets for each connection  (Florian Westphal, 1 file, -4/+5)
jiffies-based timestamps allow for easy inference of the number of devices behind NAT translators and also make tracking of hosts simpler. Commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset") added the main infrastructure that is needed for per-connection ts randomization; in particular, writing/reading the on-wire tcp header format takes the offset into account, so the rest of the stack can use normal tcp_time_stamp (jiffies).

So only two items are left:
- add a tsoffset for request sockets
- extend the tcp isn generator to also return another 32bit number in addition to the ISN

Re-use of the ISN generator also means timestamps are still monotonically increasing for the same connection quadruple, i.e. PAWS will still work. Includes fixes from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -1/+18)
Several cases of bug fixes in 'net' overlapping other changes in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13  tcp: take care of truncations done by sk_filter()  (Eric Dumazet, 1 file, -1/+18)
With syzkaller's help, Marco Grassi found a bug in the TCP stack, crashing in tcp_collapse(). The root cause is that sk_filter() can truncate the incoming skb, but the TCP stack was not really expecting this to happen; it probably was expecting a simple DROP or ACCEPT behavior. We first need to make sure no part of the TCP header can be removed. Then we need to adjust TCP_SKB_CB(skb)->end_seq. Many thanks to the syzkaller team and Marco for giving us a reproducer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Grassi <marco.gra@gmail.com> Reported-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
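A hedged sketch of the defensive pattern described above (illustrative helper shape, not the exact upstream code): cap the trim so the TCP header survives, then recompute end_seq from what is actually left.

#include <net/tcp.h>

static bool tcp_filter_sketch(struct sock *sk, struct sk_buff *skb)
{
	const struct tcphdr *th = (const struct tcphdr *)skb->data;
	unsigned int hdrlen = th->doff * 4;

	/* never allow the filter to trim below the TCP header itself */
	if (sk_filter_trim_cap(sk, skb, hdrlen))
		return true;		/* drop the packet */

	/* the payload may have shrunk: recompute end_seq accordingly */
	th = (const struct tcphdr *)skb->data;
	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + th->syn + th->fin +
				   skb->len - hdrlen;
	return false;
}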
2016-11-04  net: inet: Support UID-based routing in IP protocols.  (Lorenzo Colitti, 1 file, -3/+6)
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account.
- For packets not associated with a userspace socket (e.g., ping replies), use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespace by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -3/+0)
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20  tcp: relax listening_hash operations  (Eric Dumazet, 1 file, -3/+3)
softirq handlers use RCU protection to look up listeners, and write operations all happen from process context. We do not need to block BH for dump operations. Also, since SYN_RECV request sockets are stored in the ehash table:
1) inet_diag_dump_icsk() no longer needs to clear cb->args[3] and cb->args[4] that were used as cursors while iterating the old per-listener hash table.
2) Also factorize a test: no need to scan listening_hash[] if r->id.idiag_dport is not zero.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19  tcp: do not export sysctl_tcp_low_latency  (Eric Dumazet, 1 file, -1/+0)
Since commit b2fb4f54ecd4 ("tcp: uninline tcp_prequeue()") we no longer access sysctl_tcp_low_latency from a module. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18  tcp: Remove unused but set variable  (Tobias Klauser, 1 file, -2/+0)
Remove the unused but set variable icsk in listening_get_next to fix the following GCC warning when building with 'W=1': net/ipv4/tcp_ipv4.c: In function ‘listening_get_next’: net/ipv4/tcp_ipv4.c:1890:31: warning: variable ‘icsk’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-07  mm: memcontrol: consolidate cgroup socket tracking  (Johannes Weiner, 1 file, -3/+0)
The cgroup core and the memory controller need to track socket ownership for different purposes, but the tracking sites being entirely different is kind of ugly. Be a better citizen and rename the memory controller callbacks to match the cgroup core callbacks, then move them to the same place. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-25  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Pablo Neira Ayuso, 1 file, -1/+1)
Conflicts:
	net/netfilter/core.c
	net/netfilter/nf_tables_netdev.c
Resolve two conflicts before the pull request for David's net-next tree:
1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment") from the net tree and commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and Aaron Conole's patches to replace list_head with a single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25  netfilter: xt_socket: fix transparent match for IPv6 request sockets  (KOVACS Krisztian, 1 file, -1/+0)
The introduction of the TCP_NEW_SYN_RECV state, and the addition of request sockets to the ehash table, seems to have broken the --transparent option of the socket match for IPv6 (around commit a9407000). Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the listener, the --transparent option tries to match on the no_srccheck flag of the request socket. Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req(), by copying the transparent flag of the listener socket. This effectively causes '-m socket --transparent' not to match on the ACK packet sent by the client in a TCP handshake.

Based on the suggestion from Eric Dumazet, this change moves the code initializing no_srccheck to tcp_conn_request(), rendering the above scenario working again.

Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support") Signed-off-by: Alex Badics <alex.badics@balabit.com> Signed-off-by: KOVACS Krisztian <hidden@balabit.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-08  tcp: use an RB tree for ooo receive queue  (Yaogong Wang, 1 file, -1/+1)
Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering reaching the 2 Gbytes limit. Even with the current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS.

In the presence of packet losses (or reorders), TCP stores incoming packets into an out-of-order queue, and the number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to the receive queue, we scan the queue from its head. However, in the presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches.

This patch converts it to an RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added the ofo_last_skb cache, polishing and tests. Tested with the network dropping between 1 and 10 % of packets, with good success (about 30 % increase of throughput in stress tests).

Next step would be to also use an RB tree for the write queue at the sender side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
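For reference, a minimal sketch of inserting an out-of-order skb into a seq-ordered rb-tree, the structure this patch switches to; it is simplified (no coalescing, overlap handling or pruning) and the helper name is illustrative.

#include <linux/rbtree.h>
#include <net/tcp.h>

static void ooo_rb_insert(struct rb_root *root, struct sk_buff *skb)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;
	u32 seq = TCP_SKB_CB(skb)->seq;

	/* walk down to the insertion point, ordered by start sequence */
	while (*p) {
		struct sk_buff *skb1 = rb_entry(*p, struct sk_buff, rbnode);

		parent = *p;
		if (before(seq, TCP_SKB_CB(skb1)->seq))
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&skb->rbnode, parent, p);
	rb_insert_color(&skb->rbnode, root);
}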
2016-08-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -1/+7)
All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29  tcp: add tcp_add_backlog()  (Eric Dumazet, 1 file, -4/+29)
When TCP operates in lossy environments (between 1 and 10 % packet losses), many SACK blocks can be exchanged, and I noticed we could drop them on busy senders if these SACK blocks have to be queued into the socket backlog. While the main cause is the poor performance of RACK/SACK processing, we can try to avoid these drops of valuable information that can lead to spurious timeouts and retransmits.

The cause of the drops is the skb->truesize overestimation caused by:
- drivers allocating ~2048 (or more) bytes as a fragment to hold an Ethernet frame.
- various pskb_may_pull() calls bringing the headers into skb->head might have pulled all the frame content, but skb->truesize could not be lowered, as the stack has no idea of each fragment's truesize.

The backlog drops are also more visible on bidirectional flows, since their sk_rmem_alloc can be quite big. Let's add some room for the backlog, as only the socket owner can selectively take action to lower memory needs, like collapsing receive queues or partial ofo pruning.

Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
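A sketch of the headroom idea behind the new helper, under the assumption that the backlog may grow beyond sk_rcvbuf alone; this is simplified (the real helper also trims skb overhead before testing the limit) and the function name is illustrative.

#include <net/tcp.h>

static bool backlog_add_sketch(struct sock *sk, struct sk_buff *skb)
{
	/* only the socket owner can later collapse queues or prune ofo,
	 * so give the backlog both rcv and snd buffer budgets as headroom */
	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;

	if (unlikely(sk_add_backlog(sk, skb, limit))) {
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
		return true;	/* caller drops the skb */
	}
	return false;
}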
2016-08-25  tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter  (Eric Dumazet, 1 file, -0/+1)
Adds SNMP counter for drops caused by MD5 mismatches. The current syslog might help, but a counter is more precise and helps monitoring. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-25  tcp: md5: increment sk_drops on syn_recv state  (Eric Dumazet, 1 file, -0/+1)
TCP MD5 mismatches do increment sk_drops counter in all states but SYN_RECV. This is very unlikely to happen in the real world, but worth adding to help diagnostics. We increase the parent (listener) sk_drops. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23  tcp: properly scale window in tcp_v[46]_reqsk_send_ack()  (Eric Dumazet, 1 file, -1/+7)
When sending an ack in SYN_RECV state, we must scale the offered window if wscale option was negotiated and accepted.

Tested: the following packetdrill test demonstrates the issue:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0>
+0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7>
+0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100>
// check that window is properly scaled !
+0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100>

Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
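A hedged sketch of the fix (illustrative helper, not the exact upstream hunk): the window advertised while ACKing in SYN_RECV state comes from the request socket and is shifted down by the negotiated wscale before going on the wire.

#include <net/tcp.h>
#include <net/inet_sock.h>

static u16 synrecv_ack_window(const struct request_sock *req)
{
	/* rcv_wscale is 0 when scaling was not negotiated, so the shift
	 * is harmless in that case; e.g. 28960 >> 7 = 226 as in the test */
	u32 win = req->rsk_rcv_wnd >> inet_rsk(req)->rcv_wscale;

	return (u16)min_t(u32, win, 65535U);
}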
2016-07-01  tcp: md5: use kmalloc() backed scratch areas  (Eric Dumazet, 1 file, -17/+14)
Some arches have virtually mapped kernel stacks, or will soon have. tcp_md5_hash_header() uses an automatic variable to copy tcp header before mangling th->check and calling crypto function, which might be problematic on such arches. David says that using percpu storage is also problematic on non SMP builds. Just use kmalloc() to allocate scratch areas. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>
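A sketch of the general pattern this entry describes: per-cpu scratch buffers obtained from kmalloc() instead of on-stack copies, so crypto code never references a (possibly vmapped) kernel stack. Names and sizes are illustrative, not the kernel's own.

#include <linux/percpu.h>
#include <linux/slab.h>

#define MD5_SCRATCH_SIZE 128	/* assumed size, large enough for the copied header */

static DEFINE_PER_CPU(void *, md5_scratch);

static int md5_scratch_alloc(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		if (!per_cpu(md5_scratch, cpu)) {
			void *p = kmalloc(MD5_SCRATCH_SIZE, GFP_KERNEL);

			if (!p)
				return -ENOMEM;
			per_cpu(md5_scratch, cpu) = p;	/* heap memory, safe to hand to crypto */
		}
	}
	return 0;
}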
2016-05-16  tcp: minor optimizations around tcp_hdr() usage  (Eric Dumazet, 1 file, -3/+3)
tcp_hdr() is slightly more expensive than using skb->data in contexts where we know they point to the same byte. In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation, as tcp header has not been pulled yet. In output path, the same can be said when we just pushed the tcp header in the skb, in tcp_transmit_skb() and tcp_make_synack() Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(), so that compiler can further optimize and avoid a reload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06  ipv4: tcp: ip_send_unicast_reply() is not BH safe  (Eric Dumazet, 1 file, -4/+4)
I forgot that ip_send_unicast_reply() is not BH safe (yet). Disabling preemption before calling it was not a good move. Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04  tcp: two more missing bh disable  (Eric Dumazet, 1 file, -0/+2)
percpu_counters only have protection against preemption. The TCP stack uses them possibly from BH, so we need BH protection in contexts that could be run in process context. Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02  tcp: do not assume TCP code is non preemptible  (Eric Dumazet, 1 file, -5/+9)
We want to make the TCP stack preemptible, as draining the prequeue and backlog queues can take a lot of time. Many SNMP updates were assuming that BH (and preemption) was disabled, so we need to convert some __NET_INC_STATS() calls to NET_INC_STATS() and some __TCP_INC_STATS() to TCP_INC_STATS(). Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset() and tcp_v4_send_ack(), we add an explicit preempt-disabled section. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28  tcp: give prequeue mode some care  (Eric Dumazet, 1 file, -5/+5)
The TCP prequeue goal is to defer processing of incoming packets to a user-space thread currently blocked in a recvmsg() system call. The intent is to spend less time processing these packets on behalf of the softirq handler, as the softirq handler is unfair to normal process scheduler decisions: it might interrupt threads that do not even use networking.

The current prequeue implementation has the following issues:

1) It only checks the size of the prequeue against sk_rcvbuf. That was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity, but we now have ~8MB values to cope with modern networking needs. We have to add sk_rmem_alloc to the equation, since out-of-order packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, the prequeue can be filled by thousands of packets. When the prequeue needs to be flushed, either from softirq context (in tcp_prequeue() or timer code) or process context (in tcp_prequeue_process()), this adds a latency spike which is often not desirable. I added a fixed limit of 32 packets, as this translated to a max flush time of 60 us on my test hosts.

Also note that all packets in the prequeue are not accounted for in tcp_mem, since they are not charged against sk_forward_alloc at this point. This is probably not a big deal.

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts, which is misnamed, as packets are not dropped at all, but rather pushed to the stack (where they can be either consumed or dropped).

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
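A sketch of the two admission checks described above, assuming the pre-4.13 tp->ucopy prequeue fields; the helper itself is illustrative, not the patch's actual hunk.

#include <net/tcp.h>

static bool prequeue_admit(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* 1) account sk_rmem_alloc, not just the prequeue truesize */
	if (tp->ucopy.memory + atomic_read(&sk->sk_rmem_alloc) + skb->truesize >
	    (u32)sk->sk_rcvbuf)
		return false;

	/* 2) cap the queue length to bound the flush latency (~60 us observed) */
	if (skb_queue_len(&tp->ucopy.prequeue) >= 32)
		return false;

	return true;
}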
2016-04-27  net: rename NET_{ADD|INC}_STATS_BH()  (Eric Dumazet, 1 file, -11/+11)
Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: tcp: rename TCP_INC_STATS_BH  (Eric Dumazet, 1 file, -8/+8)
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename ICMP_INC_STATS_BH()  (Eric Dumazet, 1 file, -1/+1)
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15  tcp: do not mess with listener sk_wmem_alloc  (Eric Dumazet, 1 file, -2/+2)
When removing sk_refcnt manipulation on synflood, I missed that using skb_set_owner_w() was racy if sk->sk_wmem_alloc had already transitioned to 0. We should hold sk_refcnt instead, but this is a big deal under attack (doing so only increases performance from 3.2 Mpps to 3.8 Mpps). In this patch, I chose to not attach a socket to the syncookies skb. Performance is now 5 Mpps instead of 3.2 Mpps. A following patch will remove the last known false sharing in tcp_rcv_state_process(). Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>