path: root/net/ipv4/tcp_output.c
Age | Commit message | Author | Files | Lines
2019-02-26 | tcp: remove tcp_queue argument from tso_fragment() | Eric Dumazet | 1 | -7/+6
tso_fragment() is only called for packets still in the write queue. Remove the tcp_queue parameter to make this more obvious, even though the comment already states it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 | tcp: convert tcp_md5_needed to static_branch API | Eric Dumazet | 1 | -2/+2
We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
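For illustration, the conversion pattern looks roughly like this (a sketch modeled on the description above, not the exact kernel diff; the lookup helper is simplified):

#include <linux/jump_label.h>

/* Defaults to false; flipped on the first time an MD5 key is installed. */
DEFINE_STATIC_KEY_FALSE(tcp_md5_needed);

/* Old style:  if (static_key_false(&tcp_md5_needed)) ...
 * New style:  if (static_branch_unlikely(&tcp_md5_needed)) ...
 * Both compile to a runtime-patched branch, so the common no-MD5 case
 * pays no load and test on tp->md5sig_info.
 */
static inline struct tcp_md5sig_key *
tcp_md5_do_lookup(const struct sock *sk, const union tcp_md5_addr *addr,
		  int family)
{
	if (static_branch_unlikely(&tcp_md5_needed))
		return __tcp_md5_do_lookup(sk, addr, family);
	return NULL;
}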
2019-02-24 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -0/+1
Three conflicts, one of which, for marvell10g.c is non-trivial and requires some follow-up from Heiner or someone else. The issue is that Heiner converted the marvell10g driver over to use the generic c45 code as much as possible. However, in 'net' a bug fix appeared which makes sure that a new local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0 is cleared. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-23 | tcp: repaired skbs must init their tso_segs | Eric Dumazet | 1 | -0/+1
syzbot reported a WARN_ON(!tcp_skb_pcount(skb)) in tcp_send_loss_probe() [1]. This was caused by TCP_REPAIR-sent skbs that were inadvertently missing a call to tcp_init_tso_segs().

[1]
WARNING: CPU: 1 PID: 0 at net/ipv4/tcp_output.c:2534 tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc7+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 panic+0x2cb/0x65c kernel/panic.c:214
 __warn.cold+0x20/0x45 kernel/panic.c:571
 report_bug+0x263/0x2b0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 fixup_bug arch/x86/kernel/traps.c:173 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
Code: 88 fc ff ff 4c 89 ef e8 ed 75 c8 fb e9 c8 fc ff ff e8 43 76 c8 fb e9 63 fd ff ff e8 d9 75 c8 fb e9 94 f9 ff ff e8 bf 03 91 fb <0f> 0b e9 7d fa ff ff e8 b3 03 91 fb 0f b6 1d 37 43 7a 03 31 ff 89
RSP: 0018:ffff8880ae907c60 EFLAGS: 00010206
RAX: ffff8880a989c340 RBX: 0000000000000000 RCX: ffffffff85dedbdb
RDX: 0000000000000100 RSI: ffffffff85dee0b1 RDI: 0000000000000005
RBP: ffff8880ae907c90 R08: ffff8880a989c340 R09: ffffed10147d1ae1
R10: ffffed10147d1ae0 R11: ffff8880a3e8d703 R12: ffff888091b90040
R13: ffff8880a3e8d540 R14: 0000000000008000 R15: ffff888091b90860
 tcp_write_timer_handler+0x5c0/0x8a0 net/ipv4/tcp_timer.c:583
 tcp_write_timer+0x10e/0x1d0 net/ipv4/tcp_timer.c:607
 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
 expire_timers kernel/time/timer.c:1362 [inline]
 __run_timers kernel/time/timer.c:1681 [inline]
 __run_timers kernel/time/timer.c:1649 [inline]
 run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
 __do_softirq+0x266/0x95a kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
 </IRQ>
RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 0c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 0c a1 fa eb 82 90 90 90 90 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
RSP: 0018:ffff8880a98afd78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1125061 RBX: ffff8880a989c340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880a989cbbc
RBP: ffff8880a98afda8 R08: ffff8880a989c340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff889282f8 R14: 0000000000000001 R15: 0000000000000000
 arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555
 default_idle_call+0x36/0x90 kernel/sched/idle.c:93
 cpuidle_idle_call kernel/sched/idle.c:153 [inline]
 do_idle+0x386/0x570 kernel/sched/idle.c:262
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353
 start_secondary+0x404/0x5c0 arch/x86/kernel/smpboot.c:271
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
Kernel Offset: disabled
Rebooting in 86400 seconds..
Fixes: 79861919b889 ("tcp: fix TCP_REPAIR xmit queue setup") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27 | tcp: change pingpong threshold to 3 | Wei Wang | 1 | -6/+9
In order to be more confident about an on-going interactive session, we increment the pingpong count by 1 for every interactive transaction and adjust TCP_PINGPONG_THRESH to 3. This means we only consider a session to be in pingpong mode after we see 3 interactive transactions, and only then start to activate delayed acks in quick ack mode. And in order to not over-count the credits, we only increase the pingpong count for the first packet sent in response to the previously received packet. This is mainly to prevent delaying the ack immediately after some handshake protocol when no real interactive traffic pattern follows. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
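A rough userspace model of the accounting described above (names are illustrative, not the kernel's):

#include <stdbool.h>

#define TCP_PINGPONG_THRESH 3

struct conn {
	unsigned int pingpong_cnt;	/* interactive transactions seen */
	bool counted_this_reply;	/* already credited this response? */
};

static void on_receive(struct conn *c)
{
	c->counted_this_reply = false;	/* a new transaction may start */
}

static void on_send(struct conn *c)
{
	/* Credit only the first packet sent in response to the previously
	 * received packet, so handshake-style exchanges don't over-count.
	 */
	if (!c->counted_this_reply) {
		if (c->pingpong_cnt < TCP_PINGPONG_THRESH)
			c->pingpong_cnt++;
		c->counted_this_reply = true;
	}
}

static bool in_pingpong_mode(const struct conn *c)
{
	/* Only after 3 interactive transactions is the session treated
	 * as interactive, activating delayed acks in quick ack mode.
	 */
	return c->pingpong_cnt >= TCP_PINGPONG_THRESH;
}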
2019-01-27 | tcp: Refactor pingpong code | Wei Wang | 1 | -2/+2
Instead of using pingpong as a single bit of information, we refactor the code to treat it as a counter. When an interactive session is detected, we set the pingpong count to TCP_PINGPONG_THRESH, and when the pingpong count is >= TCP_PINGPONG_THRESH, we consider the session to be in pingpong mode. This patch is a pure refactor and sets the foundation for the next patch; it does not change any pingpong logic. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 | tcp: allow zerocopy with fastopen | Willem de Bruijn | 1 | -0/+1
Accept MSG_ZEROCOPY in all the TCP states that allow sendmsg, and remove the explicit check for the ESTABLISHED and CLOSE_WAIT states. This requires correctly handling zerocopy state (uarg, sk_zckey) in all paths reachable from other TCP states, such as the EPIPE case in sk_stream_wait_connect(), which a sendmsg() in an incorrect state will now hit. Most paths are already safe. The only extension needed is for TCP Fastopen active open, which can build an skb with data in tcp_send_syn_data(). Pass the uarg along with other fastopen state, so that this skb also generates a zerocopy notification on release. Tested with active and passive TCP fastopen packetdrill scripts at https://github.com/wdebruij/packetdrill/commit/1747eef03d25a2404e8132817d0f1244fd6f129d Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: less aggressive window probing on local congestion | Yuchung Cheng | 1 | -15/+7
Previously, when the sender failed to send an original data packet or window probe due to congestion on the local host (e.g. throttling in the qdisc), it would retry within an RTO or two, up to 500ms. In low-RTT networks such as data centers, the RTO is often far below the default minimum of 200ms, so local host congestion could trigger a retry storm, pouring gas on the fire. Worse yet, the probe counter (icsk_probes_out) is not properly updated, so the aggressive retry may exceed the system limit (15 rounds) until the packet finally slips through. On such rare events, it's wise to retry more conservatively (500ms) and update the stats properly, to reflect these incidents and follow the system limit. Note that this is consistent with the behavior when a keep-alive probe or RTO retry is dropped due to local congestion. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: always set retrans_stamp on recovery | Yuchung Cheng | 1 | -5/+4
Previously a TCP socket's retrans_stamp was not set if the retransmission failed to send. As a result, if a socket is experiencing local issues retransmitting packets, determining when to abort it is complicated without knowing the starting time of the recovery, since retrans_stamp may remain zero. This complication causes the sub-optimal behavior that TCP may use the latest, instead of the first, retransmission time to compute the elapsed time of a connection stalling due to local issues. TCP may then disregard the TCP retries settings and keep retrying until it finally succeeds: not a good idea when the local host is already strained. The simple fix is to always timestamp the start of a recovery. It's worth noting that retrans_stamp is also used to compare echo timestamp values to detect spurious recovery. This patch does not break that, because retrans_stamp is still later than when the original packet was sent. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: always timestamp on every skb transmission | Yuchung Cheng | 1 | -8/+8
Previously TCP skbs were not always timestamped if the transmission failed due to memory or other local issues. This makes deciding when to abort a socket tricky and complicated, because the first unacknowledged skb's timestamp may be 0 on TCP timeout. The straightforward fix is to always timestamp the skb on every transmission attempt. Also, every skb retransmission needs to be flagged properly to avoid RTT under-estimation, which can happen upon receiving an ACK for the original packet when a previous (spurious) retransmission has failed. It's worth noting that this reverts to the old time-stamping style used before commit 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully"), which addressed a problem in computing the elapsed time of a stalled window-probing socket. That problem will be addressed differently in the next patches, with a simpler approach. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-10 | tcp: handle EOR and FIN conditions the same in tcp_tso_should_defer() | Eric Dumazet | 1 | -5/+2
In commit f9bfe4e6a9d0 ("tcp: lack of available data can also cause TSO defer") we moved the test in tcp_tso_should_defer() for packets with a FIN flag, and we mentioned that the same would be done later for EOR flag. Both flags should be handled at the same time, after all other heuristics have been considered. They both mean that no more bytes can be added to this skb by an application. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -13/+32
Several conflicts, seemingly all over the place. I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him. The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while changing the initial argument in the function call in the moved code. The net/dsa/master.c conflict had to do with a bug fix that intermixed making dsa_master_set_mtu() static with fixing the tagging attribute location. cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classification. __set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-) Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes to how the sdif and caller_net are calculated in these code paths in net-next. The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structures where the MD pointer macros are used. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 | tcp: lack of available data can also cause TSO defer | Eric Dumazet | 1 | -11/+24
tcp_tso_should_defer() can return true in three different cases: 1) we are cwnd-limited; 2) we are rwnd-limited; 3) we are application-limited. Neal pointed out that my recent fix went too far, since it assumed that if we were not in case 1), we must be rwnd-limited. Fix this by properly populating the is_cwnd_limited and is_rwnd_limited booleans. After this change, we can finally perform the silly check for the FIN flag only in the application-limited case. The same move for the EOR bit will be handled in net-next, since commit 1c09f7d073b1 ("tcp: do not try to defer skbs with eor mark (MSG_EOR)") is scheduled for linux-4.21. Tested by running 200 concurrent "netperf -t TCP_RR -- -r 60000,100" and checking that none of them was rwnd_limited in the chrono_stat output of the "ss -ti" command. Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited") Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 | tcp: fix NULL ref in tail loss probe | Yuchung Cheng | 1 | -4/+7
The TCP loss probe timer may fire when the retransmission queue is empty but the socket has a non-zero tp->packets_out counter. tcp_send_loss_probe() will then call tcp_rearm_rto(), which triggers a NULL pointer dereference by fetching the retransmission queue head in its sub-routines. Add a more detailed warning to help catch the root cause of the inflight accounting inconsistency. Reported-by: Rafael Tinoco <rafael.tinoco@linaro.org> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 | tcp: Do not underestimate rwnd_limited | Eric Dumazet | 1 | -1/+4
If available rwnd is too small, tcp_tso_should_defer() can decide it is worth waiting before splitting a TSO packet. This really means we are rwnd limited. Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive window") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 | tcp: fix SNMP under-estimation on failed retransmission | Yuchung Cheng | 1 | -1/+1
Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL did not count TSO/GSO segments properly on failed retransmissions. This patch fixes that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 | tcp: md5: add tcp_md5_needed jump label | Eric Dumazet | 1 | -2/+4
Most Linux hosts never set up TCP MD5 keys. We can avoid a cache line miss (accessing tp->md5sig_info) on RX and TX by using a jump label. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-29 | tcp: remove loop to compute wscale | Eric Dumazet | 1 | -5/+3
We can remove the loop and conditional branches and compute wscale efficiently thanks to ilog2(). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
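The equivalence is easy to check exhaustively in userspace; a self-contained sketch (emulating the kernel's ilog2() with __builtin_clz, and assuming the usual 16-bit window field and a TCP_MAX_WSCALE of 14):

#include <assert.h>

#define TCP_MAX_WSCALE 14

static int wscale_loop(unsigned int space)	/* old code shape */
{
	int wscale = 0;

	while (space > 0xffff && wscale < TCP_MAX_WSCALE) {
		space >>= 1;
		wscale++;
	}
	return wscale;
}

static int ilog2_u32(unsigned int v)		/* floor(log2(v)), v > 0 */
{
	return 31 - __builtin_clz(v);
}

static int wscale_ilog2(unsigned int space)	/* new code shape */
{
	/* Smallest shift that brings space below 2^16. */
	int wscale = ilog2_u32(space) - 15;

	if (wscale < 0)
		wscale = 0;
	if (wscale > TCP_MAX_WSCALE)
		wscale = TCP_MAX_WSCALE;
	return wscale;
}

int main(void)
{
	for (unsigned int s = 1; s < (1u << 24); s++)
		assert(wscale_loop(s) == wscale_ilog2(s));
	return 0;
}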
2018-11-24 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -3/+3
2018-11-21 | tcp: defer SACK compression after DupThresh | Eric Dumazet | 1 | -3/+3
Jean-Louis reported a TCP regression and bisected it to the recent SACK compression. After a loss episode (receiver not able to keep up and dropping packets because its backlog is full), the Linux TCP stack sends a single SACK (DUPACK), and the sender waits a full RTO timer before recovering losses. While RFC 6675 says in section 5, "Algorithm Details":

   (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
       indicating at least three segments have arrived above the current
       cumulative acknowledgment point, which is taken to indicate loss
       -- go to step (4).
   ...
   (4) Invoke fast retransmit and enter loss recovery as follows:

there are old TCP stacks that do not implement this strategy and still count the dupacks before starting fast retransmit. While these stacks probably perform poorly when receivers implement LRO/GRO, we should be a little more gentle to them. This patch makes sure we do not enable SACK compression unless 3 dupacks have been sent since the last rcv_nxt update. Ideally we should even rearm the timer to send one or two more DUPACKs if no more packets are coming, but that will be work aiming for linux-4.21. Many thanks to Jean-Louis for bisecting the issue, providing packet captures and testing this patch. Fixes: 5d9f4262b7ea ("tcp: add SACK compression") Reported-by: Jean-Louis Dupond <jean-louis@dupond.be> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
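A minimal model of the gating rule described above (illustrative names, not the kernel's bookkeeping):

#include <stdbool.h>

#define TCP_FASTRETRANS_THRESH 3	/* classic DupThresh */

struct rx_state {
	unsigned int dup_acks_sent;	/* since the last rcv_nxt update */
};

static void on_rcv_nxt_advance(struct rx_state *s)
{
	s->dup_acks_sent = 0;		/* in-order data arrived: reset */
}

/* Returns true if this SACK may be held back by the compression timer,
 * false if it must go out immediately.
 */
static bool may_compress_sack(struct rx_state *s)
{
	if (s->dup_acks_sent < TCP_FASTRETRANS_THRESH) {
		s->dup_acks_sent++;
		return false;	/* feed DupThresh to legacy stacks */
	}
	return true;		/* beyond DupThresh: safe to compress */
}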
2018-11-11 | tcp: tsq: no longer use limit_output_bytes for paced flows | Eric Dumazet | 1 | -2/+3
FQ pacing guarantees that paced packets queued by one flow do not add head-of-line blocking for other flows. After the TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe, since this maps to 16 skbs at most in qdisc or device queues (or slightly more if some drivers lower {gso_max_segs|size}). We can still queue at most 1 ms worth of traffic (this can be scaled by wifi drivers if they need to).

Tested:
# ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
tx-usecs: 16
tx-frames: 16
# tc qdisc replace dev eth0 root fq
# for f in {1..10}; do netperf -P0 -H lpaa24,6 -o THROUGHPUT; done

Before patch:
27711 26118 27107 27377 27712 27388 27340 27117 27278 27509

After patch:
37434 36949 36658 36998 37711 37291 37605 36659 36544 37349

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 | tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies | Eric Dumazet | 1 | -2/+5
tcp_tso_should_defer()'s first heuristic is to not defer if the last send is "old enough". Its current implementation uses jiffies, with their low granularity; TSO autodefer performance should not rely on kernel HZ :/ After the EDT conversion, we have state variables in nanoseconds that allow us to properly implement the heuristic. This patch increases TSO chunk sizes on medium-rate flows, especially when receivers do not use GRO or similar aggregation. It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP behavior more uniform. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 | tcp: refine tcp_tso_should_defer() after EDT adoption | Eric Dumazet | 1 | -3/+4
The last step of tcp_tso_should_defer() tries to check whether the probable next ACK packet will arrive in less than half an rtt. The problem is that head->tstamp might be in the future, so we need to use signed arithmetic to avoid overflows. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
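The pitfall is easy to demonstrate in plain C (illustrative values; under EDT the head skb's tstamp can legitimately be ahead of the clock):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t now = 1000000;			/* ns */
	uint64_t head_tstamp = 1000500;		/* 500 ns in the future */
	uint64_t half_srtt = 200;		/* ns */

	uint64_t delta = now - head_tstamp;	/* wraps to a huge value */

	/* Unsigned compare: the wrapped delta looks enormous, so the
	 * "ACK is too far away, send now" test never fires and the skb
	 * is wrongly deferred.
	 */
	printf("unsigned: send now? %d\n", delta < half_srtt);	/* 0 */

	/* Signed compare (the fix): the negative delta is seen for what
	 * it is, and the test fires correctly.
	 */
	printf("signed:   send now? %d\n",
	       (int64_t)(delta - half_srtt) < 0);		/* 1 */
	return 0;
}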
2018-11-11 | tcp: do not try to defer skbs with eor mark (MSG_EOR) | Eric Dumazet | 1 | -0/+4
Applications using MSG_EOR are giving a strong hint to the TCP stack: a subsequent sendmsg() can not append more bytes to skbs carrying the EOR mark. Do not try to TSO-defer such skbs; there is really no hope. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-23 | tcp: add tcp_reset_xmit_timer() helper | Eric Dumazet | 1 | -8/+10
With the EDT model, SRTT is no longer inflated by pacing delays. This means that the RTO and some other xmit timers might be set up incorrectly. This is particularly visible with either:
- very small enforced pacing rates (SO_MAX_PACING_RATE)
- a reduced rto (from the default 200 ms)

This can lead to TCP flow aborts in the worst case, or spurious retransmits in other cases. For example, this session gets far more throughput than the requested 80kbit:

$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

540000 262144 262144    104.00      2.66

With the fix:

$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

540000 262144 262144    104.00      0.12

EDT allows for better control of rtx timers, since TCP has a better idea of the earliest departure time of each skb in the rtx queue. We only have to eventually add to the timer the difference between the EDT time and the current time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
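A sketch of the idea behind the helper (simplified, in plain nanoseconds; the kernel also clamps the result against a max_when bound):

#include <stdint.h>

/* Arm the retransmit timer relative to the skb's earliest departure
 * time rather than "now": with EDT, the head skb may not even have
 * left the host yet, so the RTO must start counting from its EDT.
 */
static uint64_t rtx_timer_expiry(uint64_t now_ns, uint64_t head_edt_ns,
				 uint64_t rto_ns)
{
	uint64_t pacing_delay = 0;

	if ((int64_t)(head_edt_ns - now_ns) > 0)	/* departure ahead */
		pacing_delay = head_edt_ns - now_ns;

	return now_ns + pacing_delay + rto_ns;
}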
2018-10-18 | tcp: fix TCP_REPAIR xmit queue setup | Eric Dumazet | 1 | -6/+7
Andrey reported the following warning triggered while running CRIU tests:

tcp_clean_rtx_queue()
...
	last_ackt = tcp_skb_timestamp_us(skb);
	WARN_ON_ONCE(last_ackt == 0);

This is caused by 5f6188a8003d ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), as we end up having skbs in the retransmit queue with a zero skb->skb_mstamp_ns field. We could fix this bug in different ways, like making sure tp->tcp_wstamp_ns is not zero at socket creation, but as Neal pointed out, we also do not want the pacing status of a repaired socket to push tp->tcp_wstamp_ns far ahead into the future. So we prefer changing tcp_write_xmit() to not call tcp_update_skb_after_send() and instead do what the TCP_REPAIR logic requests. Fixes: 5f6188a8003d ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Vagin <avagin@openvz.org> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 | tcp: optimize tcp internal pacing | Eric Dumazet | 1 | -15/+16
When TCP implements its own pacing (when no fq packet scheduler is used), it arms a high-resolution timer after a packet is sent. But in many cases (like TCP_RR kinds of workloads), this high-resolution timer expires before the application attempts to write the following packet. This also happens when the flow is ACK-clocked and cwnd-limited instead of being limited by the pacing rate, and it leads to extra overhead (a high number of IRQs). Now that tcp_wstamp_ns is reserved for the pacing timer only (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), we can set up the timer only when a packet is about to be sent, and only if tcp_wstamp_ns is in the future. This leads to a ~10% performance increase in TCP_RR workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 | tcp: mitigate scheduling jitter in EDT pacing model | Eric Dumazet | 1 | -6/+13
In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts") we added a mitigation for scheduling jitter in the fq packet scheduler. This patch does the same in the TCP stack, now that it is using the EDT model. Note that this mitigation is valid for both external (fq packet scheduler) and internal TCP pacing. It uses the same strategy as the above commit, allowing a time credit of half the packet currently sent.

Consider the following case: an skb is sent after an idle period of 300 usec, and its air-time (skb->len/pacing_rate) is 500 usec. Instead of setting the pacing timer to now+500 usec, it will use now+min(500/2, 300) -> now+250 usec. This is like having a token bucket with a depth of half an skb.

Tested:
tc qdisc replace dev eth0 root pfifo_fast

Before:
netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
540000 262144 262144 10.00 7710.43

After:
netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
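The worked example above, rendered as plain arithmetic (a model of the credit rule, not the kernel's fixed-point code):

#include <stdio.h>

static unsigned long next_fire_us(unsigned long now_us,
				  unsigned long air_time_us,
				  unsigned long idle_us)
{
	unsigned long credit = air_time_us / 2;	/* half the current skb */

	if (credit > idle_us)
		credit = idle_us;	/* can't borrow more than the idle time */

	return now_us + air_time_us - credit;
}

int main(void)
{
	/* air-time 500 usec after a 300 usec idle period:
	 * fires at now + 500 - min(250, 300) = now + 250 usec
	 */
	printf("%lu\n", next_fire_us(0, 500, 300));	/* prints 250 */
	return 0;
}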
2018-10-15 | net: extend sk_pacing_rate to unsigned long | Eric Dumazet | 1 | -8/+11
sk_pacing_rate was introduced as a u32 field in 2013, effectively limiting per-flow pacing to 34Gbit. We believe it is time to allow TCP to pace high-speed flows on 64bit hosts, as we can now reach 100Gbit on one TCP flow. This patch adds no cost for 32bit kernels. tcpi_pacing_rate and tcpi_max_pacing_rate were already exported as 64bit, so the iproute2 ss command requires no changes. Unfortunately the SO_MAX_PACING_RATE socket option will stay 32bit, and we will need to add a new option to let applications control high pacing rates.

State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0     1787144 10.246.9.76:49992   10.246.9.77:36741
        timer:(on,003ms,0) ino:91863 sk:2 <->
        skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
        ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448 rcvmss:536
        advmss:1448 cwnd:138 ssthresh:178 bytes_acked:256699822585
        segs_out:177279177 segs_in:3916318 data_segs_out:177279175
        bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
        send 28045.5Mbps lastrcv:73333 pacing_rate 38705.0Mbps
        delivery_rate 22997.6Mbps busy:73333ms unacked:135 retrans:0/157
        rcv_space:14480 notsent:2085120 minrtt:0.013

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 | tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh | Eric Dumazet | 1 | -3/+6
In the EDT design, I made the mistake of using tcp_wstamp_ns to store both the last tcp_clock_ns() sample and the pacing virtual timer. This causes major regressions on high-speed flows. Introduce tcp_clock_cache to store the last tcp_clock_ns(); this is needed because some arches have a slow high-resolution kernel time service. tcp_wstamp_ns is now only updated when a packet is sent. Note that we can remove tcp_mstamp in the future, since tcp_mstamp is essentially tcp_clock_cache/1000, so the apparent socket size increase is temporary. Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-29 | tcp: up initial rmem to 128KB and SYN rwin to around 64KB | Yuchung Cheng | 1 | -21/+4
Previously the TCP initial receive buffer is ~87KB by default and the initial receive window is ~29KB (20 MSS). This patch changes the two numbers to 128KB and ~64KB (rounded down to a multiple of the MSS) respectively. The patch also simplifies the calculations so that the two numbers are directly controlled by sysctl tcp_rmem[1]:

1) Initial receiver buffer budget (sk_rcvbuf): while this should be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf() always overrode it and set a larger size when a new connection was established.

2) Initial receive window in the SYN: previously it was set to 20 packets if MSS <= 1460. The number 20 was based on the initial congestion window of 10: the receiver needs twice that amount to avoid being limited by the receive window upon out-of-order delivery in the first window burst. But since this only applies if the receiving MSS <= 1460, a connection using a large MTU (e.g. to utilize receiver zero-copy) may be limited by the receive window (see the sketch after this entry).

With this patch, TCP memory configuration is more straightforward and more properly sized for modern high-speed networks by default. Several popular stacks have been announcing 64KB rwin in SYNs as well. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
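A hedged sketch of the SYN window rule in 2) (illustrative only; the kernel derives the available space from tcp_rmem[1] minus bookkeeping overhead):

#include <stdint.h>

static uint32_t initial_rcv_wnd(uint32_t space, uint32_t mss)
{
	/* The SYN's window field is unscaled, so cap near 64KB... */
	if (space > 65535U)
		space = 65535U;
	/* ...and round down to a multiple of the MSS (mss > 0 assumed). */
	return space - space % mss;
}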
2018-09-21 | tcp: switch tcp_internal_pacing() to tcp_wstamp_ns | Eric Dumazet | 1 | -13/+4
Now that TCP keeps track of tcp_wstamp_ns, recording the earliest departure time of the next packet, we can remove duplicated code from tcp_internal_pacing(). This removes one ktime_get_tai_ns() call, and a divide. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 | tcp: switch tcp and sch_fq to new earliest departure time model | Eric Dumazet | 1 | -4/+18
TCP now keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer has to do it. Thanks to this model, TCP can get more accurate RTT samples, since pacing no longer inflates them. This has the nice effect of removing some delays caused by the FQ quantum mechanism, which inflated max/P99 latencies. We might also relax the tight TCP Small Queues limits in the future, since this new model allows TCP to build bigger batches: sch_fq (or a device with earliest departure time offload) ensures these packets will be delivered on time. Note that other protocols are not converted (they probably never will be), so sch_fq still has support for SO_MAX_PACING_RATE.

Tested: test showing the FQ pacing quantum artifact for low-rate flows, adding unexpected throttles for RPC flows and inflating max and P99 latencies. The parameters chosen here show what typically happens when a TCP flow has a reduced pacing rate (this can be caused by a reduced cwin after a few losses, or/and an rtt above a few ms).

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"

Before:
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After:
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 | tcp: switch internal pacing timer to CLOCK_TAI | Eric Dumazet | 1 | -1/+1
Next patch will use tcp_wstamp_ns to feed internal TCP pacing timer, so switch to CLOCK_TAI to share same base. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 | tcp: provide earliest departure time in skb->tstamp | Eric Dumazet | 1 | -7/+6
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec units to nsec units. Do not clear skb->tstamp before entering IP stacks in TX, so that qdiscs or devices can implement pacing based on the earliest departure time instead of the socket's sk->sk_pacing_rate. Packets are fed with tcp_wstamp_ns, and a following patch will update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest departure time mechanism. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 | tcp: add tcp_wstamp_ns socket field | Eric Dumazet | 1 | -0/+16
TCP will soon provide the earliest departure time on TX skbs. It needs to track this in a new variable. tcp_mstamp_refresh() needs to update this variable, and has become too big to remain inline. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 | tcp: introduce tcp_skb_timestamp_us() helper | Eric Dumazet | 1 | -1/+1
There are a few places where TCP reads skb->skb_mstamp expecting a value in usec units. skb->tstamp (aka skb->skb_mstamp) will soon store a CLOCK_TAI nsec value. Add tcp_skb_timestamp_us() to provide the proper conversion where needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
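Once the unit switch landed, the helper's shape is most likely this (a sketch based on the description; div_u64() and NSEC_PER_USEC are standard kernel facilities):

static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
{
	/* skb_mstamp(_ns) holds nanoseconds; legacy callers want usec. */
	return div_u64(skb->skb_mstamp_ns, NSEC_PER_USEC);
}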
2018-08-01 | tcp: remove set but not used variable 'skb_size' | Wei Yongjun | 1 | -2/+1
Fixes gcc '-Wunused-but-set-variable' warning:

net/ipv4/tcp_output.c: In function 'tcp_collapse_retrans':
net/ipv4/tcp_output.c:2700:6: warning: variable 'skb_size' set but not used [-Wunused-but-set-variable]
  int skb_size, next_skb_size;
      ^

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 | tcp: add data bytes retransmitted stats | Wei Wang | 1 | -0/+1
Introduce a new TCP stat to record the number of bytes retransmitted (RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 | tcp: add data bytes sent stats | Wei Wang | 1 | -0/+1
Introduce a new TCP stat to record the number of bytes sent (RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24 | Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -8/+24
2018-07-20 | Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux | David S. Miller | 1 | -4/+0
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 | tcp: do not cancel delay-AcK on DCTCP special ACK | Yuchung Cheng | 1 | -3/+7
Currently, when a DCTCP receiver delays an ACK and receives a data packet with a different CE mark from the previous one's, it sends two immediate ACKs acking the previous and latest sequences respectively (for ECN accounting). Sending the first ACK may cancel the delayed ACK timer (tcp_event_ack_sent), which may subsequently prevent sending the second ACK to acknowledge the latest sequence (tcp_ack_snd_check). The culprit is that tcp_send_ack() assumes it always acknowledges the latest sequence, which is not true for the first special ACK. The fix is to not make that assumption in tcp_send_ack() and to check the actual ack sequence before cancelling the delayed ACK. Further, it's safer to pass the ack sequence number as a parameter into the tcp_send_ack() routine, instead of intercepting tp->rcv_nxt, to avoid future bugs like this. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 | tcp: helpers to send special DCTCP ack | Yuchung Cheng | 1 | -5/+17
Refactor and create helpers to send the special ACK in DCTCP. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
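Per this entry and the one above, the refactor's shape is roughly the following (simplified signatures):

/* Ack an explicitly given sequence... */
void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);

/* ...while the common case keeps its old behavior. */
static inline void tcp_send_ack(struct sock *sk)
{
	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
}

DCTCP can then ack the previous sequence with its special CE state without tcp_send_ack() assuming the latest sequence is being acknowledged.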
2018-07-13 | tcp: remove DELAYED ACK events in DCTCP | Yuchung Cheng | 1 | -4/+0
After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK related callbacks are no longer needed. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 | tcp_bbr: fix bbr pacing rate for internal pacing | Eric Dumazet | 1 | -14/+0
This commit makes BBR use only the MSS (without any headers) to calculate pacing rates when internal TCP-layer pacing is used. This is necessary to achieve the correct pacing behavior in this case, since tcp_internal_pacing() uses only the payload length to calculate pacing delays. Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -2/+5
S390 bpf_jit.S is removed in net-next and had changes in 'net'; since that code isn't used any more, take the removal. TLS data structures split the TX and RX components in 'net-next'; put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. For the overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 | tcp: add TCPAckCompressed SNMP counter | Eric Dumazet | 1 | -0/+2
This counter tracks the number of ACK packets that the host has not sent, thanks to ACK compression. Sample output:

$ nstat -n; sleep 1; nstat | egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives                    123250             0.0
IpOutRequests                   3684               0.0
TcpInSegs                       123251             0.0
TcpOutSegs                      3684               0.0
TcpExtTCPAckCompressed          119252             0.0

Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 | tcp: add SACK compression | Eric Dumazet | 1 | -0/+7
When TCP receives an out-of-order packet, it immediately sends a SACK packet, generating network load but also forcing the receiver to send 1-MSS pathological packets, increasing its RTX queue length/depth and thus processing time. Wifi networks suffer from this aggressive behavior, but generally speaking, all these SACK packets add fuel to the fire when networks are under congestion. This patch adds a high-resolution timer and a tp->compressed_ack counter. Instead of sending a SACK, we program this timer with a small delay, based on RTT and capped to 1 ms:

	delay = min(5% of RTT, 1 ms)

If subsequent SACKs need to be sent while the timer has not yet expired, we simply increment tp->compressed_ack. When the timer expires, a SACK is sent with the latest information. Whenever an ACK is sent (if data is sent, or if in-order data is received), the timer is canceled. Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent if the sack blocks need to be shuffled, even if the timer has not expired. A new SNMP counter is added in the following patch. Two other patches add sysctls to allow changing the 1,000,000 and 44 values that this commit hard-coded. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
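The delay rule as plain arithmetic (a userspace model, not the kernel's fixed-point code):

#include <stdio.h>

static unsigned long sack_delay_ns(unsigned long srtt_ns)
{
	unsigned long delay = srtt_ns / 20;	/* 5% of RTT */

	if (delay > 1000000UL)
		delay = 1000000UL;		/* cap at 1 ms */
	return delay;
}

int main(void)
{
	printf("rtt 100us -> %lu ns\n", sack_delay_ns(100000));	/* 5000 */
	printf("rtt 40ms  -> %lu ns\n", sack_delay_ns(40000000)); /* 1000000 */
	return 0;
}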
2018-05-16 | tcp: purge write queue in tcp_connect_init() | Eric Dumazet | 1 | -2/+5
syzkaller found a reliable way to crash the host, hitting a BUG() in __tcp_retransmit_skb(). Malicious MSG_FASTOPEN is the root cause. We need to purge the write queue in tcp_connect_init(), at the point we init snd_una/write_seq. This patch also replaces the BUG() with a less intrusive WARN_ON_ONCE().

kernel BUG at net/ipv4/tcp_output.c:2837!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
FS:  0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
 tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
 tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
 tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d1/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:525 [inline]
 smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863

Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>