aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-03-11net: memcg: fix lockdep splat in inet_csk_accept()Eric Dumazet1-7/+7
Locking newsk while still holding the listener lock triggered a lockdep splat [1] We can simply move the memcg code after we release the listener lock, as this can also help if multiple threads are sharing a common listener. Also fix a typo while reading socket sk_rmem_alloc. [1] WARNING: possible recursive locking detected 5.6.0-rc3-syzkaller #0 Not tainted -------------------------------------------- syz-executor598/9524 is trying to acquire lock: ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 but task is already holding lock: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_INET6); lock(sk_lock-AF_INET6); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by syz-executor598/9524: #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 stack backtrace: CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_deadlock_bug kernel/locking/lockdep.c:2370 [inline] check_deadlock kernel/locking/lockdep.c:2411 [inline] validate_chain kernel/locking/lockdep.c:2954 [inline] __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954 lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484 lock_sock_nested+0xc5/0x110 net/core/sock.c:2947 lock_sock include/net/sock.h:1541 [inline] inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734 __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758 __sys_accept4+0x53/0x90 net/socket.c:1809 __do_sys_accept4 net/socket.c:1821 [inline] __se_sys_accept4 net/socket.c:1818 [inline] __x64_sys_accept4+0x93/0xf0 net/socket.c:1818 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4445c9 Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000 Fixes: d752a4986532 ("net: memcg: late association of sock to memcg") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10net: memcg: late association of sock to memcgShakeel Butt1-0/+20
If a TCP socket is allocated in IRQ context or cloned from unassociated (i.e. not associated to a memcg) in IRQ context then it will remain unassociated for its whole life. Almost half of the TCPs created on the system are created in IRQ context, so, memory used by such sockets will not be accounted by the memcg. This issue is more widespread in cgroup v1 where network memory accounting is opt-in but it can happen in cgroup v2 if the source socket for the cloning was created in root memcg. To fix the issue, just do the association of the sockets at the accept() time in the process context and then force charge the memory buffer already used and reserved by the socket. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08inet_diag: return classid for all socket typesDmitry Yakunin3-28/+26
In commit 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority") croup classid reporting was fixed. But this works only for TCP sockets because for other socket types icsk parameter can be NULL and classid code path is skipped. This change moves classid handling to inet_diag_msg_attrs_fill() function. Also inet_diag_msg_attrs_size() helper was added and addends in nlmsg_new() were reordered to save order from inet_sk_diag_fill(). Fixes: 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority") Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08gre: fix uninit-value in __iptunnel_pull_headerEric Dumazet1-2/+10
syzbot found an interesting case of the kernel reading an uninit-value [1] Problem is in the handling of ETH_P_WCCP in gre_parse_header() We look at the byte following GRE options to eventually decide if the options are four bytes longer. Use skb_header_pointer() to not pull bytes if we found that no more bytes were needed. All callers of gre_parse_header() are properly using pskb_may_pull() anyway before proceeding to next header. [1] BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2303 [inline] BUG: KMSAN: uninit-value in __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94 CPU: 1 PID: 11784 Comm: syz-executor940 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215 pskb_may_pull include/linux/skbuff.h:2303 [inline] __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94 iptunnel_pull_header include/net/ip_tunnels.h:411 [inline] gre_rcv+0x15e/0x19c0 net/ipv6/ip6_gre.c:606 ip6_protocol_deliver_rcu+0x181b/0x22c0 net/ipv6/ip6_input.c:432 ip6_input_finish net/ipv6/ip6_input.c:473 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ip6_input net/ipv6/ip6_input.c:482 [inline] ip6_mc_input+0xdf2/0x1460 net/ipv6/ip6_input.c:576 dst_input include/net/dst.h:442 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:306 __netif_receive_skb_one_core net/core/dev.c:5198 [inline] __netif_receive_skb net/core/dev.c:5312 [inline] netif_receive_skb_internal net/core/dev.c:5402 [inline] netif_receive_skb+0x66b/0xf20 net/core/dev.c:5461 tun_rx_batched include/linux/skbuff.h:4321 [inline] tun_get_user+0x6aef/0x6f60 drivers/net/tun.c:1997 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026 call_write_iter include/linux/fs.h:1901 [inline] new_sync_write fs/read_write.c:483 [inline] __vfs_write+0xa5a/0xca0 fs/read_write.c:496 vfs_write+0x44a/0x8f0 fs/read_write.c:558 ksys_write+0x267/0x450 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __ia32_sys_write+0xdb/0x120 fs/read_write.c:620 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline] do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7f62d99 Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000fffedb2c EFLAGS: 00000217 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000020002580 RDX: 0000000000000fca RSI: 0000000000000036 RDI: 0000000000000004 RBP: 0000000000008914 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline] kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82 slab_alloc_node mm/slub.c:2793 [inline] __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4401 __kmalloc_reserve net/core/skbuff.c:142 [inline] __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210 alloc_skb include/linux/skbuff.h:1051 [inline] alloc_skb_with_frags+0x18c/0xa70 net/core/skbuff.c:5766 sock_alloc_send_pskb+0xada/0xc60 net/core/sock.c:2242 tun_alloc_skb drivers/net/tun.c:1529 [inline] tun_get_user+0x10ae/0x6f60 drivers/net/tun.c:1843 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026 call_write_iter include/linux/fs.h:1901 [inline] new_sync_write fs/read_write.c:483 [inline] __vfs_write+0xa5a/0xca0 fs/read_write.c:496 vfs_write+0x44a/0x8f0 fs/read_write.c:558 ksys_write+0x267/0x450 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __ia32_sys_write+0xdb/0x120 fs/read_write.c:620 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline] do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139 Fixes: 95f5c64c3c13 ("gre: Move utility functions to common headers") Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23tcp: fix TFO SYNACK undo to avoid double-timestamp-undoNeal Cardwell1-1/+5
In a rare corner case the new logic for undo of SYNACK RTO could result in triggering the warning in tcp_fastretrans_alert() that says: WARN_ON(tp->retrans_out != 0); The warning looked like: WARNING: CPU: 1 PID: 1 at net/ipv4/tcp_input.c:2818 tcp_ack+0x13e0/0x3270 The sequence that tickles this bug is: - Fast Open server receives TFO SYN with data, sends SYNACK - (client receives SYNACK and sends ACK, but ACK is lost) - server app sends some data packets - (N of the first data packets are lost) - server receives client ACK that has a TS ECR matching first SYNACK, and also SACKs suggesting the first N data packets were lost - server performs TS undo of SYNACK RTO, then immediately enters recovery - buggy behavior then performed a *second* undo that caused the connection to be in CA_Open with retrans_out != 0 Basically, the incoming ACK packet with SACK blocks causes us to first undo the cwnd reduction from the SYNACK RTO, but then immediately enters fast recovery, which then makes us eligible for undo again. And then tcp_rcv_synrecv_state_fastopen() accidentally performs an undo using a "mash-up" of state from two different loss recovery phases: it uses the timestamp info from the ACK of the original SYNACK, and the undo_marker from the fast recovery. This fix refines the logic to only invoke the tcp_try_undo_loss() inside tcp_rcv_synrecv_state_fastopen() if the connection is still in CA_Loss. If peer SACKs triggered fast recovery, then tcp_rcv_synrecv_state_fastopen() can't safely undo. Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-22ipv4: ensure rcu_read_lock() in cipso_v4_error()Matteo Croce1-1/+6
Similarly to commit c543cb4a5f07 ("ipv4: ensure rcu_read_lock() in ipv4_link_failure()"), __ip_options_compile() must be called under rcu protection. Fixes: 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matteo Croce <mcroce@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19udp: rehash on disconnectWillem de Bruijn1-1/+5
As of the below commit, udp sockets bound to a specific address can coexist with one bound to the any addr for the same port. The commit also phased out the use of socket hashing based only on port (hslot), in favor of always hashing on {addr, port} (hslot2). The change broke the following behavior with disconnect (AF_UNSPEC): server binds to 0.0.0.0:1337 server connects to 127.0.0.1:80 server disconnects client connects to 127.0.0.1:1337 client sends "hello" server reads "hello" // times out, packet did not find sk On connect the server acquires a specific source addr suitable for routing to its destination. On disconnect it reverts to the any addr. The connect call triggers a rehash to a different hslot2. On disconnect, add the same to return to the original hslot2. Skip this step if the socket is going to be unhashed completely. Fixes: 4cdeeee9252a ("net: udp: prefer listeners bound to an address") Reported-by: Pavel Roskin <plroskin@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13icmp: introduce helper for nat'd source address in network device contextJason A. Donenfeld1-0/+33
This introduces a helper function to be called only by network drivers that wraps calls to icmp[v6]_send in a conntrack transformation, in case NAT has been used. We don't want to pollute the non-driver path, though, so we introduce this as a helper to be called by places that actually make use of this, as suggested by Florian. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2-1/+13
Pull networking fixes from David Miller: 1) Use after free in rxrpc_put_local(), from David Howells. 2) Fix 64-bit division error in mlxsw, from Nathan Chancellor. 3) Make sure we clear various bits of TCP state in response to tcp_disconnect(). From Eric Dumazet. 4) Fix netlink attribute policy in cls_rsvp, from Eric Dumazet. 5) txtimer must be deleted in stmmac suspend(), from Nicolin Chen. 6) Fix TC queue mapping in bnxt_en driver, from Michael Chan. 7) Various netdevsim fixes from Taehee Yoo (use of uninitialized data, snapshot panics, stack out of bounds, etc.) 8) cls_tcindex changes hash table size after allocating the table, fix from Cong Wang. 9) Fix regression in the enforcement of session ID uniqueness in l2tp. We only have to enforce uniqueness for IP based tunnels not UDP ones. From Ridge Kennedy. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits) gtp: use __GFP_NOWARN to avoid memalloc warning l2tp: Allow duplicate session creation with UDP r8152: Add MAC passthrough support to new device net_sched: fix an OOB access in cls_tcindex qed: Remove set but not used variable 'p_link' tc-testing: add missing 'nsPlugin' to basic.json tc-testing: fix eBPF tests failure on linux fresh clones net: hsr: fix possible NULL deref in hsr_handle_frame() netdevsim: remove unused sdev code netdevsim: use __GFP_NOWARN to avoid memalloc warning netdevsim: use IS_ERR instead of IS_ERR_OR_NULL for debugfs netdevsim: fix stack-out-of-bounds in nsim_dev_debugfs_init() netdevsim: fix panic in nsim_dev_take_snapshot_write() netdevsim: disable devlink reload when resources are being used netdevsim: fix using uninitialized resources bnxt_en: Fix TC queue mapping. bnxt_en: Fix logic that disables Bus Master during firmware reset. bnxt_en: Fix RDMA driver failure with SRIOV after firmware reset. bnxt_en: Refactor logic to re-enable SRIOV after firmware reset detected. net: stmmac: Delete txtimer in suspend() ...
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan3-25/+25
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-02tcp: Reduce SYN resend delay if a suspicous ACK is receivedSeongJae Park1-1/+7
When closing a connection, the two acks that required to change closing socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in reverse order. This is possible in RSS disabled environments such as a connection inside a host. For example, expected state transitions and required packets for the disconnection will be similar to below flow. 00 (Process A) (Process B) 01 ESTABLISHED ESTABLISHED 02 close() 03 FIN_WAIT_1 04 ---FIN--> 05 CLOSE_WAIT 06 <--ACK--- 07 FIN_WAIT_2 08 <--FIN/ACK--- 09 TIME_WAIT 10 ---ACK--> 11 LAST_ACK 12 CLOSED CLOSED In some cases such as LINGER option applied socket, the FIN and FIN/ACK will be substituted to RST and RST/ACK, but there is no difference in the main logic. The acks in lines 6 and 8 are the acks. If the line 8 packet is processed before the line 6 packet, it will be just ignored as it is not a expected packet, and the later process of the line 6 packet will change the status of Process A to FIN_WAIT_2, but as it has already handled line 8 packet, it will not go to TIME_WAIT and thus will not send the line 10 packet to Process B. Thus, Process B will left in CLOSE_WAIT status, as below. 00 (Process A) (Process B) 01 ESTABLISHED ESTABLISHED 02 close() 03 FIN_WAIT_1 04 ---FIN--> 05 CLOSE_WAIT 06 (<--ACK---) 07 (<--FIN/ACK---) 08 (fired in right order) 09 <--FIN/ACK--- 10 <--ACK--- 11 (processed in reverse order) 12 FIN_WAIT_2 Later, if the Process B sends SYN to Process A for reconnection using the same port, Process A will responds with an ACK for the last flow, which has no increased sequence number. Thus, Process A will send RST, wait for TIMEOUT_INIT (one second in default), and then try reconnection. If reconnections are frequent, the one second latency spikes can be a big problem. Below is a tcpdump results of the problem: 14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644 14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5, win 512 14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq 2541101298 /* ONE SECOND DELAY */ 15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644 This commit mitigates the problem by reducing the delay for the next SYN if the suspicous ACK is received while in SYN_SENT state. Following commit will add a selftest, which can be also helpful for understanding of this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: SeongJae Park <sjpark@amazon.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31tcp: clear tp->segs_{in|out} in tcp_disconnect()Eric Dumazet1-0/+2
tp->segs_in and tp->segs_out need to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: 2efd055c53c0 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31tcp: clear tp->data_segs{in|out} in tcp_disconnect()Eric Dumazet1-0/+2
tp->data_segs_in and tp->data_segs_out need to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: a44d6eacdaf5 ("tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31tcp: clear tp->delivered in tcp_disconnect()Eric Dumazet1-0/+1
tp->delivered needs to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: ddf1af6fa00e ("tcp: new delivery accounting") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31tcp: clear tp->total_retrans in tcp_disconnect()Eric Dumazet1-0/+1
total_retrans needs to be cleared in tcp_disconnect(). tcp_disconnect() is rarely used, but it is worth fixing it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-29mptcp: handle tcp fallback when using syn cookiesFlorian Westphal2-0/+7
We can't deal with syncookie mode yet, the syncookie rx path will create tcp reqsk, i.e. we get OOB access because we treat tcp reqsk as mptcp reqsk one: TCP: SYN flooding on port 20002. Sending cookies. BUG: KASAN: slab-out-of-bounds in subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191 Read of size 1 at addr ffff8881167bc148 by task syz-executor099/2120 subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191 tcp_get_cookie_sock+0xcf/0x520 net/ipv4/syncookies.c:209 cookie_v6_check+0x15a5/0x1e90 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1123 [inline] [..] Bug can be reproduced via "sysctl net.ipv4.tcp_syncookies=2". Note that MPTCP should work with syncookies (4th ack would carry needed state), but it appears better to sort that out in -next so do tcp fallback for now. I removed the MPTCP ifdef for tcp_rsk "is_mptcp" member because if (IS_ENABLED()) is easier to read than "#ifdef IS_ENABLED()/#endif" pair. Cc: Eric Dumazet <edumazet@google.com> Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections") Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27udp: Support UDP fraglist GRO/GSO.Steffen Klassert1-23/+81
This patch extends UDP GRO to support fraglist GRO/GSO by using the previously introduced infrastructure. If the feature is enabled, all UDP packets are going to fraglist GRO (local input and forward). After validating the csum, we mark ip_summed as CHECKSUM_UNNECESSARY for fraglist GRO packets to make sure that the csum is not touched. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27net: include struct nhmsg size in nh nlmsg sizeStephen Worley1-1/+3
Include the size of struct nhmsg size when calculating how much of a payload to allocate in a new netlink nexthop notification message. Without this, we will fail to fill the skbuff at certain nexthop group sizes. You can reproduce the failure with the following iproute2 commands: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link add dummy3 type dummy ip link add dummy4 type dummy ip link add dummy5 type dummy ip link add dummy6 type dummy ip link add dummy7 type dummy ip link add dummy8 type dummy ip link add dummy9 type dummy ip link add dummy10 type dummy ip link add dummy11 type dummy ip link add dummy12 type dummy ip link add dummy13 type dummy ip link add dummy14 type dummy ip link add dummy15 type dummy ip link add dummy16 type dummy ip link add dummy17 type dummy ip link add dummy18 type dummy ip link add dummy19 type dummy ip ro add 1.1.1.1/32 dev dummy1 ip ro add 1.1.1.2/32 dev dummy2 ip ro add 1.1.1.3/32 dev dummy3 ip ro add 1.1.1.4/32 dev dummy4 ip ro add 1.1.1.5/32 dev dummy5 ip ro add 1.1.1.6/32 dev dummy6 ip ro add 1.1.1.7/32 dev dummy7 ip ro add 1.1.1.8/32 dev dummy8 ip ro add 1.1.1.9/32 dev dummy9 ip ro add 1.1.1.10/32 dev dummy10 ip ro add 1.1.1.11/32 dev dummy11 ip ro add 1.1.1.12/32 dev dummy12 ip ro add 1.1.1.13/32 dev dummy13 ip ro add 1.1.1.14/32 dev dummy14 ip ro add 1.1.1.15/32 dev dummy15 ip ro add 1.1.1.16/32 dev dummy16 ip ro add 1.1.1.17/32 dev dummy17 ip ro add 1.1.1.18/32 dev dummy18 ip ro add 1.1.1.19/32 dev dummy19 ip next add id 1 via 1.1.1.1 dev dummy1 ip next add id 2 via 1.1.1.2 dev dummy2 ip next add id 3 via 1.1.1.3 dev dummy3 ip next add id 4 via 1.1.1.4 dev dummy4 ip next add id 5 via 1.1.1.5 dev dummy5 ip next add id 6 via 1.1.1.6 dev dummy6 ip next add id 7 via 1.1.1.7 dev dummy7 ip next add id 8 via 1.1.1.8 dev dummy8 ip next add id 9 via 1.1.1.9 dev dummy9 ip next add id 10 via 1.1.1.10 dev dummy10 ip next add id 11 via 1.1.1.11 dev dummy11 ip next add id 12 via 1.1.1.12 dev dummy12 ip next add id 13 via 1.1.1.13 dev dummy13 ip next add id 14 via 1.1.1.14 dev dummy14 ip next add id 15 via 1.1.1.15 dev dummy15 ip next add id 16 via 1.1.1.16 dev dummy16 ip next add id 17 via 1.1.1.17 dev dummy17 ip next add id 18 via 1.1.1.18 dev dummy18 ip next add id 19 via 1.1.1.19 dev dummy19 ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19 ip next del id 1111 Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26tcp: export count for rehash attemptsAbdul Kabbani4-1/+13
Using IPv6 flow-label to swiftly route around avoid congested or disconnected network path can greatly improve TCP reliability. This patch adds SNMP counters and a OPT_STATS counter to track both host-level and connection-level statistics. Network administrators can use these counters to evaluate the impact of this new ability better. Export count for rehash attempts to 1) two SNMP counters: TcpTimeoutRehash (rehash due to timeouts), and TcpDuplicateDataRehash (rehash due to receiving duplicate packets) 2) Timestamping API SOF_TIMESTAMPING_OPT_STATS. Signed-off-by: Abdul Kabbani <akabbani@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller10-11/+23
Minor conflict in mlx5 because changes happened to code that has moved meanwhile. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: parse and emit MP_CAPABLE option according to v1 specChristoph Paasch2-2/+2
This implements MP_CAPABLE options parsing and writing according to RFC 6824 bis / RFC 8684: MPTCP v1. Local key is sent on syn/ack, and both keys are sent on 3rd ack. MP_CAPABLE messages len are updated accordingly. We need the skbuff to correctly emit the above, so we push the skbuff struct as an argument all the way from tcp code to the relevant mptcp callbacks. When processing incoming MP_CAPABLE + data, build a full blown DSS-like map info, to simplify later processing. On child socket creation, we need to record the remote key, if available. Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Implement MPTCP receive pathMat Martineau1-1/+7
Parses incoming DSS options and populates outgoing MPTCP ACK fields. MPTCP fields are parsed from the TCP option header and placed in an skb extension, allowing the upper MPTCP layer to access MPTCP options after the skb has gone through the TCP stack. The subflow implements its own data_ready() ops, which ensures that the pending data is in sequence - according to MPTCP seq number - dropping out-of-seq skbs. The DATA_READY bit flag is set if this is the case. This allows the MPTCP socket layer to determine if more data is available without having to consult the individual subflows. It additionally validates the current mapping and propagates EoF events to the connection socket. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Handle MP_CAPABLE options for outgoing connectionsPeter Krystad2-0/+50
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request, to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to the final ACK of the three-way handshake. Use the .sk_rx_dst_set() handler in the subflow proto to capture when the responding SYN-ACK is received and notify the MPTCP connection layer. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Handle MPTCP TCP optionsPeter Krystad2-0/+18
Add hooks to parse and format the MP_CAPABLE option. This option is handled according to MPTCP version 0 (RFC6824). MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in coordination with related code changes. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24mptcp: Add MPTCP socket stubsMat Martineau1-0/+2
Implements the infrastructure for MPTCP sockets. MPTCP sockets open one in-kernel TCP socket per subflow. These subflow sockets are only managed by the MPTCP socket that owns them and are not visible from userspace. This commit allows a userspace program to open an MPTCP socket with: sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); The resulting socket is simply a wrapper around a single regular TCP socket, without any of the MPTCP protocol implemented over the wire. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24rt_cpu_seq_next should increase position indexVasily Averin1-0/+1
if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24tcp: do not leave dangling pointers in tp->highest_sackEric Dumazet3-0/+3
Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") apparently allowed syzbot to trigger various crashes in TCP stack [1] I believe this commit only made things easier for syzbot to find its way into triggering use-after-frees. But really the bugs could lead to bad TCP behavior or even plain crashes even for non malicious peers. I have audited all calls to tcp_rtx_queue_unlink() and tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated if we are removing from rtx queue the skb that tp->highest_sack points to. These updates were missing in three locations : 1) tcp_clean_rtx_queue() [This one seems quite serious, I have no idea why this was not caught earlier] 2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations] 3) tcp_send_synack() [Probably not a big deal for normal operations] [1] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline] BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16 CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134 tcp_highest_sack_seq include/net/tcp.h:1864 [inline] tcp_highest_sack_seq include/net/tcp.h:1856 [inline] tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline] tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847 tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710 tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706 tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619 tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001 ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538 __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262 process_backlog+0x206/0x750 net/core/dev.c:6093 napi_poll net/core/dev.c:6530 [inline] net_rx_action+0x508/0x1120 net/core/dev.c:6598 __do_softirq+0x262/0x98c kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:603 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 10091: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:513 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521 slab_post_alloc_hook mm/slab.h:584 [inline] slab_alloc_node mm/slab.c:3263 [inline] kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198 alloc_skb_fclone include/linux/skbuff.h:1099 [inline] sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline] sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852 tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:672 __sys_sendto+0x262/0x380 net/socket.c:1998 __do_sys_sendto net/socket.c:2010 [inline] __se_sys_sendto net/socket.c:2006 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 10095: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:335 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483 __cache_free mm/slab.c:3426 [inline] kmem_cache_free+0x86/0x320 mm/slab.c:3694 kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645 __kfree_skb+0x1e/0x30 net/core/skbuff.c:681 sk_eat_skb include/net/sock.h:2453 [inline] tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166 inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:886 [inline] sock_recvmsg net/socket.c:904 [inline] sock_recvmsg+0xce/0x110 net/socket.c:900 __sys_recvfrom+0x1ff/0x350 net/socket.c:2055 __do_sys_recvfrom net/socket.c:2073 [inline] __se_sys_recvfrom net/socket.c:2069 [inline] __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8880a488d040 which belongs to the cache skbuff_fclone_cache of size 456 The buggy address is located 40 bytes inside of 456-byte region [ffff8880a488d040, ffff8880a488d208) The buggy address belongs to the page: page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0 raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000 raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") Fixes: 50895b9de1d3 ("tcp: highest_sack fix") Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cambda Zhu <cambda@linux.alibaba.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23fou: Fix IPv6 netlink policyKristian Evensen1-2/+2
When submitting v2 of "fou: Support binding FoU socket" (1713cb37bf67), I accidentally sent the wrong version of the patch and one fix was missing. In the initial version of the patch, as well as the version 2 that I submitted, I incorrectly used ".type" for the two V6-attributes. The correct is to use ".len". Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 1713cb37bf67 ("fou: Support binding FoU socket") Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller6-14/+272
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-01-22 The following pull-request contains BPF updates for your *net-next* tree. We've added 92 non-merge commits during the last 16 day(s) which contain a total of 320 files changed, 7532 insertions(+), 1448 deletions(-). The main changes are: 1) function by function verification and program extensions from Alexei. 2) massive cleanup of selftests/bpf from Toke and Andrii. 3) batched bpf map operations from Brian and Yonghong. 4) tcp congestion control in bpf from Martin. 5) bulking for non-map xdp_redirect form Toke. 6) bpf_send_signal_thread helper from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22Revert "udp: do rmem bulk free even if the rx sk queue is empty"Paolo Abeni1-1/+2
This reverts commit 0d4a6608f68c7532dcbfec2ea1150c9761767d03. Williem reported that after commit 0d4a6608f68c ("udp: do rmem bulk free even if the rx sk queue is empty") the memory allocated by an almost idle system with many UDP sockets can grow a lot. For stable kernel keep the solution as simple as possible and revert the offending commit. Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Diagnosed-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 0d4a6608f68c ("udp: do rmem bulk free even if the rx sk queue is empty") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21net, ip_tunnel: fix namespaces moveWilliam Dauchy1-3/+1
in the same manner as commit 690afc165bb3 ("net: ip6_gre: fix moving ip6gre between namespaces"), fix namespace moving as it was broken since commit 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata."). Indeed, the ip6_gre commit removed the local flag for collect_md condition, so there is no reason to keep it for ip_gre/ip_tunnel. this patch will fix both ip_tunnel and ip_gre modules. Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: William Dauchy <w.dauchy@criteo.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-nextDavid S. Miller4-26/+261
Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-01-21 1) Add support for TCP encapsulation of IKE and ESP messages, as defined by RFC 8229. Patchset from Sabrina Dubroca. Please note that there is a merge conflict in: net/unix/af_unix.c between commit: 3c32da19a858 ("unix: Show number of pending scm files of receive queue in fdinfo") from the net-next tree and commit: b50b0580d27b ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram") from the ipsec-next tree. The conflict can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21tcp/ipv4: remove AF_INET_FAMILYAlex Shi1-6/+0
After commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") the macro isn't used anymore. remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21tcp: remove redundant assigment to snd_cwndTheodore Dubois1-1/+0
Not sure how this got in here. git blame says the second assignment was added in 3a9a57f6, but that commit also removed the first assignment. Signed-off-by: Theodore Dubois <tblodt@icloud.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21tcp_bbr: improve arithmetic division in bbr_update_bw()Wen Yang1-2/+1
do_div() does a 64-by-32 division. Use div64_long() instead of it if the divisor is long, to avoid truncation to 32-bit. And as a nice side effect also cleans up the function a bit. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecDavid S. Miller2-2/+13
Steffen Klassert says: ==================== pull request (net): ipsec 2020-01-21 1) Fix packet tx through bpf_redirect() for xfrm and vti interfaces. From Nicolas Dichtel. 2) Do not confirm neighbor when do pmtu update on a virtual xfrm interface. From Xu Wang. 3) Support output_mark for offload ESP packets, this was forgotten when the output_mark was added initially. From Ulrich Weber. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller5-24/+31
2020-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-9/+10
Pablo Neira Ayuso says: ==================== Netfilter updates for net The following patchset contains Netfilter fixes for net: 1) Fix use-after-free in ipset bitmap destroy path, from Cong Wang. 2) Missing init netns in entry cleanup path of arp_tables, from Florian Westphal. 3) Fix WARN_ON in set destroy path due to missing cleanup on transaction error. 4) Incorrect netlink sanity check in tunnel, from Florian Westphal. 5) Missing sanity check for erspan version netlink attribute, also from Florian. 6) Remove WARN in nft_request_module() that can be triggered from userspace, from Florian Westphal. 7) Memleak in NFTA_HOOK_DEVS netlink parser, from Dan Carpenter. 8) List poison from commit path for flowtables that are added and deleted in the same batch, from Florian Westphal. 9) Fix NAT ICMP packet corruption, from Eyal Birger. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-12/+11
Daniel Borkmann says: ==================== pull-request: bpf 2020-01-15 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 9 day(s) which contain a total of 13 files changed, 95 insertions(+), 43 deletions(-). The main changes are: 1) Fix refcount leak for TCP time wait and request sockets for socket lookup related BPF helpers, from Lorenz Bauer. 2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann. 3) Batch of several sockmap and related TLS fixes found while operating more complex BPF programs with Cilium and OpenSSL, from John Fastabend. 4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue() to avoid purging data upon teardown, from Lingpeng Chen. 5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly dump a BPF map's value with BTF, from Martin KaFai Lau. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15bpf: Sockmap/tls, fix pop data with SK_DROP return codeJohn Fastabend1-4/+1
When user returns SK_DROP we need to reset the number of copied bytes to indicate to the user the bytes were dropped and not sent. If we don't reset the copied arg sendmsg will return as if those bytes were copied giving the user a positive return value. This works as expected today except in the case where the user also pops bytes. In the pop case the sg.size is reduced but we don't correctly account for this when copied bytes is reset. The popped bytes are not accounted for and we return a small positive value potentially confusing the user. The reason this happens is due to a typo where we do the wrong comparison when accounting for pop bytes. In this fix notice the if/else is not needed and that we have a similar problem if we push data except its not visible to the user because if delta is larger the sg.size we return a negative value so it appears as an error regardless. Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
2020-01-15bpf: Sockmap/tls, push write_space updates through ulp updatesJohn Fastabend1-2/+4
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state and call tcp_update_ulp() to push updates to TLS ULP on top. However, we don't push the write_space callback up and instead simply overwrite the op with the psock stored previous op. This may or may not be correct so to ensure we don't overwrite the TLS write space hook pass this field to the ULP and have it fixup the ctx. This completes a previous fix that pushed the ops through to the ULP but at the time missed doing this for write_space, presumably because write_space TLS hook was added around the same time. Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
2020-01-15tcp: fix marked lost packets not being retransmittedPengcheng Yang1-3/+4
When the packet pointed to by retransmit_skb_hint is unlinked by ACK, retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue(). If packet loss is detected at this time, retransmit_skb_hint will be set to point to the current packet loss in tcp_verify_retransmit_hint(), then the packets that were previously marked lost but not retransmitted due to the restriction of cwnd will be skipped and cannot be retransmitted. To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can be reset only after all marked lost packets are retransmitted (retrans_out >= lost_out), otherwise we need to traverse from tcp_rtx_queue_head in tcp_xmit_retransmit_queue(). Packetdrill to demonstrate: // Disable RACK and set max_reordering to keep things simple 0 `sysctl -q net.ipv4.tcp_recovery=0` +0 `sysctl -q net.ipv4.tcp_max_reordering=3` // Establish a connection +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.01 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 8 data segments +0 write(4, ..., 8000) = 8000 +0 > P. 1:8001(8000) ack 1 // Enter recovery and 1:3001 is marked lost +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop> +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop> +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop> // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001 +0 > . 1:1001(1000) ack 1 // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop> // Now retransmit_skb_hint points to 4001:5001 which is now marked lost // BUG: 2001:3001 was not retransmitted +0 > . 2001:3001(1000) ack 1 Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15xfrm: support output_mark for offload ESP packetsUlrich Weber1-0/+2
Commit 9b42c1f179a6 ("xfrm: Extend the output_mark") added output_mark support but missed ESP offload support. xfrm_smark_get() is not called within xfrm_input() for packets coming from esp4_gro_receive() or esp6_gro_receive(). Therefore call xfrm_smark_get() directly within these functions. Fixes: 9b42c1f179a6 ("xfrm: Extend the output_mark to support input direction and masking.") Signed-off-by: Ulrich Weber <ulrich.weber@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14ipv4: Add "offload" and "trap" indications to routesIdo Schimmel4-0/+81
When performing L3 offload, routes and nexthops are usually programmed into two different tables in the underlying device. Therefore, the fact that a nexthop resides in hardware does not necessarily mean that all the associated routes also reside in hardware and vice-versa. While the kernel can signal to user space the presence of a nexthop in hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag for routes. In addition, the fact that a route resides in hardware does not necessarily mean that the traffic is offloaded. For example, unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap packets to the CPU so that the kernel will be able to generate the appropriate ICMP error packet. This patch adds an "offload" and "trap" indications to IPv4 routes, so that users will have better visibility into the offload process. 'struct fib_alias' is extended with two new fields that indicate if the route resides in hardware or not and if it is offloading traffic from the kernel or trapping packets to it. Note that the new fields are added in the 6 bytes hole and therefore the struct still fits in a single cache line [1]. Capable drivers are expected to invoke fib_alias_hw_flags_set() with the route's key in order to set the flags. The indications are dumped to user space via a new flags (i.e., 'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the ancillary header. v2: * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set() [1] struct fib_alias { struct hlist_node fa_list; /* 0 16 */ struct fib_info * fa_info; /* 16 8 */ u8 fa_tos; /* 24 1 */ u8 fa_type; /* 25 1 */ u8 fa_state; /* 26 1 */ u8 fa_slen; /* 27 1 */ u32 tb_id; /* 28 4 */ s16 fa_default; /* 32 2 */ u8 offload:1; /* 34: 0 1 */ u8 trap:1; /* 34: 1 1 */ u8 unused:6; /* 34: 2 1 */ /* XXX 5 bytes hole, try to pack */ struct callback_head rcu __attribute__((__aligned__(8))); /* 40 16 */ /* size: 56, cachelines: 1, members: 12 */ /* sum members: 50, holes: 1, sum holes: 5 */ /* sum bitfield members: 8 bits (1 bytes) */ /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */ /* last cacheline: 56 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14ipv4: Encapsulate function arguments in a structIdo Schimmel4-21/+36
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages using the passed arguments. Currently, the function takes 11 arguments, 6 of which are attributes of the route being dumped (e.g., prefix, TOS). The next patch will need the function to also dump to user space an indication if the route is present in hardware or not. Instead of passing yet another argument, change the function to take a struct containing the different route attributes. v2: * Name last argument of fib_dump_info() * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could later be passed to fib_alias_hw_flags_set() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14ipv4: Replace route in list before notifyingIdo Schimmel1-4/+7
Subsequent patches will add an offload / trap indication to routes which will signal if the route is present in hardware or not. After programming the route to the hardware, drivers will have to ask the IPv4 code to set the flags by passing the route's key. In the case of route replace, the new route is notified before it is actually inserted into the FIB alias list. This can prevent simple drivers (e.g., netdevsim) that program the route to the hardware in the same context it is notified in from being able to set the flag. Solve this by first inserting the new route to the list and rollback the operation in case the route was vetoed. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: ipv4: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld1-5/+3
This is a straight-forward conversion case for the new function, keeping the flow of the existing code as intact as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: udp: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld1-2/+1
This is a straight-forward conversion case for the new function, iterating over the return value from udp_rcv_segment, which actually is a wrapper around skb_gso_segment. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14vti[6]: fix packet tx through bpf_redirect()Nicolas Dichtel1-2/+11
With an ebpf program that redirects packets through a vti[6] interface, the packets are dropped because no dst is attached. This could also be reproduced with an AF_PACKET socket, with the following python script (vti1 is an ip_vti interface): import socket send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0) # scapy # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request') # raw(p) req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00' send_s.sendto(req, ('vti1', 0x800, 0, 0)) Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-13netfilter: arp_tables: init netns pointer in xt_tgdtor_param structFlorian Westphal1-9/+10
An earlier commit (1b789577f655060d98d20e, "netfilter: arp_tables: init netns pointer in xt_tgchk_param struct") fixed missing net initialization for arptables, but turns out it was incomplete. We can get a very similar struct net NULL deref during error unwinding: general protection fault: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77 xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175 cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline] translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587 do_replace net/ipv4/netfilter/arp_tables.c:981 [inline] do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461 Also init the netns pointer in xt_tgdtor_param struct. Fixes: add67461240c1d ("netfilter: add struct net * to target parameters") Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>