aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4/inet_diag.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-05-12net: inet: Retire port only listening_hashMartin KaFai Lau1-2/+3
The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash. The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2. There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock() which uses the listening_hash to search a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443). [ On top of the slowness in adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case. This patch takes this chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash. Since most changes need to be done together, it is hard to cut the listening_hash to lhash2 switch into small patches. The changes in this patch is highlighted here for the review purpose. 1. Because of the listening_hash removal, lhash2 can use the sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node. This will also keep the sk_unhashed() check to work as is after stop adding sk to listening_hash. The union is removed from inet_listen_hashbucket because only nulls_head is needed. 2. icsk->icsk_listen_portaddr_node and its helpers are removed. 3. The current lhash2 users needs to iterate with sk_nulls_node instead of icsk_listen_portaddr_node. One case is in the inet[6]_lhash2_lookup(). Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() does a "next" first before iterating. 4. Move the remaining listening_hash usage to lhash2 inet_reuseport_add_sock() which this series is trying to improve. inet_diag.c and mptcp_diag.c are the final two remaining use cases and is moved to lhash2 now also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10inet_diag: fix kernel-infoleak for UDP socketsEric Dumazet1-3/+1
KMSAN reported a kernel-infoleak [1], that can exploited by unpriv users. After analysis it turned out UDP was not initializing r->idiag_expires. Other users of inet_sk_diag_fill() might make the same mistake in the future, so fix this in inet_sk_diag_fill(). [1] BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline] BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:156 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670 instrument_copy_to_user include/linux/instrumented.h:121 [inline] copyout lib/iov_iter.c:156 [inline] _copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670 copy_to_iter include/linux/uio.h:155 [inline] simple_copy_to_iter+0xf3/0x140 net/core/datagram.c:519 __skb_datagram_iter+0x2cb/0x1280 net/core/datagram.c:425 skb_copy_datagram_iter+0xdc/0x270 net/core/datagram.c:533 skb_copy_datagram_msg include/linux/skbuff.h:3657 [inline] netlink_recvmsg+0x660/0x1c60 net/netlink/af_netlink.c:1974 sock_recvmsg_nosec net/socket.c:944 [inline] sock_recvmsg net/socket.c:962 [inline] sock_read_iter+0x5a9/0x630 net/socket.c:1035 call_read_iter include/linux/fs.h:2156 [inline] new_sync_read fs/read_write.c:400 [inline] vfs_read+0x1631/0x1980 fs/read_write.c:481 ksys_read+0x28c/0x520 fs/read_write.c:619 __do_sys_read fs/read_write.c:629 [inline] __se_sys_read fs/read_write.c:627 [inline] __x64_sys_read+0xdb/0x120 fs/read_write.c:627 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] __kmalloc_node_track_caller+0xe0c/0x1510 mm/slub.c:4974 kmalloc_reserve net/core/skbuff.c:354 [inline] __alloc_skb+0x545/0xf90 net/core/skbuff.c:426 alloc_skb include/linux/skbuff.h:1126 [inline] netlink_dump+0x3d5/0x16a0 net/netlink/af_netlink.c:2245 __netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370 netlink_dump_start include/linux/netlink.h:254 [inline] inet_diag_handler_cmd+0x2e7/0x400 net/ipv4/inet_diag.c:1343 sock_diag_rcv_msg+0x24a/0x620 netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491 sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:276 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] sock_write_iter+0x594/0x690 net/socket.c:1057 do_iter_readv_writev+0xa7f/0xc70 do_iter_write+0x52c/0x1500 fs/read_write.c:851 vfs_writev fs/read_write.c:924 [inline] do_writev+0x63f/0xe30 fs/read_write.c:967 __do_sys_writev fs/read_write.c:1040 [inline] __se_sys_writev fs/read_write.c:1037 [inline] __x64_sys_writev+0xe5/0x120 fs/read_write.c:1037 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Bytes 68-71 of 312 are uninitialized Memory access of size 312 starts at ffff88812ab54000 Data copied to user address 0000000020001440 CPU: 1 PID: 6365 Comm: syz-executor801 Not tainted 5.16.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 3c4d05c80567 ("inet_diag: Introduce the inet socket dumping routine") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211209185058.53917-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27net: introduce sk_forward_alloc_get()Paolo Abeni1-1/+1
A later patch will change the MPTCP memory accounting schema in such a way that MPTCP sockets will encode the total amount of forward allocated memory in two separate fields (one for tx and one for rx). MPTCP sockets will use their own helper to provide the accurate amount of fwd allocated memory. To allow the above, this patch adds a new, optional, sk method to fetch the fwd memory, wrap the call in a new helper and use it where it is appropriate. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-13net: Use nlmsg_unicast() instead of netlink_unicast()Yajun Deng1-4/+1
It has 'if (err >0 )' statement in nlmsg_unicast(), so use nlmsg_unicast() instead of netlink_unicast(), this looks more concise. v2: remove the change in netfilter. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16inet_diag: add support for tw_markMaciej Żenczykowski1-2/+10
Timewait sockets have included mark since approx 4.18. Cc: Eric Dumazet <edumazet@google.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Fixes: 00483690552c ("tcp: Add mark for TIMEWAIT sockets") Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-11-17inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()Wang Hai1-1/+3
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message. Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-5/+15
Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-21inet_diag: validate INET_DIAG_REQ_PROTOCOL attributeEric Dumazet1-5/+15
User space could send an invalid INET_DIAG_REQ_PROTOCOL attribute as caught by syzbot. BUG: KMSAN: uninit-value in inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline] BUG: KMSAN: uninit-value in __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147 CPU: 0 PID: 8505 Comm: syz-executor174 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x21c/0x280 lib/dump_stack.c:118 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219 inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline] __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147 inet_diag_dump_compat+0x2a5/0x380 net/ipv4/inet_diag.c:1254 netlink_dump+0xb73/0x1cb0 net/netlink/af_netlink.c:2246 __netlink_dump_start+0xcf2/0xea0 net/netlink/af_netlink.c:2354 netlink_dump_start include/linux/netlink.h:246 [inline] inet_diag_rcv_msg_compat+0x5da/0x6c0 net/ipv4/inet_diag.c:1288 sock_diag_rcv_msg+0x24f/0x620 net/core/sock_diag.c:256 netlink_rcv_skb+0x6d7/0x7e0 net/netlink/af_netlink.c:2470 sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x11c8/0x1490 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x173a/0x1840 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353 ___sys_sendmsg net/socket.c:2407 [inline] __sys_sendmsg+0x6d1/0x820 net/socket.c:2440 __do_sys_sendmsg net/socket.c:2449 [inline] __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x441389 Code: e8 fc ab 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fff3b02ce98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441389 RDX: 0000000000000000 RSI: 0000000020001500 RDI: 0000000000000003 RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8 R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000402130 R13: 00000000004021c0 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:143 [inline] kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:126 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80 slab_alloc_node mm/slub.c:2907 [inline] __kmalloc_node_track_caller+0x9aa/0x12f0 mm/slub.c:4511 __kmalloc_reserve net/core/skbuff.c:142 [inline] __alloc_skb+0x35f/0xb30 net/core/skbuff.c:210 alloc_skb include/linux/skbuff.h:1094 [inline] netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline] netlink_sendmsg+0xdb9/0x1840 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353 ___sys_sendmsg net/socket.c:2407 [inline] __sys_sendmsg+0x6d1/0x820 net/socket.c:2440 __do_sys_sendmsg net/socket.c:2449 [inline] __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 3f935c75eb52 ("inet_diag: support for wider protocol numbers") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Christoph Paasch <cpaasch@apple.com> Cc: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03ip: expose inet sockopts through inet_diagWei Wang1-0/+17
Expose all exisiting inet sockopt bits through inet_diag for debug purpose. Corresponding changes in iproute2 ss will be submitted to output all these values. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09inet_diag: support for wider protocol numbersPaolo Abeni1-17/+48
After commit bf9765145b85 ("sock: Make sk_protocol a 16-bit value") the current size of 'sdiag_protocol' is not sufficient to represent the possible protocol values. This change introduces a new inet diag request attribute to let user space specify the relevant protocol number using u32 values. The attribute is parsed by inet diag core on get/dump command and the extended protocol value, if available, is preferred to 'sdiag_protocol' to lookup the diag handler. The parse attributed are exposed to all the diag handlers via the cb->data. Note that inet_diag_dump_one_icsk() is left unmodified, as it will not be used by protocol using the extended attribute. Suggested-by: David S. Miller <davem@davemloft.net> Co-developed-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02inet_diag: bc: read cgroup id only for full socketsDmitry Yakunin1-1/+2
Fix bug introduced by commit b1f3e43dbfac ("inet_diag: add support for cgroup filter"). Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reported-by: syzbot+ee80f840d9bf6893223b@syzkaller.appspotmail.com Reported-by: syzbot+13bef047dbfffa5cd1af@syzkaller.appspotmail.com Fixes: b1f3e43dbfac ("inet_diag: add support for cgroup filter") Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30inet_diag: add support for cgroup filterDmitry Yakunin1-0/+31
This patch adds ability to filter sockets based on cgroup v2 ID. Such filter is helpful in ss utility for filtering sockets by cgroup pathname. Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30inet_diag: add cgroup id attributeDmitry Yakunin1-0/+7
This patch adds cgroup v2 ID to common inet diag message attributes. Cgroup v2 ID is kernfs ID (ino or ino+gen). This attribute allows filter inet diag output by cgroup ID obtained by name_to_handle_at() syscall. When net_cls or net_prio cgroup is activated this ID is equal to 1 (root cgroup ID) for newly created sockets. Some notes about this ID: 1) gets initialized in socket() syscall 2) incoming socket gets ID from listening socket (not during accept() syscall) 3) not changed when process get moved to another cgroup 4) can point to deleted cgroup (refcounting) v2: - use CONFIG_SOCK_CGROUP_DATA instead if CONFIG_CGROUPS v3: - fix attr size by using nla_total_size_64bit() (Eric Dumazet) - more detailed commit message (Konstantin Khlebnikov) Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-By: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-24/+20
Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08inet_diag: return classid for all socket typesDmitry Yakunin1-24/+20
In commit 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority") croup classid reporting was fixed. But this works only for TCP sockets because for other socket types icsk parameter can be NULL and classid code path is skipped. This change moves classid handling to inet_diag_msg_attrs_fill() function. Also inet_diag_msg_attrs_size() helper was added and addends in nlmsg_new() were reordered to save order from inet_sk_diag_fill(). Fixes: 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority") Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()Martin KaFai Lau1-0/+74
This patch will dump out the bpf_sk_storages of a sk if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr. An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump. If no map_fd is specified, all bpf_sk_storages of a sk will be dumped. bpf_sk_storages can be added to the system at runtime. It is difficult to find a proper static value for cb->min_dump_alloc. This patch learns the nlattr size required to dump the bpf_sk_storages of a sk. If it happens to be the very first nlmsg of a dump and it cannot fit the needed bpf_sk_storages, it will try to expand the skb by "pskb_expand_head()". Instead of expanding it in inet_sk_diag_fill(), it is expanded at a sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can be used. In __inet_diag_dump(), it will retry as long as the skb is empty and the cb->min_dump_alloc becomes larger than before. cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE. The min_dump_alloc is also changed from 'u16' to 'u32' to accommodate a sk that may have a few large bpf_sk_storages. The updated cb->min_dump_alloc will also be used to allocate the skb in the next dump. This logic already exists in netlink_dump(). Here is the sample output of a locally modified 'ss' and it could be made more readable by using BTF later: [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989' State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess ESTAB 0 0 [::1]:51072 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] ESTAB 0 0 [::1]:51070 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989' State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 0 0 [::1]:51072 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ] ESTAB 0 0 [::1]:51070 [::1]:8989 bpf_map_id:14 value:[ 3feb ] bpf_map_id:13 value:[ 3f ] bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ] Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
2020-02-27inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->dataMartin KaFai Lau1-47/+70
The INET_DIAG_REQ_BYTECODE nlattr is currently re-found every time when the "dump()" is re-started. In a latter patch, it will also need to parse the new INET_DIAG_REQ_SK_BPF_STORAGES nlattr to learn the map_fds. Thus, this patch takes this chance to store the parsed nlattr in cb->data during the "start" time of a dump. By doing this, the "bc" argument also becomes unnecessary and is removed. Also, the two copies of the INET_DIAG_REQ_BYTECODE parsing-audit logic between compat/current version can be consolidated to one. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225230415.1975555-1-kafai@fb.com
2020-02-27inet_diag: Refactor inet_sk_diag_fill(), dump(), and dump_one()Martin KaFai Lau1-72/+44
In a latter patch, there is a need to update "cb->min_dump_alloc" in inet_sk_diag_fill() as it learns the diffierent bpf_sk_storages stored in a sk while dumping all sk(s) (e.g. tcp_hashinfo). The inet_sk_diag_fill() currently does not take the "cb" as an argument. One of the reason is inet_sk_diag_fill() is used by both dump_one() and dump() (which belong to the "struct inet_diag_handler". The dump_one() interface does not pass the "cb" along. This patch is to make dump_one() pass a "cb". The "cb" is created in inet_diag_cmd_exact(). The "nlh" and "in_skb" are stored in "cb" as the dump() interface does. The total number of args in inet_sk_diag_fill() is also cut from 10 to 7 and that helps many callers to pass fewer args. In particular, "struct user_namespace *user_ns", "u32 pid", and "u32 seq" can be replaced by accessing "cb->nlh" and "cb->skb". A similar argument reduction is also made to inet_twsk_diag_fill() and inet_req_diag_fill(). inet_csk_diag_dump() and inet_csk_diag_fill() are also removed. They are mostly equivalent to inet_sk_diag_fill(). Their repeated usages are very limited. Thus, inet_sk_diag_fill() is directly used in those occasions. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225230409.1975173-1-kafai@fb.com
2019-12-13tcp/dccp: fix possible race __inet_lookup_established()Eric Dumazet1-1/+2
Michal Kubecek and Firo Yang did a very nice analysis of crashes happening in __inet_lookup_established(). Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN (via a close()/socket()/listen() cycle) without a RCU grace period, I should not have changed listeners linkage in their hash table. They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt), so that a lookup can detect a socket in a hash list was moved in another one. Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix"), we have to add hlist_nulls_add_tail_rcu() helper. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Firo Yang <firo.yang@suse.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/ Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-06inet_diag: use jiffies_delta_to_msecs()Eric Dumazet1-9/+6
Use jiffies_delta_to_msecs() to avoid reporting 'infinite' timeouts and to cleanup code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13tcp: annotate sk->sk_wmem_queued lockless readsEric Dumazet1-1/+1
For the sake of tcp_poll(), there are few places where we fetch sk->sk_wmem_queued while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. sk_wmem_queued_add() helper is added so that we can in the future convert to ADD_ONCE() or equivalent if/when available. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner1-5/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12inet_diag: fix reporting cgroup classid and fallback to priorityKonstantin Khlebnikov1-1/+9
Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested extensions has only 8 bits. Thus extensions starting from DCTCPINFO cannot be requested directly. Some of them included into response unconditionally or hook into some of lower 8 bits. Extension INET_DIAG_CLASS_ID has not way to request from the beginning. This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space reservation, and documents behavior for other extensions. Also this patch adds fallback to reporting socket priority. This filed is more widely used for traffic classification because ipv4 sockets automatically maps TOS to priority and default qdisc pfifo_fast knows about that. But priority could be changed via setsockopt SO_PRIORITY so INET_DIAG_TOS isn't enough for predicting class. Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen). So, after this patch INET_DIAG_CLASS_ID will report socket priority for most common setup when net_cls isn't set and/or cgroup2 in use. Fixes: 0888e372c37f ("net: inet: diag: expose sockets cgroup classid") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20tcp: fix a race in inet_diag_dump_icsk()Eric Dumazet1-1/+3
Alexei reported use after frees in inet_diag_dump_icsk() [1] Because we use refcount_set() when various sockets are setup and inserted into ehash, we also need to make sure inet_diag_dump_icsk() wont race with the refcount_set() operations. Jonathan Lemon sent a patch changing net_twsk_hashdance() but other spots would need risky changes. Instead, fix inet_diag_dump_icsk() as this bug came with linux-4.10 only. [1] Quoting Alexei : First something iterating over sockets finds already freed tw socket: refcount_t: increment on 0; use-after-free. WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30 RIP: 0010:refcount_inc+0x26/0x30 RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282 RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8 RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328 FS: 00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag] // sock_hold(sk); in net/ipv4/inet_diag.c:1002 ? kmalloc_large_node+0x37/0x70 ? __kmalloc_node_track_caller+0x1cb/0x260 ? __alloc_skb+0x72/0x1b0 ? __kmalloc_reserve.isra.40+0x2e/0x80 __inet_diag_dump+0x3b/0x80 [inet_diag] netlink_dump+0x116/0x2a0 netlink_recvmsg+0x205/0x3c0 sock_read_iter+0x89/0xd0 __vfs_read+0xf7/0x140 vfs_read+0x8a/0x140 SyS_read+0x3f/0xa0 do_syscall_64+0x5a/0x100 then a minute later twsk timer fires and hits two bad refcnts for this freed socket: refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40 Modules linked in: RIP: 0010:refcount_dec+0x2e/0x40 RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296 RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3 R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80 R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> inet_twsk_kill+0x9d/0xc0 // inet_twsk_bind_unhash(tw, hashinfo); call_timer_fn+0x29/0x110 run_timer_softirq+0x36b/0x3a0 refcount_t: underflow; use-after-free. WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50 RIP: 0010:refcount_sub_and_test+0x46/0x50 RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296 RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358 R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> inet_twsk_put+0x12/0x20 // inet_twsk_put(tw); call_timer_fn+0x29/0x110 run_timer_softirq+0x36b/0x3a0 Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexei Starovoitov <ast@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12sock_diag: request _diag module only when the family or proto has been registeredXin Long1-2/+1
Now when using 'ss' in iproute, kernel would try to load all _diag modules, which also causes corresponding family and proto modules to be loaded as well due to module dependencies. Like after running 'ss', sctp, dccp, af_packet (if it works as a module) would be loaded. For example: $ lsmod|grep sctp $ ss $ lsmod|grep sctp sctp_diag 16384 0 sctp 323584 5 sctp_diag inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag libcrc32c 16384 3 nf_conntrack,nf_nat,sctp As these family and proto modules are loaded unintentionally, it could cause some problems, like: - Some debug tools use 'ss' to collect the socket info, which loads all those diag and family and protocol modules. It's noisy for identifying issues. - Users usually expect to drop sctp init packet silently when they have no sense of sctp protocol instead of sending abort back. - It wastes resources (especially with multiple netns), and SCTP module can't be unloaded once it's loaded. ... In short, it's really inappropriate to have these family and proto modules loaded unexpectedly when just doing debugging with inet_diag. This patch is to introduce sock_load_diag_module() where it loads the _diag module only when it's corresponding family or proto has been already registered. Note that we can't just load _diag module without the family or proto loaded, as some symbols used in _diag module are from the family or proto module. v1->v2: - move inet proto check to inet_diag to avoid a compiling err. v2->v3: - define sock_load_diag_module in sock.c and export one symbol only. - improve the changelog. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Phil Sutter <phil@nwl.cc> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02inet_diag: Add equal-operator for portsKristian Evensen1-0/+8
inet_diag currently provides less/greater than or equal operators for comparing ports when filtering sockets. An equal comparison can be performed by combining the two existing operators, or a user can for example request a port range and then do the final filtering in userspace. However, these approaches both have drawbacks. Implementing equal using LE/GE causes the size and complexity of a filter to grow quickly as the number of ports increase, while it on busy machines would be great if the kernel only returns information about relevant sockets. This patch introduces source and destination port equal operators. INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a destination port, and usage is the same as for the existing port operators. I.e., the port to match is stored in the no-member of the next inet_diag_bc_op-struct in the filter. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01inet_diag: allow protocols to provide additional dataIvan Delalande1-4/+18
Extend inet_diag_handler to allow individual protocols to report additional data on INET_DIAG_INFO through idiag_get_aux. The size can be dynamic and is computed by idiag_get_aux_size. Signed-off-by: Ivan Delalande <colona@arista.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18net: inet: diag: expose sockets cgroup classidLevin, Alexander (Sasha Levin)1-0/+11
This is useful for directly looking up a task based on class id rather than having to scan through all open file descriptors. Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13tcp: remove early retransmitYuchung Cheng1-1/+0
This patch removes the support of RFC5827 early retransmit (i.e., fast recovery on small inflight with <3 dupacks) because it is subsumed by the new RACK loss detection. More specifically when RACK receives DUPACKs, it'll arm a reordering timer to start fast recovery after a quarter of (min)RTT, hence it covers the early retransmit except RACK does not limit itself to specific inflight or dupack numbers. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13tcp: add reordering timer in RACK loss detectionYuchung Cheng1-0/+1
This patch makes RACK install a reordering timer when it suspects some packets might be lost, but wants to delay the decision a little bit to accomodate reordering. It does not create a new timer but instead repurposes the existing RTO timer, because both are meant to retransmit packets. Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when the RACK timing check fails. The wait time is set to RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge This translates to expecting a packet (Packet) should take (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent. When there are multiple packets that need a timer, we use one timer with the maximum timeout. Therefore the timer conservatively uses the maximum window to expire N packets by one timeout, instead of N timeouts to expire N packets sent at different times. The fudge factor is 2 jiffies to ensure when the timer fires, all the suspected packets would exceed the deadline and be marked lost by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the clock may tick between calling icsk_reset_xmit_timer(timeout) and actually hang the timer. The next jiffy is to lower-bound the timeout to 2 jiffies when reo_wnd is < 1ms. When the reordering timer fires (tcp_rack_reo_timeout): If we aren't in Recovery we'll enter fast recovery and force fast retransmit. This is very similar to the early retransmit (RFC5827) except RACK is not constrained to only enter recovery for small outstanding flights. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09tcp: no longer hold ehash lock while calling tcp_get_info()Eric Dumazet1-15/+33
We had various problems in the past in tcp_get_info() and used specific synchronization to avoid deadlocks. We would like to add more instrumentation points for TCP, and avoiding grabing socket lock in tcp_getinfo() was too costly. Being able to lock the socket allows to provide consistent set of fields. inet_diag_dump_icsk() can make sure ehash locks are not held any more when tcp_get_info() is called. We can remove syncp added in commit d654976cbf85 ("tcp: fix a potential deadlock in tcp_get_info()"), but we need to use lock_sock_fast() instead of spin_lock_bh() since TCP input path can now be run from process context. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net: ip, diag -- Add diag interface for raw socketsCyrill Gorcunov1-0/+9
In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. v6: - use sdiag_raw_protocol() helper which will access @pad structure used for raw sockets protocol specification: we can't simply rename this member without breaking uapi v7: - sine sdiag_raw_protocol() helper is not suitable for uapi lets rather make an alias structure with proper names. __check_inet_diag_req_raw helper will catch if any of structure unintentionally changed. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20tcp: relax listening_hash operationsEric Dumazet1-12/+4
softirq handlers use RCU protection to lookup listeners, and write operations all happen from process context. We do not need to block BH for dump operations. Also SYN_RECV since request sockets are stored in the ehash table : 1) inet_diag_dump_icsk() no longer need to clear cb->args[3] and cb->args[4] that were used as cursors while iterating the old per listener hash table. 2) Also factorize a test : No need to scan listening_hash[] if r->id.idiag_dport is not zero. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08net: inet: diag: expose the socket mark to privileged processes.Lorenzo Colitti1-16/+33
This adds the capability for a process that has CAP_NET_ADMIN on a socket to see the socket mark in socket dumps. Commit a52e95abf772 ("net: diag: allow socket bytecode filters to match socket marks") recently gave privileged processes the ability to filter socket dumps based on mark. This patch is complementary: it ensures that the mark is also passed to userspace in the socket's netlink attributes. It is useful for tools like ss which display information about sockets. Tested: https://android-review.googlesource.com/270210 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-24net: diag: allow socket bytecode filters to match socket marksLorenzo Colitti1-3/+33
This allows a privileged process to filter by socket mark when dumping sockets via INET_DIAG_BY_FAMILY. This is useful on systems that use mark-based routing such as Android. The ability to filter socket marks requires CAP_NET_ADMIN, which is consistent with other privileged operations allowed by the SOCK_DIAG interface such as the ability to destroy sockets and the ability to inspect BPF filters attached to packet sockets. Tested: https://android-review.googlesource.com/261350 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-24net: diag: slightly refactor the inet_diag_bc_audit error checks.Lorenzo Colitti1-11/+17
This simplifies the code a bit and also allows inet_diag_bc_audit to send to userspace an error that isn't EINVAL. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-28net: diag: Add support to filter on device indexDavid Ahern1-0/+25
Add support to inet_diag facility to filter sockets based on device index. If an interface index is in the filter only sockets bound to that index (sk_bound_dev_if) are returned. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26sock_diag: align nlattr properly when neededNicolas Dichtel1-3/+6
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f190 which is only in net-next right now, thus I didn't make a separate patch. Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diagXin Long1-6/+6
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates back to before jiffies_to_msecs() has been introduced. Now we can remove it and use jiffies_to_msecs(). Suggested-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15sctp: export some functions for sctp_diag in inet_diagXin Long1-26/+41
inet_diag_msg_common_fill is used to fill the diag msg common info, we need to use it in sctp_diag as well, so export it. inet_diag_msg_attrs_fill is used to fill some common attrs info between sctp diag and tcp diag. v2->v3: - do not need to define and export inet_diag_get_handler any more. cause all the functions in it are in sctp_diag.ko, we just call them in sctp_diag.ko. - add inet_diag_msg_attrs_fill to make codes clear. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp/dccp: do not touch listener sk_refcnt under synfloodEric Dumazet1-2/+1
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes. By letting listeners use SOCK_RCU_FREE infrastructure, we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets, only listeners are impacted by this change. Peak performance under SYNFLOOD is increased by ~33% : On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps Most consuming functions are now skb_set_owner_w() and sock_wfree() contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp/dccp: use rcu locking in inet_diag_find_one_icsk()Eric Dumazet1-2/+5
RX packet processing holds rcu_read_lock(), so we can remove pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag also holds rcu before calling them. This is needed anyway as __inet_lookup_listener() and inet6_lookup_listener() will soon no longer increment refcount on the found listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14net: diag: add a scheduling point in inet_diag_dump_icsk()Eric Dumazet1-0/+1
On loaded TCP servers, looking at millions of sockets can hold cpu for many seconds, if the lookup condition is very narrow. (eg : ss dst 1.2.3.4 ) Better add a cond_resched() to allow other processes to access the cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11inet: refactor inet[6]_lookup functions to take skbCraig Gallek1-3/+3
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-20net: diag: support v4mapped sockets in inet_diag_find_one_icsk()Eric Dumazet1-7/+14
Lorenzo reported that we could not properly find v4mapped sockets in inet_diag_find_one_icsk(). This patch fixes the issue. Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net: diag: Support SOCK_DESTROY for inet sockets.Lorenzo Colitti1-8/+15
This passes the SOCK_DESTROY operation to the underlying protocol diag handler, or returns -EOPNOTSUPP if that handler does not define a destroy operation. Most of this patch is just renaming functions. This is not strictly necessary, but it would be fairly counterintuitive to have the code to destroy inet sockets be in a function whose name starts with inet_diag_get. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net: diag: split inet_diag_dump_one_icsk into twoLorenzo Colitti1-15/+27
Currently, inet_diag_dump_one_icsk finds a socket and then dumps its information to userspace. Split it into a part that finds the socket and a part that dumps the information. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp/dccp: install syn_recv requests into ehash tableEric Dumazet1-88/+8
In this patch, we insert request sockets into TCP/DCCP regular ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of using the per listener hash table. ACK packets find SYN_RECV pseudo sockets without having to find and lock the listener. In nominal conditions, this halves pressure on listener lock. Note that this will allow for SO_REUSEPORT refinements, so that we can select a listener using cpu/numa affinities instead of the prior 'consistent hash', since only SYN packets will apply this selection logic. We will shrink listen_sock in the following patch to ease code review. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp: move qlen/young out of struct listen_sockEric Dumazet1-1/+1
qlen_inc & young_inc were protected by listener lock, while qlen_dec & young_dec were atomic fields. Everything needs to be atomic for upcoming lockless listener. Also move qlen/young in request_sock_queue as we'll get rid of struct listen_sock eventually. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10net: inet_diag: always export IPV6_V6ONLY sockopt for listening socketsPhil Sutter1-2/+2
Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY sockopt", I am not happy with the limitations it causes for socket analysing code in userspace. Exporting the value only if it is set makes it hard for userspace to decide whether the option is not set or the kernel does not support exporting the option at all. >From an auditor's perspective, the interesting question for listening AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is the unexpected case. This patch allows to answer this question reliably. Signed-off-by: Phil Sutter <phil@nwl.cc> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>