aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-10/+27
Two easy cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24ipv4: add sanity checks in ipv4_link_failure()Eric Dumazet1-9/+23
Before calling __ip_options_compile(), we need to ensure the network header is a an IPv4 one, and that it is already pulled in skb->head. RAW sockets going through a tunnel can end up calling ipv4_link_failure() with total garbage in the skb, or arbitrary lengthes. syzbot report : BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204 CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x38/0x50 mm/kasan/common.c:133 memcpy include/linux/string.h:355 [inline] __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204 dst_link_failure include/net/dst.h:427 [inline] vti6_xmit net/ipv6/ip6_vti.c:514 [inline] vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553 __netdev_start_xmit include/linux/netdevice.h:4414 [inline] netdev_start_xmit include/linux/netdevice.h:4423 [inline] xmit_one net/core/dev.c:3292 [inline] dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527 neigh_output include/net/neighbour.h:508 [inline] ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:278 [inline] ip_output+0x21f/0x670 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:289 [inline] raw_send_hdrinc net/ipv4/raw.c:432 [inline] raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:661 sock_write_iter+0x27c/0x3e0 net/socket.c:988 call_write_iter include/linux/fs.h:1866 [inline] new_sync_write+0x4c7/0x760 fs/read_write.c:474 __vfs_write+0xe4/0x110 fs/read_write.c:487 vfs_write+0x20c/0x580 fs/read_write.c:549 ksys_write+0x14f/0x2d0 fs/read_write.c:599 __do_sys_write fs/read_write.c:611 [inline] __se_sys_write fs/read_write.c:608 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:608 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458c29 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29 RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4 R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff The buggy address belongs to the page: page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1fffc0000000000() raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ^ ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23net: Change nhc_flags to unsigned charDavid Ahern1-4/+4
nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh). All of the RTNH_F_ flags fit in an unsigned char, and since the API to userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it can not grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the size of the struct by 8, from 56 to 48 bytes. Update the flags arguments for up netdevice events and fib_nexthop_info which determines the RTNH_F flags to return on a dump/event. The RTNH_F flags are passed in the lower byte of rtm_flags which is an unsigned int so use a temp variable for the flags to fib_nexthop_info and combine with rtm_flags in the caller. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23lwtunnel: Pass encap and encap type attributes to lwtunnel_fill_encapDavid Ahern1-1/+2
Currently, lwtunnel_fill_encap hardcodes the encap and encap type attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop objects want to re-use this code but the encap attributes passed to userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only difference, change lwtunnel_fill_encap to take the attribute type as an input. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22net: bpfilter: dont use module_init in non-modular codePaul Gortmaker1-2/+1
The Kconfig controlling this code is: bpfilter/Kconfig:menuconfig BPFILTER bpfilter/Kconfig: bool "BPF based packet filtering framework (BPFILTER)" Since it isn't a module, we shouldn't use module_init(). Instead we use device_initcall() - which is exactly what module_init() defaults to for non-modular code/builds. We don't remove <linux/module.h> from the includes since this file does a request_module() and hence is a valid user of that header file, even though it is not modular itself. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22net: Rename net/nexthop.h net/rtnh.hDavid Ahern2-2/+2
The header contains rtnh_ macros so rename the file accordingly. Allows a later patch to use the nexthop.h name for the new nexthop code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19tcp: properly reset skb->truesize for tx recyclingEric Dumazet1-1/+1
tcp sendmsg() and sendpage() normally advance skb->data_len and skb->truesize by the payload added to an skb. But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages, even if a single byte of payload is used in the page. This means that we can not assume skb->truesize can be adjusted by skb->data_len. We must instead overwrite its value. Otherwise skb->truesize is too big and can hit socket sndbuf limit, especially if the skb is recycled multiple times :/ Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19net: rework SIOCGSTAMP ioctl handlingArnd Bergmann1-6/+3
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many socket protocol handlers, and all of those end up calling the same sock_get_timestamp()/sock_get_timestampns() helper functions, which results in a lot of duplicate code. With the introduction of 64-bit time_t on 32-bit architectures, this gets worse, as we then need four different ioctl commands in each socket protocol implementation. To simplify that, let's add a new .gettstamp() operation in struct proto_ops, and move ioctl implementation into the common sock_ioctl()/compat_sock_ioctl_trans() functions that these all go through. We can reuse the sock_get_timestamp() implementation, but generalize it so it can deal with both native and compat mode, as well as timeval and timespec structures. Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17ipv4: set the tcp_min_rtt_wlen range from 0 to one dayZhangXiaoxu1-1/+4
There is a UBSAN report as below: UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56 signed integer overflow: 2147483647 * 1000 cannot be represented in type 'int' CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1 Call Trace: <IRQ> dump_stack+0x8c/0xba ubsan_epilogue+0x11/0x60 handle_overflow+0x12d/0x170 ? ttwu_do_wakeup+0x21/0x320 __ubsan_handle_mul_overflow+0x12/0x20 tcp_ack_update_rtt+0x76c/0x780 tcp_clean_rtx_queue+0x499/0x14d0 tcp_ack+0x69e/0x1240 ? __wake_up_sync_key+0x2c/0x50 ? update_group_capacity+0x50/0x680 tcp_rcv_established+0x4e2/0xe10 tcp_v4_do_rcv+0x22b/0x420 tcp_v4_rcv+0xfe8/0x1190 ip_protocol_deliver_rcu+0x36/0x180 ip_local_deliver+0x15b/0x1a0 ip_rcv+0xac/0xd0 __netif_receive_skb_one_core+0x7f/0xb0 __netif_receive_skb+0x33/0xc0 netif_receive_skb_internal+0x84/0x1c0 napi_gro_receive+0x2a0/0x300 receive_buf+0x3d4/0x2350 ? detach_buf_split+0x159/0x390 virtnet_poll+0x198/0x840 ? reweight_entity+0x243/0x4b0 net_rx_action+0x25c/0x770 __do_softirq+0x19b/0x66d irq_exit+0x1eb/0x230 do_IRQ+0x7a/0x150 common_interrupt+0xf/0xf </IRQ> It can be reproduced by: echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-35/+40
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16tcp: tcp_grow_window() needs to respect tcp_space()Eric Dumazet1-5/+5
For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it to grow past tcp_space() This is causing hard to debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows. Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks. This bug predates git history. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller4-209/+3
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14ipv4: ensure rcu_read_lock() in ipv4_link_failure()Eric Dumazet1-2/+8
fib_compute_spec_dst() needs to be called under rcu protection. syzbot reported : WARNING: suspicious RCU usage 5.1.0-rc4+ #165 Not tainted include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by swapper/0/0: #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline] #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline] fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294 spec_dst_fill net/ipv4/ip_options.c:245 [inline] __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12ipv4: recompile ip options in ipv4_link_failureStephen Suryaputra1-1/+9
Recompile IP options since IPCB may not be valid anymore when ipv4_link_failure is called from arp_error_report. Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") and the commit before that (9ef6b42ad6fd) for a similar issue. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11dctcp: more accurate tracking of packets deliveryEric Dumazet1-28/+17
After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE marks") core TCP stack does a very good job tracking ECN signals. The "sender's best estimate of CE information" Yuchung mentioned in his patch is indeed the best we can do. DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, and use the existing best estimate. This solves some problems, since current DCTCP logic does not deal with losses and/or GRO or ack aggregation very well. This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss (this should have been tp->mss_cache), and a 64 bit divide. Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for every ACK could be done differently, calling it only once per RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Abdul Kabbani <akabbani@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11netfilter: x_tables: merge ip and ipv6 masquerade modulesFlorian Westphal3-111/+3
No need to have separate modules for this. before: text data bss dec filename 2038 1168 0 3206 net/ipv4/netfilter/ipt_MASQUERADE.ko 1526 1024 0 2550 net/ipv6/netfilter/ip6t_MASQUERADE.ko after: text data bss dec filename 2521 1296 0 3817 net/netfilter/xt_MASQUERADE.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11netfilter: nf_nat: merge ip/ip6 masquerade headersFlorian Westphal1-1/+1
Both are now implemented by nf_nat_masquerade.c, so no need to keep different headers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11net: fou: remove redundant code in gue_udp_recvLorenzo Bianconi1-3/+1
Remove not useful protocol version check in gue_udp_recv since just gue version 0 can hit that code. Moreover remove duplicated hdrlen computation Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recvLorenzo Bianconi1-1/+3
gue tunnels run iptunnel_pull_offloads on received skbs. This can determine a possible use-after-free accessing guehdr pointer since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device) Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10fou: correct spelling of encapsulationSimon Horman1-2/+2
Correct spelling of encapsulation. Found by inspection. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10ipv4: Handle RTA_GATEWAY set to 0David Ahern2-2/+4
Govindarajulu reported a regression with Network Manager which sends an RTA_GATEWAY attribute with the address set to 0. Fixup the handling of RTA_GATEWAY to only set fc_gw_family if the gateway address is actually set. Fixes: f35b794b3b405 ("ipv4: Prepare fib_config for IPv6 gateway") Reported-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-5/+10
2019-04-08net: ip_gre: fix possible use-after-free in erspan_rcvLorenzo Bianconi1-5/+10
erspan tunnels run __iptunnel_pull_header on received skbs to remove gre and erspan headers. This can determine a possible use-after-free accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device). Fix it resetting pkt_md pointer after __iptunnel_pull_header Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Allow ipv6 gateway with ipv4 routesDavid Ahern2-8/+121
Add support for RTA_VIA and allow an IPv6 nexthop for v4 routes: $ ip ro add 172.16.1.0/24 via inet6 2001:db8::1 dev eth0 $ ip ro ls ... 172.16.1.0/24 via inet6 2001:db8::1 dev eth0 For convenience and simplicity, userspace can use RTA_VIA to specify AF_INET or AF_INET6 gateway. The common fib_nexthop_info dump function compares the gateway address family to the nh_common family to know if the gateway should be encoded as RTA_VIA or RTA_GATEWAY. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Flag fib_info with a fib_nh using IPv6 gatewayDavid Ahern1-0/+2
Until support is added to the offload drivers, they need to be able to reject routes with an IPv6 gateway. To that end add a flag to fib_info that indicates if any fib_nh has a v6 gateway. The flag allows the drivers to efficiently know the use of a v6 gateway without walking all fib_nh tied to a fib_info each time a route is added. Update mlxsw and rocker to reject the routes with extack message as to why. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Handle ipv6 gateway in fib_good_nhDavid Ahern1-2/+8
Update fib_good_nh to handle an ipv6 gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Handle ipv6 gateway in fib_detect_deathDavid Ahern1-1/+9
Update fib_detect_death to handle an ipv6 gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Handle ipv6 gateway in ipv4_confirm_neighDavid Ahern1-4/+6
Update ipv4_confirm_neigh to handle an ipv6 gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Add helpers for neigh lookup for nexthopDavid Ahern2-17/+23
A common theme in the output path is looking up a neigh entry for a nexthop, either the gateway in an rtable or a fallback to the daddr in the skb: nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr); neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(&arp_tbl, &nexthop, dev, false); To allow the nexthop to be an IPv6 address we need to consider the family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based on it. To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6 added in an earlier patch which handles: neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(&arp_tbl, &nexthop, dev, false); And then add a second one, ip_neigh_for_gw, that calls either ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway. Update the output paths in the VRF driver and core v4 code to use ip_neigh_for_gw simplifying the family based lookup and making both ready for a v6 nexthop. ipv4_neigh_lookup has a different need - the potential to resolve a passed in address in addition to any gateway in the rtable or skb. Since this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The difference between __neigh_create used by the helpers and neigh_create called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh and bump the refcnt on the neigh entry. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08neighbor: Add skip_cache argument to neigh_outputDavid Ahern1-1/+1
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor entry will exist in the v6 ndisc table and the cached header will contain the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to use the v6 neighbor entry, neigh_output needs to skip the cached header and just use the output callback for the neigh entry. A future patchset can look at expanding the hh_cache to handle 2 protocols. For now, IPv6 gateways with an IPv4 route will take the extra overhead of generating the header. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Add fib_check_nh_v6_gwDavid Ahern1-0/+27
Add helper to use fib6_nh_init to validate a nexthop spec with an IPv6 gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Refactor fib_check_nhDavid Ahern1-109/+125
fib_check_nh is currently huge covering multiple uses cases - device only, device + gateway, and device + gateway with ONLINK. The next patch adds validation checks for IPv6 which only further complicates it. So, break fib_check_nh into 2 helpers - one for gateway validation and one for device only. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Add support to fib_config for IPv6 gatewayDavid Ahern1-6/+23
Add support for an IPv6 gateway to fib_config. Since a gateway is either IPv4 or IPv6, make it a union with fc_gw4 where fc_gw_family decides which address is in use. Update current checks on family and gw4 to handle ipv6 as well. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Add support to rtable for ipv6 gatewayDavid Ahern2-5/+28
Add support for an IPv6 gateway to rtable. Since a gateway is either IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides which address is in use. When dumping the route data, encode an ipv6 nexthop using RTA_VIA. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Prepare fib_config for IPv6 gatewayDavid Ahern2-17/+31
Similar to rtable, fib_config needs to allow the gateway to be either an IPv4 or an IPv6 address. To that end, rename fc_gw to fc_gw4 to mean an IPv4 address and add fc_gw_family. Checks on 'is a gateway set' are changed to see if fc_gw_family is set. In the process prepare the code for a fc_gw_family == AF_INET6. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08ipv4: Prepare rtable for IPv6 gatewayDavid Ahern5-30/+34
To allow the gateway to be either an IPv4 or IPv6 address, remove rt_uses_gateway from rtable and replace with rt_gw_family. If rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway to rt_gw4 to represent the IPv4 version. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08net: Replace nhc_has_gw with nhc_gw_familyDavid Ahern2-16/+14
Allow the gateway in a fib_nh_common to be from a different address family than the outer fib{6}_nh. To that end, replace nhc_has_gw with nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family. Now nhc_family is used to know if the nh_common is part of a fib_nh or fib6_nh (used for container_of to get to route family specific data), and nhc_gw_family represents the address family for the gateway. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08netfilter: nf_tables: merge route type into coreFlorian Westphal3-98/+0
very little code, so it really doesn't make sense to have extra modules or even a kconfig knob for this. Merge them and make functionality available unconditionally. The merge makes inet family route support trivial, so add it as well here. Before: text data bss dec hex filename 835 832 0 1667 683 nft_chain_route_ipv4.ko 870 832 0 1702 6a6 nft_chain_route_ipv6.ko 111568 2556 529 114653 1bfdd nf_tables.ko After: text data bss dec hex filename 113133 2556 529 116218 1c5fa nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08datagram: remove rendundant 'peeked' argumentPaolo Abeni1-12/+7
After commit a297569fe00a ("net/udp: do not touch skb->peeked unless really needed") the 'peeked' argument of __skb_try_recv_datagram() and friends is always equal to !!'flags & MSG_PEEK'. Since such argument is really a boolean info, and the callers have already 'flags & MSG_PEEK' handy, we can remove it and clean-up the code a bit. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: use bit_spin_locks to protect hash bucket.NeilBrown1-1/+0
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when lockings costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend of the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned lock, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'blk' is used, together with correct locking protocol. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06tcp: remove redundant check on tskbColin Ian King1-5/+2
The non-null check on tskb is always false because it is in an else path of a check on tskb and hence tskb is null in this code block. This is check is therefore redundant and can be removed as well as the label coalesc. if (tsbk) { ... } else { ... if (unlikely(!skb)) { if (tskb) /* can never be true, redundant code */ goto coalesc; return; } } Addresses-Coverity: ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-25/+25
Minor comment merge conflict in mlx5. Staging driver has a fixup due to the skb->xmit_more changes in 'net-next', but was removed in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04tcp: Accept ECT on SYN in the presence of RFC8311Tilmans, Olivier (Nokia - BE/Antwerp)1-1/+6
Linux currently disable ECN for incoming connections when the SYN requests ECN and the IP header has ECT(0)/ECT(1) set, as some networks were reportedly mangling the ToS byte, hence could later trigger false congestion notifications. RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set one TCP control packets (including SYNs). The main benefit of this is the decreased probability of losing a SYN in a congested ECN-capable network (i.e., it avoids the initial 1s timeout). Additionally, this allows the development of newer TCP extensions, such as AccECN. This patch relaxes the previous check, by enabling ECN on incoming connections using SYN+ECT if at least one bit of the reserved flags of the TCP header is set. Such bit would indicate that the sender of the SYN is using a newer TCP feature than what the host implements, such as AccECN, and is thus implementing RFC8311. This enables end-hosts not supporting such extensions to still negociate ECN, and to have some of the benefits of using ECN on control packets. Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Suggested-by: Bob Briscoe <research@bobbriscoe.net> Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04tcp: Ensure DCTCP reacts to lossesKoen De Schepper1-18/+18
RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to loss episodes in the same way as conventional TCP". Currently, Linux DCTCP performs no cwnd reduction when losses are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets alpha to its maximal value if a RTO happens. This behavior is sub-optimal for at least two reasons: i) it ignores losses triggering fast retransmissions; and ii) it causes unnecessary large cwnd reduction in the future if the loss was isolated as it resets the historical term of DCTCP's alpha EWMA to its maximal value (i.e., denoting a total congestion). The second reason has an especially noticeable effect when using DCTCP in high BDP environments, where alpha normally stays at low values. This patch replace the clamping of alpha by setting ssthresh to half of cwnd for both fast retransmissions and RTOs, at most once per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter has been removed. The table below shows experimental results where we measured the drop probability of a PIE AQM (not applying ECN marks) at a bottleneck in the presence of a single TCP flow with either the alpha-clamping option enabled or the cwnd halving proposed by this patch. Results using reno or cubic are given for comparison. | Link | RTT | Drop TCP CC | speed | base+AQM | probability ==================|=========|==========|============ CUBIC | 40Mbps | 7+20ms | 0.21% RENO | | | 0.19% DCTCP-CLAMP-ALPHA | | | 25.80% DCTCP-HALVE-CWND | | | 0.22% ------------------|---------|----------|------------ CUBIC | 100Mbps | 7+20ms | 0.03% RENO | | | 0.02% DCTCP-CLAMP-ALPHA | | | 23.30% DCTCP-HALVE-CWND | | | 0.04% ------------------|---------|----------|------------ CUBIC | 800Mbps | 1+1ms | 0.04% RENO | | | 0.05% DCTCP-CLAMP-ALPHA | | | 18.70% DCTCP-HALVE-CWND | | | 0.06% We see that, without halving its cwnd for all source of losses, DCTCP drives the AQM to large drop probabilities in order to keep the queue length under control (i.e., it repeatedly faces RTOs). Instead, if DCTCP reacts to all source of losses, it can then be controlled by the AQM using similar drop levels than cubic or reno. Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Cc: Bob Briscoe <research@bobbriscoe.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <borkmann@iogearbox.net> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04net: use kfree_skb_list() from ip_do_fragment()Pablo Neira Ayuso1-5/+2
Just like 46cfd725c377 ("net: use kfree_skb_list() helper in more places"). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv6: Flip to fib_nexthop_infoDavid Ahern1-5/+17
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code. Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4 versions. Update fib_nexthop_info for IPv6 linkdown check and RTA_GATEWAY for AF_INET6. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Change fib_nexthop_info and fib_add_nexthop to take fib_nh_commonDavid Ahern1-20/+32
With the exception of the nexthop weight, the nexthop attributes used by fib_nexthop_info and fib_add_nexthop come from the fib_nh_common struct. Update both to use it and change fib_nexthop_info to check the family as needed. nexthop weight comes from the common struct for existing use cases, but for nexthop groups the weight is outside of the fib_nh_common to allow the same nexthop definition to be used in multiple groups with different weights. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Refactor nexthop attributes in fib_dump_infoDavid Ahern1-59/+107
Similar to ipv6, move addition of nexthop attributes to dump message into helpers that are called for both single path and multipath routes. Align the new helpers to the IPv6 variant which most notably means computing the flags argument based on settings in nh_flags. The RTA_FLOW argument is unique to IPv4, so it is appended after the new fib_nexthop_info helper. The intent of a later patch is to make both fib_nexthop_info and fib_add_nexthop usable for both IPv4 and IPv6. This patch is stepping stone in that direction. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Add fib_nh_common to fib_resultDavid Ahern5-33/+72
Most of the ipv4 code only needs data from fib_nh_common. Add fib_nh_common selection to fib_result and update users to use it. Right now, fib_nh_common in fib_result will point to a fib_nh struct that is embedded within a fib_info: fib_info --> fib_nh fib_nh ... fib_nh ^ fib_result->nhc ----+ Later, nhc can point to a fib_nh within a nexthop struct: fib_info --> nexthop --> fib_nh ^ fib_result->nhc ---------------+ or for a nexthop group: fib_info --> nexthop --> nexthop --> fib_nh nexthop --> fib_nh ... nexthop --> fib_nh ^ fib_result->nhc ---------------------------+ In all cases nhsel within fib_result will point to which leg in the multipath route is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Update fib_table_lookup tracepoint to take common nexthopDavid Ahern1-1/+1
Update fib_table_lookup tracepoint to take a fib_nh_common struct and dump the v6 gateway address if the nexthop uses it. Over the years saddr has not proven useful and the output of the tracepoint produces very long lines. Since saddr is not part of fib_nh_common, drop it. If it needs to be added later, fib_nh which contains saddr can be obtained from a fib_nh_common via container_of. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>