path: root/net/ipv4
Age | Commit message | Author | Files | Lines
2019-02-06 | net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID | Florian Fainelli | 1 | -14/+5
Now that we have a dedicated NDO for getting a port's parent ID, get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID and convert all callers to use the NDO exclusively. This is a preliminary change to getting rid of switchdev_ops eventually. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 | net: Introduce ndo_get_port_parent_id() | Florian Fainelli | 1 | -1/+7
In preparation for getting rid of switchdev_ops, create a dedicated NDO operation for getting the port's parent identifier. There are essentially two classes of drivers that need to implement getting the port's parent ID which are VF/PF drivers with a built-in switch, and pure switchdev drivers such as mlxsw, ocelot, dsa etc. We introduce a helper function: dev_get_port_parent_id() which supports recursion into the lower devices to obtain the first port's parent ID. Convert the bridge, core and ipv4 multicast routing code to check for such ndo_get_port_parent_id() and call the helper function when valid before falling back to switchdev_port_attr_get(). This will allow us to convert all relevant drivers in one go instead of having to implement both switchdev_port_attr_get() and ndo_get_port_parent_id() operations, then get rid of switchdev_port_attr_get(). Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | net: Fix ip_mc_{dec,inc}_group allocation context | Florian Fainelli | 1 | -11/+24
After 4effd28c1245 ("bridge: join all-snoopers multicast address"), I started seeing the following sleep in atomic warnings:

[ 26.763893] BUG: sleeping function called from invalid context at mm/slab.h:421
[ 26.771425] in_atomic(): 1, irqs_disabled(): 0, pid: 1658, name: sh
[ 26.777855] INFO: lockdep is turned off.
[ 26.781916] CPU: 0 PID: 1658 Comm: sh Not tainted 5.0.0-rc4 #20
[ 26.787943] Hardware name: BCM97278SV (DT)
[ 26.792118] Call trace:
[ 26.794645] dump_backtrace+0x0/0x170
[ 26.798391] show_stack+0x24/0x30
[ 26.801787] dump_stack+0xa4/0xe4
[ 26.805182] ___might_sleep+0x208/0x218
[ 26.809102] __might_sleep+0x78/0x88
[ 26.812762] kmem_cache_alloc_trace+0x64/0x28c
[ 26.817301] igmp_group_dropped+0x150/0x230
[ 26.821573] ip_mc_dec_group+0x1b0/0x1f8
[ 26.825585] br_ip4_multicast_leave_snoopers.isra.11+0x174/0x190
[ 26.831704] br_multicast_toggle+0x78/0xcc
[ 26.835887] store_bridge_parm+0xc4/0xfc
[ 26.839894] multicast_snooping_store+0x3c/0x4c
[ 26.844517] dev_attr_store+0x44/0x5c
[ 26.848262] sysfs_kf_write+0x50/0x68
[ 26.852006] kernfs_fop_write+0x14c/0x1b4
[ 26.856102] __vfs_write+0x60/0x190
[ 26.859668] vfs_write+0xc8/0x168
[ 26.863059] ksys_write+0x70/0xc8
[ 26.866449] __arm64_sys_write+0x24/0x30
[ 26.870458] el0_svc_common+0xa0/0x11c
[ 26.874291] el0_svc_handler+0x38/0x70
[ 26.878120] el0_svc+0x8/0xc

while toggling the bridge's multicast_snooping attribute dynamically.

Pass a gfp_t down to igmpv3_add_delrec(), introduce __igmp_group_dropped() and introduce __ip_mc_dec_group() to take a gfp_t argument. Similarly introduce ____ip_mc_inc_group() and __ip_mc_inc_group() to allow caller to specify gfp_t. IPv6 part of the patch appears fine.

Fixes: 4effd28c1245 ("bridge: join all-snoopers multicast address") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | socket: Add SO_TIMESTAMPING_NEW | Deepa Dinamani | 1 | -13/+17
Add the SO_TIMESTAMPING_NEW variant of the socket timestamp options. This is the y2038-safe version of SO_TIMESTAMPING_OLD for all architectures. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: chris@zankel.net Cc: fenghua.yu@intel.com Cc: rth@twiddle.net Cc: tglx@linutronix.de Cc: ubraun@linux.ibm.com Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | socket: Add SO_TIMESTAMP[NS]_NEW | Deepa Dinamani | 1 | -8/+25
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of socket timestamp options. These are the y2038 safe versions of the SO_TIMESTAMP_OLD and SO_TIMESTAMPNS_OLD for all architectures. Note that the format of scm_timestamping.ts[0] is not changed in this patch. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | socket: Use old_timeval types for socket timestamps | Deepa Dinamani | 1 | -1/+1
As part of y2038 solution, all internal uses of struct timeval are replaced by struct __kernel_old_timeval and struct compat_timeval by struct old_timeval32. Make socket timestamps use these new types. This is mainly to be able to verify that the kernel build is y2038 safe when such non y2038 safe types are not supported anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: isdn@linux-pingi.de Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | sockopt: Rename SO_TIMESTAMP* to SO_TIMESTAMP*_OLD | Deepa Dinamani | 1 | -3/+3
SO_TIMESTAMP, SO_TIMESTAMPNS and SO_TIMESTAMPING options, the way they are currently defined, are not y2038 safe. Subsequent patches in the series add new y2038 safe versions of these options which provide 64 bit timestamps on all architectures uniformly. Hence, rename existing options with OLD tag suffixes. Also note that kernel will not use the untagged SO_TIMESTAMP* and SCM_TIMESTAMP* options internally anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: deller@gmx.de Cc: dhowells@redhat.com Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-afs@lists.infradead.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 | ipv4/igmp: Don't drop IGMP pkt with zeros src addr | Edward Chron | 1 | -1/+2
Don't drop IGMP packets with a source address of all zeros which are IGMP proxy reports. This is documented in Section 2.1.1 IGMP Forwarding Rules of RFC 4541 IGMP and MLD Snooping Switches Considerations. Signed-off-by: Edward Chron <echron@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 | ipconfig: add carrier_timeout kernel parameter | Martin Kepplinger | 1 | -5/+22
commit 3fb72f1e6e61 ("ipconfig wait for carrier") added a "wait for carrier" policy, with a fixed worst case maximum wait of two minutes. Now make the wait for carrier timeout configurable on the kernel commandline and use the 120s as the default. The timeout messages introduced with commit 5e404cd65860 ("ipconfig: add informative timeout messages while waiting for carrier") are done in a fixed interval of 20 seconds, just like they were before (240/12). Signed-off-by: Martin Kepplinger <martin.kepplinger@ginzinger.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 | ipv4: fib: use struct_size() in kzalloc() | Gustavo A. R. Silva | 1 | -1/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example:

    struct foo {
        int stuff;
        struct boo entry[];
    };

    instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper:

    instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
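The difference between the open-coded sum and struct_size() shows up when `count` is large enough to overflow. The sketch below is a small Python model of that arithmetic, not kernel code: the function names and the byte sizes are illustrative assumptions, a 64-bit size_t is assumed, and the saturating behaviour reflects how the helper is generally documented to behave.

```python
# Model of the size arithmetic only; names and sizes are hypothetical.
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def open_coded_size(header_size: int, elem_size: int, count: int) -> int:
    """Mimic `sizeof(struct foo) + count * sizeof(struct boo)` in size_t math."""
    # Unsigned C arithmetic wraps around silently on overflow.
    return (header_size + count * elem_size) & SIZE_MAX


def struct_size_model(header_size: int, elem_size: int, count: int) -> int:
    """Same sum, but saturate at SIZE_MAX instead of wrapping."""
    total = header_size + count * elem_size  # Python ints do not overflow
    return SIZE_MAX if total > SIZE_MAX else total


HEADER = 8   # illustrative size of the fixed part of struct foo
ELEM = 16    # illustrative size of one struct boo

# For a sane count both computations agree.
small = struct_size_model(HEADER, ELEM, 3)

# For a huge count the open-coded sum wraps to a tiny value, while the
# saturated result is an impossible size that an allocator would reject.
huge_count = 2**60 + 1
wrapped = open_coded_size(HEADER, ELEM, huge_count)
saturated = struct_size_model(HEADER, ELEM, huge_count)
```

A wrapped size is the dangerous case: the allocation succeeds but is far smaller than the caller believes, whereas a saturated size simply makes the allocation fail.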
2019-01-30 | net: ip_gre: always reports o_key to userspace | Lorenzo Bianconi | 1 | -1/+6
Erspan protocol (version 1 and 2) relies on o_key to configure session id header field. However TUNNEL_KEY bit is cleared in erspan_xmit since ERSPAN protocol does not set the key field of the external GRE header and so the configured o_key is not reported to userspace. The issue can be triggered with the following reproducer:

    $ ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
        key 1 seq erspan_ver 1
    $ ip link set erspan1 up
    $ ip -d link sh erspan1
    erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
        link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
        erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0

Fix the issue by adding the TUNNEL_KEY bit to the o_flags parameter in ipgre_fill_info. Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 | netfilter: ipv4: remove useless export_symbol | Florian Westphal | 1 | -18/+0
Only one caller; place it where needed and get rid of the EXPORT_SYMBOL. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 | esp: Skip TX bytes accounting when sending from a request socket | Martin Willi | 1 | -1/+1
On ESP output, sk_wmem_alloc is incremented for the added padding if a socket is associated to the skb. When replying with TCP SYNACKs over IPsec, the associated sk is only a casted request socket. Increasing sk_wmem_alloc on a request socket results in a write at an arbitrary struct offset. In the best case, this produces the following WARNING:

WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
refcount_t: addition on 0; use-after-free.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
Hardware name: Marvell Armada 380/385 (Device Tree)
[...]
[<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
[<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
[<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
[<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
[<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
[<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
[<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
[<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
[<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
[<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
[<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
[<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
[<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
[...]

The issue triggers only when not using TCP syncookies, as for syncookies no socket is associated.

Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-01-28 | netfilter: ipt_CLUSTERIP: fix warning unused variable cn | Anders Roxell | 1 | -1/+1
When CONFIG_PROC_FS isn't set the variable cn isn't used.

net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
  struct clusterip_net *cn = clusterip_pernet(net);
                        ^~

Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS". Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-27 | tcp: change pingpong threshold to 3 | Wei Wang | 1 | -6/+9
In order to be more confident about an on-going interactive session, we increment pingpong count by 1 for every interactive transaction and we adjust TCP_PINGPONG_THRESH to 3. This means, we only consider a session in pingpong mode after we see 3 interactive transactions, and start to activate delayed acks in quick ack mode. And in order to not over-count the credits, we only increase pingpong count for the first packet sent in response for the previous received packet. This is mainly to prevent delaying the ack immediately after some handshake protocol but no real interactive traffic pattern afterwards. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
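The counting rule described in this message can be modelled in a few lines. This is a rough Python sketch rather than the kernel implementation: the class, its field names, and the use of the delayed-ACK timeout (`ato`) as the test for a "prompt" reply are assumptions made for illustration; only TCP_PINGPONG_THRESH and the "first packet sent in response" rule come from the commit text.

```python
from dataclasses import dataclass

TCP_PINGPONG_THRESH = 3  # threshold named in the commit message


@dataclass
class PingPongState:
    """Hypothetical stand-in for the per-socket delayed-ACK bookkeeping."""
    pingpong: int = 0        # number of interactive transactions seen
    last_rcv_time: int = -1  # when data was last received
    last_snd_time: int = -1  # when data was last sent

    def on_data_received(self, now: int) -> None:
        self.last_rcv_time = now

    def on_data_sent(self, now: int, ato: int) -> None:
        # Credit only the first packet sent after the last received packet,
        # and only if it went out promptly (assumed here: within `ato`).
        first_reply = self.last_snd_time < self.last_rcv_time
        prompt = (now - self.last_rcv_time) < ato
        if first_reply and prompt:
            self.pingpong += 1
        self.last_snd_time = now

    def in_pingpong_mode(self) -> bool:
        return self.pingpong >= TCP_PINGPONG_THRESH


state = PingPongState()
state.on_data_received(now=0)
state.on_data_sent(now=1, ato=40)   # first reply: counted
state.on_data_sent(now=2, ato=40)   # second packet of the same reply: ignored
count_after_first_exchange = state.pingpong

for t in (10, 20):                   # two more request/response exchanges
    state.on_data_received(now=t)
    state.on_data_sent(now=t + 1, ato=40)
```

With a threshold of 3, a single handshake-style exchange no longer switches the connection into pingpong mode; three prompt exchanges do.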
2019-01-27 | tcp: Refactor pingpong code | Wei Wang | 5 | -14/+14
Instead of using pingpong as a single bit information, we refactor the code to treat it as a counter. When interactive session is detected, we set pingpong count to TCP_PINGPONG_THRESH. And when pingpong count is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode. This patch is a pure refactor and sets foundation for the next patch. This patch itself does not change any pingpong logic. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27 | net: ipv4: ip_input: fix blank line coding style issues | Yang Wei | 1 | -1/+1
Fix blank line coding style issues, make the code cleaner. Remove a redundant blank line in ip_rcv_core(). Insert a blank line in ip_rcv() between different statement blocks. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-26 | ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit | wenxu | 1 | -93/+19
Refactor collect metadata mode tunnel xmit to the generic xmit function ip_md_tunnel_xmit. This makes the code more generic and supports more features, such as pmtu_update, through ip_md_tunnel_xmit. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-26 | ip_tunnel: Fix route fl4 init in ip_md_tunnel_xmit | wenxu | 1 | -2/+3
Init the gre_key from tuninfo->key.tun_id and init the mark from the skb->mark, set the oif to zero in the collect metadata mode. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-26 | ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit | wenxu | 2 | -11/+28
Add tnl_update_pmtu in ip_md_tunnel_xmit to dynamically update the PMTU for packets sent through a collect_metadata mode IP tunnel. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-26 | ip_tunnel: Add ip tunnel dst_cache in ip_md_tunnel_xmit | wenxu | 1 | -5/+15
Add an IP tunnel dst cache in ip_md_tunnel_xmit to make the route lookup more efficient. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 | tcp: allow zerocopy with fastopen | Willem de Bruijn | 2 | -7/+5
Accept MSG_ZEROCOPY in all the TCP states that allow sendmsg. Remove the explicit check for ESTABLISHED and CLOSE_WAIT states. This requires correctly handling zerocopy state (uarg, sk_zckey) in all paths reachable from other TCP states. Such as the EPIPE case in sk_stream_wait_connect, which a sendmsg() in incorrect state will now hit. Most paths are already safe. Only extension needed is for TCP Fastopen active open. This can build an skb with data in tcp_send_syn_data. Pass the uarg along with other fastopen state, so that this skb also generates a zerocopy notification on release. Tested with active and passive tcp fastopen packetdrill scripts at https://github.com/wdebruij/packetdrill/commit/1747eef03d25a2404e8132817d0f1244fd6f129d Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 | net: IP defrag: encapsulate rbtree defrag code into callable functions | Peter Oskolkov | 2 | -262/+320
This is a refactoring patch: without changing runtime behavior, it moves rbtree-related code from IPv4-specific files/functions into .h/.c defrag files shared with IPv6 defragmentation code. Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-24 | tcp_bbr: adapt cwnd based on ack aggregation estimation | Priyaranjan Jha | 1 | -1/+121
Aggregation effects are extremely common with wifi, cellular, and cable modem link technologies, ACK decimation in middleboxes, and LRO and GRO in receiving hosts. The aggregation can happen in either direction, data or ACKs, but in either case the aggregation effect is visible to the sender in the ACK stream. Previously BBR's sending was often limited by cwnd under severe ACK aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets were acked in bursts after long delays (e.g. one ACK acking 5*BDP after 5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving the bottleneck idle for potentially long periods. Note that loss-based congestion control does not have this issue because when facing aggregation it continues increasing cwnd after bursts of ACKs, growing cwnd until the buffer is full. To achieve good throughput in the presence of aggregation effects, this algorithm allows the BBR sender to put extra data in flight to keep the bottleneck utilized during silences in the ACK stream that it has evidence to suggest were caused by aggregation. A summary of the algorithm: when a burst of packets are acked by a stretched ACK or a burst of ACKs or both, BBR first estimates the expected amount of data that should have been acked, based on its estimated bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max filter to estimate the recent level of observed ACK aggregation. Then cwnd is increased by the ACK aggregation estimate. The larger cwnd avoids BBR being cwnd-limited in the face of ACK silences that recent history suggests were caused by aggregation. As a sanity check, the ACK aggregation degree is upper-bounded by the cwnd (at the time of measurement) and a global max of BW * 100ms. 
The algorithm is further described by the following presentation: https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00 In our internal testing, we observed a significant increase in BBR throughput (measured using netperf), in a basic wifi setup. - Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi) - 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps - 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps Also, this code is running globally on YouTube TCP connections and produced significant bandwidth increases for YouTube traffic. This is based on Ian Swett's max_ack_height_ algorithm from the QUIC BBR implementation. Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
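The summary of the algorithm given in this message (expected versus actually acked data, a windowed-max filter, and the two upper bounds) can be sketched as a simplified Python model. The class name, the sample-count window, and the units (packets and milliseconds) are assumptions chosen for readability; the real code tracks epochs and uses its own filter, so this illustrates the idea and nothing more.

```python
from collections import deque

MAX_AGGREGATION_MS = 100  # global bound from the message: BW * 100ms


class AckAggregationEstimator:
    """Toy model: units are packets and milliseconds."""

    def __init__(self, window: int = 10) -> None:
        # Hypothetical filter window: remember the last `window` samples.
        self.samples = deque(maxlen=window)

    def on_ack(self, bw_pkts_per_ms: int, elapsed_ms: int,
               acked_pkts: int, cwnd_pkts: int) -> int:
        # Data that should have been acked at the estimated bandwidth.
        expected = bw_pkts_per_ms * elapsed_ms
        # The surplus over that expectation is the aggregation sample.
        extra_acked = max(0, acked_pkts - expected)
        # Sanity bound: no more than the cwnd at the time of measurement.
        extra_acked = min(extra_acked, cwnd_pkts)
        self.samples.append(extra_acked)
        return extra_acked

    def extra_cwnd(self, bw_pkts_per_ms: int) -> int:
        if not self.samples:
            return 0
        # Windowed max, capped by the global bound of BW * 100ms.
        return min(max(self.samples), bw_pkts_per_ms * MAX_AGGREGATION_MS)


def target_cwnd(bdp_pkts: int, est: AckAggregationEstimator,
                bw_pkts_per_ms: int) -> int:
    # 2*BDP as before, plus headroom for the observed ACK aggregation.
    return 2 * bdp_pkts + est.extra_cwnd(bw_pkts_per_ms)


est = AckAggregationEstimator()
burst = est.on_ack(bw_pkts_per_ms=1, elapsed_ms=50, acked_pkts=80, cwnd_pkts=200)
quiet = est.on_ack(bw_pkts_per_ms=1, elapsed_ms=50, acked_pkts=40, cwnd_pkts=200)
cwnd = target_cwnd(bdp_pkts=100, est=est, bw_pkts_per_ms=1)
```

In this run the burst acks 30 packets more than the bandwidth estimate predicts, so the target cwnd grows from 2*BDP by that amount and stays there while the sample remains in the filter window.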
2019-01-24 | tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning | Priyaranjan Jha | 1 | -21/+39
Because bbr_target_cwnd() is really a general-purpose BBR helper for computing some volume of inflight data as a function of the estimated BDP, refactor it into following helper functions: - bbr_bdp() - bbr_quantization_budget() - bbr_inflight() Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-24 | ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel | wenxu | 1 | -1/+7
    ip l add dev tun type gretap key 1000
    ip a a dev tun 10.0.0.1/24

Packets with tun-id 1000 can be received by the tun dev, but packets can't be sent through dev tun because it has no tunnel dst. With this patch, the tunnel dst can be obtained through lwtunnel, as below:

    ip r a 10.0.0.7 encap ip dst 172.168.0.11 dev tun

Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 | bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() internals | Linus Lüssing | 1 | -29/+22
With this patch the internal use of the skb_trimmed is reduced to the ICMPv6/IGMP checksum verification. And for the length checks the newly introduced helper functions are used instead of calculating and checking with skb->len directly. These changes should hopefully make it easier to verify that length checks are performed properly. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 | bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls | Linus Lüssing | 1 | -19/+4
This patch refactors ip_mc_check_igmp(), ipv6_mc_check_mld() and their callers (more precisely, the Linux bridge) to not rely on the skb_trimmed parameter anymore. An skb with its tail trimmed to the IP packet length was initially introduced for the following three reasons:

1) To be able to verify the ICMPv6 checksum.
2) To be able to distinguish the version of an IGMP or MLD query. They are distinguishable only by their size.
3) To avoid parsing data for an IGMPv3 or MLDv2 report that is beyond the IP packet but still within the skb.

The first case still uses a cloned and potentially trimmed skb to verify. However, there is no need to propagate it to the caller. For the second and third case explicit IP packet length checks were added. This hopefully makes ip_mc_check_igmp() and ipv6_mc_check_mld() easier to read and verify, as well as easier to use. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
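Reason 2 is why the length has to come from the IP header rather than from the buffer: a padded frame makes a short query look longer than it is. The Python sketch below illustrates this using the query sizes from RFC 3376, section 7.1; the function names and the example frame sizes are hypothetical and this is not the kernel's code.

```python
from typing import Optional

IGMP_V12_QUERY_LEN = 8      # IGMPv1 and IGMPv2 queries are exactly 8 octets
IGMP_V3_QUERY_MIN_LEN = 12  # IGMPv3 queries are at least 12 octets


def igmp_query_version(igmp_len: int, max_resp_code: int) -> Optional[int]:
    """Classify a Membership Query by IGMP message length (RFC 3376, 7.1)."""
    if igmp_len == IGMP_V12_QUERY_LEN:
        return 1 if max_resp_code == 0 else 2
    if igmp_len >= IGMP_V3_QUERY_MIN_LEN:
        return 3
    return None  # any other length is not a valid query


def igmp_len_from_ip(ip_total_len: int, ip_header_len: int) -> int:
    """IGMP message length taken from the IP header, not the buffer size."""
    return ip_total_len - ip_header_len


# Hypothetical IGMPv2 query: 24-byte IP header (with Router Alert option)
# plus an 8-byte IGMP message, padded by the link layer to 46 bytes.
ip_header_len, ip_total_len, buffer_len = 24, 32, 46

by_ip_length = igmp_query_version(
    igmp_len_from_ip(ip_total_len, ip_header_len), max_resp_code=100)
by_buffer_length = igmp_query_version(
    buffer_len - ip_header_len, max_resp_code=100)
```

Here the IP length yields version 2, while the untrimmed buffer length would misclassify the same packet as an IGMPv3 query, which is what the explicit IP packet length checks guard against.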
2019-01-22 | net: ip_gre: use erspan key field for tunnel lookup | Lorenzo Bianconi | 2 | -9/+17
Use the ERSPAN key header field as the tunnel key in the gre_parse_header routine, since the ERSPAN protocol sets the key field of the external GRE header to 0, resulting in a tunnel lookup failure in ip6gre_err. In addition, remove key field parsing and the pskb_may_pull check in erspan_rcv and ip6erspan_rcv. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 | net: introduce a knob to control whether to inherit devconf config | Cong Wang | 1 | -23/+20
There have been many people complaining about the inconsistent behaviors of IPv4 and IPv6 devconf when creating new network namespaces. Currently, for IPv4, we inherit all current settings from init_net, but for IPv6 we reset all settings to default. This patch introduces a new /proc file /proc/sys/net/core/devconf_inherit_init_net to control whether to inherit the current sysctl settings from init_net. This file itself is only available in init_net. As demonstrated below:

Initial setup in init_net:
# cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# cat /proc/sys/net/ipv6/conf/all/accept_dad
1

Default value 0 (current behavior):
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
0

Set to 1 (inherit from init_net):
# echo 1 > /proc/sys/net/core/devconf_inherit_init_net
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
2
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
1

Set to 2 (reset to default):
# echo 2 > /proc/sys/net/core/devconf_inherit_init_net
# ip netns del test
# ip netns add test
# ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
0
# ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
0

Set to a value out of range (invalid):
# echo 3 > /proc/sys/net/core/devconf_inherit_init_net
-bash: echo: write error: Invalid argument
# echo -1 > /proc/sys/net/core/devconf_inherit_init_net
-bash: echo: write error: Invalid argument

Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
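The three modes demonstrated above reduce to a small decision table. The Python sketch below models it using the two sysctls and the values from the demonstration; the dictionary keys and the function name are made up for illustration and nothing here touches /proc.

```python
# Values taken from the demonstration in the commit message.
INIT_NET = {"ipv4.rp_filter": 2, "ipv6.accept_dad": 1}  # current init_net settings
DEFAULTS = {"ipv4.rp_filter": 0, "ipv6.accept_dad": 0}  # values shown after a reset


def new_netns_devconf(mode: int) -> dict:
    """Settings a new namespace starts with for a given knob value."""
    if mode not in (0, 1, 2):
        # Out-of-range writes are rejected with "Invalid argument".
        raise ValueError("Invalid argument")
    conf = {}
    for key in INIT_NET:
        # 0: IPv4 inherits from init_net, IPv6 resets (the old behavior)
        # 1: both families inherit from init_net
        # 2: both families reset to defaults
        inherit = mode == 1 or (mode == 0 and key.startswith("ipv4."))
        conf[key] = INIT_NET[key] if inherit else DEFAULTS[key]
    return conf


legacy = new_netns_devconf(0)
inherit_all = new_netns_devconf(1)
reset_all = new_netns_devconf(2)
```

Mode 0 preserves the historical asymmetry, so existing setups see no change unless the knob is written explicitly.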
2019-01-19 | net: ipv4: ipmr: perform strict checks also for doit handlers | Jakub Kicinski | 1 | -5/+56
Make RTM_GETROUTE's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. v2: - improve extack messages (DaveA). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-19 | net: ipv4: route: perform strict checks also for doit handlers | Jakub Kicinski | 1 | -2/+70
Make RTM_GETROUTE's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. v2: - new patch (DaveA). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-19 | net: ipv4: netconf: perform strict checks also for doit handlers | Jakub Kicinski | 1 | -4/+39
Make RTM_GETNETCONF's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 | net: Fix usage of pskb_trim_rcsum | Ross Lagerwall | 1 | -0/+1
In certain cases, pskb_trim_rcsum() may change skb pointers. Reinitialize header pointers afterwards to avoid potential use-after-frees. Add a note in the documentation of pskb_trim_rcsum(). Found by KASAN. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 | netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups | Florian Westphal | 1 | -1/+1
After removal of the packet and invert function pointers, several places no longer need to look up the l4proto structure. Remove those lookups. The function nf_ct_invert_tuplepr becomes redundant; replace it with nf_ct_invert_tuple everywhere. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-17 | tcp: move rx_opt & syn_data_acked init to tcp_disconnect() | Eric Dumazet | 2 | -6/+4
If we make sure all listeners have these fields cleared, then a clone will also inherit zero values. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move tp->rack init to tcp_disconnect() | Eric Dumazet | 2 | -6/+6
If we make sure all listeners have a proper tp->rack value, then a clone will also inherit a proper initial value. Note that fresh sockets init tp->rack from tcp_init_sock(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move app_limited init to tcp_disconnect() | Eric Dumazet | 2 | -3/+3
If we make sure all listeners have app_limited set to ~0U, then a clone will also inherit proper initial value. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move retrans_out, sacked_out, tlp_high_seq, last_oow_ack_time init to tcp_disconnect() | Eric Dumazet | 2 | -4/+4
If we make sure all listeners have these fields cleared, then a clone will also inherit zero values. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: do not clear urg_data in tcp_create_openreq_child | Eric Dumazet | 1 | -2/+0
All listeners have this field cleared already, since tcp_disconnect() clears it and newly created sockets have also a zero value here. So a clone will inherit a zero value here. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move snd_cwnd & snd_cwnd_cnt init to tcp_disconnect() | Eric Dumazet | 2 | -9/+1
Passive connections can inherit proper value by cloning, if we make sure all listeners have the proper values there. tcp_disconnect() was setting snd_cwnd to 2, which seems quite obsolete since IW10 adoption. Also remove an obsolete comment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move mdev_us init to tcp_disconnect() | Eric Dumazet | 2 | -1/+1
If we make sure a listener always has its mdev_us field set to TCP_TIMEOUT_INIT, we do not need to rewrite this field after a new clone is created. tcp_disconnect() is very seldom used in real applications. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: do not clear srtt_us in tcp_create_openreq_child | Eric Dumazet | 1 | -1/+0
All listeners have this field cleared already, since tcp_disconnect() clears it and newly created sockets have also a zero value here. So a clone will inherit a zero value here. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: do not clear packets_out in tcp_create_openreq_child() | Eric Dumazet | 1 | -1/+0
New sockets have this field cleared, and tcp_disconnect() calls tcp_write_queue_purge(), which among other things also clears tp->packets_out. So a listener is guaranteed to have this field cleared. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: move icsk_rto init to tcp_disconnect() | Eric Dumazet | 2 | -1/+1
If we make sure a listener always has its icsk_rto field set to TCP_TIMEOUT_INIT, we do not need to rewrite this field after a new clone is created. tcp_disconnect() is very seldom used in real applications. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: do not set snd_ssthresh in tcp_create_openreq_child() | Eric Dumazet | 1 | -1/+0
New sockets get the field set to TCP_INFINITE_SSTHRESH in tcp_init_sock(). In case a socket had this field changed and transitions to TCP_LISTEN state, tcp_disconnect() also makes sure snd_ssthresh is set to TCP_INFINITE_SSTHRESH. So a listener has this field set to TCP_INFINITE_SSTHRESH already. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: less aggressive window probing on local congestion | Yuchung Cheng | 1 | -15/+7
Previously when the sender fails to send (original) data packet or window probes due to congestion in the local host (e.g. throttling in qdisc), it'll retry within an RTO or two up to 500ms. In low-RTT networks such as data-centers, RTO is often far below the default minimum 200ms. Then local host congestion could trigger a retry storm pouring gas to the fire. Worse yet, the probe counter (icsk_probes_out) is not properly updated so the aggressive retry may exceed the system limit (15 rounds) until the packet finally slips through. On such rare events, it's wise to retry more conservatively (500ms) and update the stats properly to reflect these incidents and follow the system limit. Note that this is consistent with the behaviors when a keep-alive probe or RTO retry is dropped due to local congestion. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: retry more conservatively on local congestion | Yuchung Cheng | 1 | -5/+3
Previously when the sender fails to retransmit a data packet on timeout due to congestion in the local host (e.g. throttling in qdisc), it'll retry within an RTO up to 500ms. In low-RTT networks such as data-centers, RTO is often far below the default minimum 200ms (and the cap 500ms). Then local host congestion could trigger a retry storm pouring gas to the fire. Worse yet, the retry counter (icsk_retransmits) is not properly updated so the aggressive retry may exceed the system limit (15 rounds) until the packet finally slips through. On such rare events, it's wise to retry more conservatively (500ms) and update the stats properly to reflect these incidents and follow the system limit. Note that this is consistent with the behavior when a keep-alive probe is dropped due to local congestion. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 | tcp: simplify window probe aborting on USER_TIMEOUT | Yuchung Cheng | 1 | -7/+7
Previously we use the next unsent skb's timestamp to determine when to abort a socket stalling on window probes. This no longer works as skb timestamp reflects the last instead of the first transmission. Instead we can estimate how long the socket has been stalling with the probe count and the exponential backoff behavior. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
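Estimating the stall time from the probe count amounts to summing the backed-off timeouts. The sketch below models that estimate in Python under stated assumptions: each timeout doubles from a base value and is capped at 120 s (the usual TCP_RTO_MAX), and the function names are illustrative rather than taken from the kernel.

```python
TCP_RTO_MAX_MS = 120_000  # assumption: the usual 120 s cap on a single timeout


def backoff_elapsed_ms(retries: int, rto_base_ms: int,
                       rto_max_ms: int = TCP_RTO_MAX_MS) -> int:
    """Time covered by the initial timeout plus `retries` doubled timeouts."""
    # Each timeout is twice the previous one until it reaches the cap.
    return sum(min(rto_base_ms << i, rto_max_ms) for i in range(retries + 1))


def should_abort(probes_sent: int, rto_base_ms: int, user_timeout_ms: int) -> bool:
    """Abort once the modelled stall time reaches the user timeout."""
    return backoff_elapsed_ms(probes_sent, rto_base_ms) >= user_timeout_ms


# With a 200 ms base: 200 + 400 + 800 + 1600 ms after three backoffs.
elapsed_after_3 = backoff_elapsed_ms(3, 200)
# Beyond the cap the growth becomes linear: each extra retry adds 120 s.
elapsed_after_9 = backoff_elapsed_ms(9, 200)
elapsed_after_10 = backoff_elapsed_ms(10, 200)
```

Because the estimate depends only on the probe count and the base timeout, it no longer relies on the skb timestamp, which the message notes now reflects the last transmission instead of the first.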
2019-01-17 | tcp: create a helper to model exponential backoff | Yuchung Cheng | 1 | -13/+18
Create a helper to model TCP exponential backoff for the next patch. This is a pure refactor with no behavior change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>