aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-06-16networking: make skb_pull & friends return void pointersJohannes Berg2-4/+5
It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_pull, __skb_pull, skb_pull_inline, __pskb_pull_tail, __pskb_pull, pskb_pull }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_pull, __skb_pull, skb_pull_inline, __pskb_pull_tail, __pskb_pull, pskb_pull }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16networking: make skb_put & friends return void pointersJohannes Berg5-11/+11
It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions (skb_put, __skb_put and pskb_put) return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_put, __skb_put }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_put, __skb_put }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) which actually doesn't cover pskb_put since there are only three users overall. A handful of stragglers were converted manually, notably a macro in drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many instances in net/bluetooth/hci_sock.c. In the former file, I also had to fix one whitespace problem spatch introduced. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16networking: convert many more places to skb_put_zero()Johannes Berg2-4/+2
There were many places that my previous spatch didn't find, as pointed out by yuan linyu in various patches. The following spatch found many more and also removes the now unnecessary casts: @@ identifier p, p2; expression len; expression skb; type t, t2; @@ ( -p = skb_put(skb, len); +p = skb_put_zero(skb, len); | -p = (t)skb_put(skb, len); +p = skb_put_zero(skb, len); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, len); | -memset(p, 0, len); ) @@ type t, t2; identifier p, p2; expression skb; @@ t *p; ... ( -p = skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); | -p = (t *)skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, sizeof(*p)); | -memset(p, 0, sizeof(*p)); ) @@ expression skb, len; @@ -memset(skb_put(skb, len), 0, len); +skb_put_zero(skb, len); Apply it to the tree (with one manual fixup to keep the comment in vxlan.c, which spatch removed.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functionsDave Watson2-2/+4
Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to sendpages while the socket is already locked. tcp_sendpage is exported, but requires the socket lock to not be held already. Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15tcp: ULP infrastructureDave Watson5-1/+190
Add the infrustructure for attaching Upper Layer Protocols (ULPs) over TCP sockets. Based on a similar infrastructure in tcp_cong. The idea is that any ULP can add its own logic by changing the TCP proto_ops structure to its own methods. Example usage: setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")); modules will call: tcp_register_ulp(&tcp_tls_ulp_ops); to register/unregister their ulp, with an init function and name. A list of registered ulps will be returned by tcp_get_available_ulp, which is hooked up to /proc. Example: $ cat /proc/sys/net/ipv4/tcp_available_ulp tls There is currently no functionality to remove or chain ULPs, but it should be possible to add these in the future if needed. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-30/+37
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14net: don't global ICMP rate limit packets originating from loopbackJesper Dangaard Brouer1-2/+6
Florian Weimer seems to have a glibc test-case which requires that loopback interfaces does not get ICMP ratelimited. This was broken by commit c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate limited"). An ICMP response will usually be routed back-out the same incoming interface. Thus, take advantage of this and skip global ICMP ratelimit when the incoming device is loopback. In the unlikely event that the outgoing it not loopback, due to strange routing policy rules, ICMP rate limiting still works via peer ratelimiting via icmpv4_xrlim_allow(). Thus, we should still comply with RFC1812 (section 4.3.2.8 "Rate Limiting"). This seems to fix the reproducer given by Florian. While still avoiding to perform expensive and unneeded outgoing route lookup for rate limited packets (in the non-loopback case). Fixes: c0303efeab73 ("net: reduce cycles spend on ICMP replies that gets rate limited") Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: "H.J. Lu" <hjl.tools@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-13igmp: acquire pmc lock for ip_mc_clear_src()WANG Cong1-8/+13
Andrey reported a use-after-free in add_grec(): for (psf = *psf_list; psf; psf = psf_next) { ... psf_next = psf->sf_next; where the struct ip_sf_list's were already freed by: kfree+0xe8/0x2b0 mm/slub.c:3882 ip_mc_clear_src+0x69/0x1c0 net/ipv4/igmp.c:2078 ip_mc_dec_group+0x19a/0x470 net/ipv4/igmp.c:1618 ip_mc_drop_socket+0x145/0x230 net/ipv4/igmp.c:2609 inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411 sock_release+0x8d/0x1e0 net/socket.c:597 sock_close+0x16/0x20 net/socket.c:1072 This happens because we don't hold pmc->lock in ip_mc_clear_src() and a parallel mr_ifc_timer timer could jump in and access them. The RCU lock is there but it is merely for pmc itself, this spinlock could actually ensure we don't access them in parallel. Thanks to Eric and Long for discussion on this bug. Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12udp: try to avoid 2 cache miss on dequeuePaolo Abeni1-11/+103
when udp_recvmsg() is executed, on x86_64 and other archs, most skb fields are on cold cachelines. If the skb are linear and the kernel don't need to compute the udp csum, only a handful of skb fields are required by udp_recvmsg(). Since we already use skb->dev_scratch to cache hot data, and there are 32 bits unused on 64 bit archs, use such field to cache as much data as we can, and try to prefetch on dequeue the relevant fields that are left out. This can save up to 2 cache miss per packet. v1 -> v2: - changed udp_dev_scratch fields types to u{32,16} variant, replaced bitfiled with bool Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12udp: avoid a cache miss on dequeuePaolo Abeni1-1/+5
Since UDP no more uses sk->destructor, we can clear completely the skb head state before enqueuing. Amend and use skb_release_head_state() for that. All head states share a single cacheline, which is not normally used/accesses on dequeue. We can avoid entirely accessing such cacheline implementing and using in the UDP code a specialized skb free helper which ignores the skb head state. This saves a cacheline miss at skb deallocation time. v1 -> v2: replaced secpath_reset() with skb_release_head_state() Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-11net: ipmr: Fix some mroute forwarding issues in vrf'sDonald Sharp1-17/+15
This patch fixes two issues: 1) When forwarding on *,G mroutes that are in a vrf, the kernel was dropping information about the actual incoming interface when calling ip_mr_forward from ip_mr_input. This caused ip_mr_forward to send the multicast packet back out the incoming interface. Fix this by modifying ip_mr_forward to be handed the correctly resolved dev. 2) When a unresolved cache entry is created we store the incoming skb on the unresolved cache entry and upon mroute resolution from the user space daemon, we attempt to forward the packet. Again we were not resolving to the correct incoming device for a vrf scenario, before calling ip_mr_forward. Fix this by resolving to the correct interface and calling ip_mr_forward with the result. Fixes: e58e41596811 ("net: Enable support for VRF with ipv4 multicast") Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09Ipvlan should return an error when an address is already in use.Krister Johansen1-0/+33
The ipvlan code already knows how to detect when a duplicate address is about to be assigned to an ipvlan device. However, that failure is not propogated outward and leads to a silent failure. Introduce a validation step at ip address creation time and allow device drivers to register to validate the incoming ip addresses. The ipvlan code is the first consumer. If it detects an address in use, we can return an error to the user before beginning to commit the new ifa in the networking code. This can be especially useful if it is necessary to provision many ipvlans in containers. The provisioning software (or operator) can use this to detect situations where an ip address is unexpectedly in use. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08net: ipmr: add getlink supportNikolay Aleksandrov1-0/+126
Currently there's no way to dump the VIF table for an ipmr table other than the default (via proc). This is a major issue when debugging ipmr issues and in general it is good to know which interfaces are configured. This patch adds support for RTM_GETLINK for the ipmr family so we can dump the VIF table and the ipmr table's current config for each table. We're protected by rtnl so no need to acquire RCU or mrt_lock. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: add TCPMemoryPressuresChrono counterEric Dumazet3-6/+27
DRAM supply shortage and poor memory pressure tracking in TCP stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning limits) and tcp_mem[] quite hazardous. TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl limits being hit, but only tracking number of transitions. If TCP stack behavior under stress was perfect : 1) It would maintain memory usage close to the limit. 2) Memory pressure state would be entered for short times. We certainly prefer 100 events lasting 10ms compared to one event lasting 200 seconds. This patch adds a new SNMP counter tracking cumulative duration of memory pressure events, given in ms units. $ cat /proc/sys/net/ipv4/tcp_mem 3088 4117 6176 $ grep TCP /proc/net/sockstat TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140 $ nstat -n ; sleep 10 ; nstat |grep Pressure TcpExtTCPMemoryPressures 1700 TcpExtTCPMemoryPressuresChrono 5209 v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David instructed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: Namespaceify sysctl_tcp_timestampsEric Dumazet5-19/+22
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: Namespaceify sysctl_tcp_window_scalingEric Dumazet5-12/+12
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: Namespaceify sysctl_tcp_sackEric Dumazet5-13/+14
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: add a struct net parameter to tcp_parse_options()Eric Dumazet3-10/+14
We want to move some TCP sysctls to net namespaces in the future. tcp_window_scaling, tcp_sack and tcp_timestamps being fetched from tcp_parse_options(), we need to pass an extra parameter. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07net: Fix inconsistent teardown and release of private netdev state.David S. Miller2-3/+3
Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-3/+6
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04ipsec: check return value of skb_to_sgvec alwaysJason A. Donenfeld2-9/+19
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04net: ping: do not abuse udp_poll()Eric Dumazet1-1/+1
Alexander reported various KASAN messages triggered in recent kernels The problem is that ping sockets should not use udp_poll() in the first place, and recent changes in UDP stack finally exposed this old bug. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <alexander.levin@verizon.com> Cc: Solar Designer <solar@openwall.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Lorenzo Colitti <lorenzo@google.com> Acked-By: Lorenzo Colitti <lorenzo@google.com> Tested-By: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"Sowmini Varadhan1-0/+4
The command # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2 adds an entry like the following (listed by "arp -an") ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2 but the symmetric deletion command # arp -i eth2 -d 62.2.0.1 does not remove the PERM entry from the table, and instead leaves behind ? (62.2.0.1) at <incomplete> on eth2 The reason is that there is a refcnt of 1 for the arp_tbl itself (neigh_alloc starts off the entry with a refcnt of 1), thus the neigh_release() call from arp_invalidate() will (at best) just decrement the ref to 1, but will never actually free it from the table. To fix this, we need to do something like neigh_forced_gc: if the refcnt is 1 (i.e., on the table's ref), remove the entry from the table and free it. This patch refactors and shares common code between neigh_forced_gc and the newly added neigh_remove_one. A similar issue exists for IPv6 Neighbor Cache entries, and is fixed in a similar manner by this patch. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04dccp: consistently use dccp_write_space()Eric Dumazet1-1/+0
DCCP uses dccp_write_space() for sk->sk_write_space method. Unfortunately a passive connection (as provided by accept()) is using the generic sk_stream_write_space() function. Lets simply inherit sk->sk_write_space from the parent instead of forcing the generic one. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02tcp: remove unnecessary skb_reset_tail_pointer()Eric Dumazet1-4/+2
__pskb_trim_head() does not need to reset skb tail pointer. Also change the comments, __pskb_pull_head() does not exist. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02tcp: use TS opt on RTTs for congestion controlYuchung Cheng1-7/+8
Currently when a data packet is retransmitted, we do not compute an RTT sample for congestion control due to Kern's check. Therefore the congestion control that uses RTT signals may not receive any update during loss recovery which could last many round trips. For example, BBR and Vegas may not be able to update its min RTT estimation if the network path has shortened until it recovers from losses. This patch mitigates that by using TCP timestamp options for RTT measurement for congestion control. Note that we already use timestamps for RTT estimation. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02tcp: disallow cwnd undo when switching congestion controlYuchung Cheng1-0/+1
When the sender switches its congestion control during loss recovery, if the recovery is spurious then it may incorrectly revert cwnd and ssthresh to the older values set by a previous congestion control. Consider a congestion control (like BBR) that does not use ssthresh and keeps it infinite: the connection may incorrectly revert cwnd to an infinite value when switching from BBR to another congestion control. This patch fixes it by disallowing such cwnd undo operation upon switching congestion control. Note that undo_marker is not reset s.t. the packets that were incorrectly marked lost would be corrected. We only avoid undoing the cwnd in tcp_undo_cwnd_reduction(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-01ipv4: route: restore skb_dst_set in inet_rtm_getrouteRoopa Prabhu1-3/+4
recent updates to inet_rtm_getroute dropped skb_dst_set in inet_rtm_getroute. This patch restores it because it is needed to release the dst correctly. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31tcp: reinitialize MTU probing when setting MSS in a TCP repairDouglas Caetano dos Santos1-2/+4
MTU probing initialization occurred only at connect() and at SYN or SYN-ACK reception, but the former sets MSS to either the default or the user set value (through TCP_MAXSEG sockopt) and the latter never happens with repaired sockets. The result was that, with MTU probing enabled and unless TCP_MAXSEG sockopt was used before connect(), probing would be stuck at tcp_base_mss value until tcp_probe_interval seconds have passed. Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: add extack arg to lwtunnel build stateDavid Ahern4-15/+21
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: lwtunnel: Add extack to encap attr validationDavid Ahern1-2/+4
Pass extack down to lwtunnel_valid_encap_type and lwtunnel_valid_encap_type_attr. Add messages for unknown or unsupported encap types. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: ipv4: Add extack message for invalid prefix or lengthDavid Ahern2-9/+15
Add extack error message for invalid prefix length and invalid prefix. Example of the latter is a route spec containing 172.16.100.1/24, where the /24 mask means the lower 8-bits should be 0. Amazing how easy that one is to overlook when an EINVAL is returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: ipv4: refactor key and length checksDavid Ahern1-10/+15
fib_table_insert and fib_table_delete have the same checks on the prefix and length. Refactor into a helper. Avoids duplicate extack messages in the next patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-12/+29
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26ipv4: add reference counting to metricsEric Dumazet2-8/+19
Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: RTM_GETROUTE: return matched fib result when requestedRoopa Prabhu1-2/+11
This patch adds support to return matched fib result when RTM_F_FIB_MATCH flag is specified in RTM_GETROUTE request. This is useful for user-space applications/controllers wanting to query a matching route. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: Save trie prefix to fib lookup resultDavid Ahern1-0/+1
Prefix is needed for returning matching route spec on get route request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookupDavid Ahern1-8/+13
Convert inet_rtm_getroute to use ip_route_input_rcu and ip_route_output_key_hash_rcu passing the fib_result arg to both. The rcu lock is held through the creation of the response, so the rtable/dst does not need to be attached to the skb and is passed to rt_fill_info directly. In converting from ip_route_output_key to ip_route_output_key_hash_rcu the xfrm_lookup_route in ip_route_output_flow is dropped since flowi4_proto is not set for a route get request. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: Remove event arg to rt_fill_infoDavid Ahern1-4/+3
rt_fill_info has 1 caller with the event set to RTM_NEWROUTE. Given that remove the arg and use RTM_NEWROUTE directly in rt_fill_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: refactor ip_route_input_norefDavid Ahern1-29/+37
A later patch wants access to the fib result on an input route lookup with the rcu lock held. Refactor ip_route_input_noref pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper that takes the fib_result as an input arg. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26net: ipv4: refactor __ip_route_output_key_hashDavid Ahern2-23/+32
A later patch wants access to the fib result on an output route lookup with the rcu lock held. Refactor __ip_route_output_key_hash, pushing the logic between rcu_read_lock ... rcu_read_unlock into a new helper with the fib_result as an input arg. To keep the name length under control remove the leading underscores from the name and add _rcu to the name of the new helper indicating it is called with the rcu read lock held. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25arp: fixed -Wuninitialized compiler warningIhar Hrachyshka1-1/+1
Commit 7d472a59c0e5ec117220a05de6b370447fb6cb66 ("arp: always override existing neigh entries with gratuitous ARP") introduced a compiler warning: net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in this function [-Wmaybe-uninitialized] While the code logic seems to be correct and doesn't allow the variable to be used uninitialized, and the warning is not consistently reproducible, it's still worth fixing it for other people not to waste time looking at the warning in case it pops up in the build environment. Yes, compiler is probably at fault, but we will need to accommodate. Fixes: 7d472a59c0e5 ("arp: always override existing neigh entries with gratuitous ARP") Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25tcp: avoid fastopen API to be used on AF_UNSPECWei Wang1-2/+5
Fastopen API should be used to perform fastopen operations on the TCP socket. It does not make sense to use fastopen API to perform disconnect by calling it with AF_UNSPEC. The fastopen data path is also prone to race conditions and bugs when using with AF_UNSPEC. One issue reported and analyzed by Vegard Nossum is as follows: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Thread A: Thread B: ------------------------------------------------------------------------ sendto() - tcp_sendmsg() - sk_stream_memory_free() = 0 - goto wait_for_sndbuf - sk_stream_wait_memory() - sk_wait_event() // sleep | sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC) | - tcp_sendmsg() | - tcp_sendmsg_fastopen() | - __inet_stream_connect() | - tcp_disconnect() //because of AF_UNSPEC | - tcp_transmit_skb()// send RST | - return 0; // no reconnect! | - sk_stream_wait_connect() | - sock_error() | - xchg(&sk->sk_err, 0) | - return -ECONNRESET - ... // wake up, see sk->sk_err == 0 - skb_entail() on TCP_CLOSE socket If the connection is reopened then we will send a brand new SYN packet after thread A has already queued a buffer. At this point I think the socket internal state (sequence numbers etc.) becomes messed up. When the new connection is closed, the FIN-ACK is rejected because the sequence number is outside the window. The other side tries to retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which corrupts the skb data length and hits a BUG() in copy_and_csum_bits(). +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Hence, this patch adds a check for AF_UNSPEC in the fastopen data path and return EOPNOTSUPP to user if such case happens. Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25tcp: better validation of received ack sequencesEric Dumazet1-13/+11
Paul Fiterau Brostean reported : <quote> Linux TCP stack we analyze exhibits behavior that seems odd to me. The scenario is as follows (all packets have empty payloads, no window scaling, rcv/snd window size should not be a factor): TEST HARNESS (CLIENT) LINUX SERVER 1. - LISTEN (server listen, then accepts) 2. - --> <SEQ=100><CTL=SYN> --> SYN-RECEIVED 3. - <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED 4. - --> <SEQ=101><ACK=301><CTL=ACK> --> ESTABLISHED 5. - <-- <SEQ=301><ACK=101><CTL=FIN,ACK> <-- FIN WAIT-1 (server opts to close the data connection calling "close" on the connection socket) 6. - --> <SEQ=101><ACK=99999><CTL=FIN,ACK> --> CLOSING (client sends FIN,ACK with not yet sent acknowledgement number) 7. - <-- <SEQ=302><ACK=102><CTL=ACK> <-- CLOSING (ACK is 102 instead of 101, why?) ... (silence from CLIENT) 8. - <-- <SEQ=301><ACK=102><CTL=FIN,ACK> <-- CLOSING (retransmission, again ACK is 102) Now, note that packet 6 while having the expected sequence number, acknowledges something that wasn't sent by the server. So I would expect the packet to maybe prompt an ACK response from the server, and then be ignored. Yet it is not ignored and actually leads to an increase of the acknowledgement number in the server's retransmission of the FIN,ACK packet. The explanation I found is that the FIN in packet 6 was processed, despite the acknowledgement number being unacceptable. Further experiments indeed show that the server processes this FIN, transitioning to CLOSING, then on receiving an ACK for the FIN it had send in packet 5, the server (or better said connection) transitions from CLOSING to TIME_WAIT (as signaled by netstat). </quote> Indeed, tcp_rcv_state_process() calls tcp_ack() but does not exploit the @acceptable status but for TCP_SYN_RECV state. What we want here is to send a challenge ACK, if not in TCP_SYN_RECV state. TCP_FIN_WAIT1 state is not the only state we should fix. Add a FLAG_NO_CHALLENGE_ACK so that tcp_rcv_state_process() can choose to send a challenge ACK and discard the packet instead of wrongly change socket state. With help from Neal Cardwell. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Fiterau Brostean <p.fiterau-brostean@science.ru.nl> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-24tcp: fix TCP_SYNCNT flakesEric Dumazet1-15/+11
After the mentioned commit, some of our packetdrill tests became flaky. TCP_SYNCNT socket option can limit the number of SYN retransmits. retransmits_timed_out() has to compare times computations based on local_clock() while timers are based on jiffies. With NTP adjustments and roundings we can observe 999 ms delay for 1000 ms timers. We end up sending one extra SYN packet. Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value") makes no real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at all. Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set parameter from retransmits_timed_out() Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-23Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecDavid S. Miller1-1/+4
Steffen Klassert says: ==================== pull request (net): ipsec 2017-05-23 1) Fix wrong header offset for esp4 udpencap packets. 2) Fix a stack access out of bounds when creating a bundle with sub policies. From Sabrina Dubroca. 3) Fix slab-out-of-bounds in pfkey due to an incorrect sadb_x_sec_len calculation. 4) We checked the wrong feature flags when taking down an interface with IPsec offload enabled. Fix from Ilan Tayari. 5) Copy the anti replay sequence numbers when doing a state migration, otherwise we get out of sync with the sequence numbers. Fix from Antony Antony. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-17/+43
2017-05-22net: ipv4: tcp: fixed comment coding style issueRohit Chavan1-2/+3
Fixed a coding style issue Signed-off-by: Rohit Chavan <roheetchavan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22net: ipv4: Add extack messages for route add failuresDavid Ahern2-22/+95
Add messages for non-obvious errors (e.g, no need to add text for malloc failures or ENODEV failures). This mostly covers the annoying EINVAL errors Some message strings violate the 80-columns but searchable strings need to trump that rule. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22net: ipv4: Plumb extack through route add functionsDavid Ahern4-19/+26
Plumb extack argument down to route add functions. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>