aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv6/ip6_fib.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-08-03ipv6: fib: Add helpers to hold / drop a reference on rt6_infoIdo Schimmel1-10/+2
Similar to commit 1c677b3d2828 ("ipv4: fib: Add fib_info_hold() helper") and commit b423cb10807b ("ipv4: fib: Export free_fib_info()") add an helper to hold a reference on rt6_info and export rt6_release() to drop it and potentially release the route. This is needed so that drivers capable of FIB offload could hold a reference on the route before queueing it for offload and drop it after the route has been programmed to the device's tables. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03ipv6: fib: Unlink replaced routes from their nodesIdo Schimmel1-0/+2
When a route is deleted its node pointer is set to NULL to indicate it's no longer linked to its node. Do the same for routes that are replaced. This will later allow us to test if a route is still in the FIB by checking its node pointer instead of its reference count. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03ipv6: fib: Don't assume only nodes hold a reference on routesIdo Schimmel1-2/+0
The code currently assumes that only FIB nodes can hold a reference on routes. Therefore, after fib6_purge_rt() has run and the route is no longer present in any intermediate nodes, it's assumed that its reference count would be 1 - taken by the node where it's currently stored. However, we're going to allow users other than the FIB to take a reference on a route, so this assumption is no longer valid and the BUG_ON() needs to be removed. Note that purging only takes place if the initial reference count is different than 1. I've left that check intact, as in the majority of systems (where routes are only referenced by the FIB), it does actually mean the route is present in intermediate nodes. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03ipv6: fib: Dump tables during registration to FIB chainIdo Schimmel1-0/+92
Dump all the FIB tables in each net namespace upon registration to the FIB notification chain so that the callee will have a complete view of the tables. The integrity of the dump is ensured by a per-table sequence counter that is incremented (under write lock) whenever a route is added or deleted from the table. All the sequence counters are read (under each table's read lock) and summed, prior and after the dump. In case the counters differ, then the dump is either restarted or the registration fails. While it's possible for a table to be modified after its counter has been read, this isn't really a problem. In case it happened before it was read the second time, then the comparison at the end will fail. If it happened afterwards, then we're guaranteed to be notified about the change, as the notification block is registered prior to the second read. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03ipv6: fib: Add in-kernel notifications for route add / deleteIdo Schimmel1-0/+17
As with IPv4, allow listeners of the FIB notification chain to receive notifications whenever a route is added, replaced or deleted. This is done by placing calls to the FIB notification chain in the two lowest level functions that end up performing these operations - namely, fib6_add_rt2node() and fib6_del_route(). Unlike IPv4, APPEND notifications aren't sent as the kernel doesn't distinguish between "append" (NLM_F_CREATE|NLM_F_APPEND) and "prepend" (NLM_F_CREATE). If NLM_F_EXCL isn't set, duplicate routes are always added after the existing duplicate routes. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03ipv6: fib: Add FIB notifiers callbacksIdo Schimmel1-0/+7
We're about to add IPv6 FIB offload support, so implement the necessary callbacks in IPv6 code, which will later allow us to add routes and rules notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-06net: ipv6: Compare lwstate in detecting duplicate nexthopsDavid Ahern1-4/+1
Lennert reported a failure to add different mpls encaps in a multipath route: $ ip -6 route add 1234::/16 \ nexthop encap mpls 10 via fe80::1 dev ens3 \ nexthop encap mpls 20 via fe80::1 dev ens3 RTNETLINK answers: File exists The problem is that the duplicate nexthop detection does not compare lwtunnel configuration. Add it. Fixes: 19e42e451506 ("ipv6: support for fib route lwtunnel encap attributes") Signed-off-by: David Ahern <dsahern@gmail.com> Reported-by: João Taveira Araújo <joao.taveira@gmail.com> Reported-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Tested-by: Lennert Buytenhek <buytenh@wantstofly.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+1
Two entries being added at the same time to the IFLA policy table, whilst parallel bug fixes to decnet routing dst handling overlapping with the dst gc removal in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20ipv6: Do not leak throw route referencesSerhey Popovych1-2/+1
While commit 73ba57bfae4a ("ipv6: fix backtracking for throw routes") does good job on error propagation to the fib_rules_lookup() in fib rules core framework that also corrects throw routes handling, it does not solve route reference leakage problem happened when we return -EAGAIN to the fib_rules_lookup() and leave routing table entry referenced in arg->result. If rule with matched throw route isn't last matched in the list we overwrite arg->result losing reference on throw route stored previously forever. We also partially revert commit ab997ad40839 ("ipv6: fix the incorrect return value of throw route") since we never return routing table entry with dst.error == -EAGAIN when CONFIG_IPV6_MULTIPLE_TABLES is on. Also there is no point to check for RTF_REJECT flag since it is always set throw route. Fixes: 73ba57bfae4a ("ipv6: fix backtracking for throw routes") Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17net: remove DST_NOCACHE flagWei Wang1-3/+1
DST_NOCACHE flag check has been removed from dst_release() and dst_hold_safe() in a previous patch because all the dst are now ref counted properly and can be released based on refcnt only. Looking at the rest of the DST_NOCACHE use, all of them can now be removed or replaced with other checks. So this patch gets rid of all the DST_NOCACHE usage and remove this flag completely. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17ipv6: get rid of icmp6 dst garbage collectorWei Wang1-2/+1
icmp6 dst route is currently ref counted during creation and will be freed by user during its call of dst_release(). So no need of a garbage collector for it. Remove all icmp6 dst garbage collector related code. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17ipv6: mark DST_NOGC and remove the operation of dst_free()Wei Wang1-13/+2
With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv6 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv6 dst and remove the calls to dst_free() and its related functions. At this point, all dst created in ipv6 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Also, as icmp6 dst route is refcounted during creation and will be freed by user during its call of dst_release(), there is no need to add this dst to the icmp6 gc list as well. Instead, we need to add it into uncached list so that when a NETDEV_DOWN/NETDEV_UNREGISRER event comes, we can properly go through these icmp6 dst as well and release the net device properly. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17ipv6: call dst_dev_put() properlyWei Wang1-0/+2
As the intend of this patch series is to completely remove dst gc, we need to call dst_dev_put() to release the reference to dst->dev when removing routes from fib because we won't keep the gc list anymore and will lose the dst pointer right after removing the routes. Without the gc list, there is no way to find all the dst's that have dst->dev pointing to the going-down dev. Hence, we are doing dst_dev_put() immediately before we lose the last reference of the dst from the routing code. The next dst_check() will trigger a route re-lookup to find another route (if there is any). Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17ipv6: take dst->__refcnt for insertion into fib6 treeWei Wang1-1/+11
In IPv6 routing code, struct rt6_info is created for each static route and RTF_CACHE route and inserted into fib6 tree. In both cases, dst ref count is not taken. As explained in the previous patch, this leads to the need of the dst garbage collector. This patch holds ref count of dst before inserting the route into fib6 tree and properly releases the dst when deleting it from the fib6 tree as a preparation in order to fully get rid of dst gc later. Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate no user is referencing the dst. And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts dst->__refcnt to 1. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22net: ipv6: Add extack messages for route add failuresDavid Ahern1-0/+4
Add messages for non-obvious errors (e.g, no need to add text for malloc failures or ENODEV failures). This mostly covers the annoying EINVAL errors Some message strings violate the 80-columns but searchable strings need to trump that rule. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22net: ipv6: Plumb extack through route add functionsDavid Ahern1-5/+9
Plumb extack argument down to route add functions. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13ipv6: make ECMP route replacement less greedySabrina Dubroca1-0/+2
Commit 27596472473a ("ipv6: fix ECMP route replacement") introduced a loop that removes all siblings of an ECMP route that is being replaced. However, this loop doesn't stop when it has replaced siblings, and keeps removing other routes with a higher metric. We also end up triggering the WARN_ON after the loop, because after this nsiblings < 0. Instead, stop the loop when we have taken care of all routes with the same metric as the route being replaced. Reproducer: =========== #!/bin/sh ip netns add ns1 ip netns add ns2 ip -net ns1 link set lo up for x in 0 1 2 ; do ip link add veth$x netns ns2 type veth peer name eth$x netns ns1 ip -net ns1 link set eth$x up ip -net ns2 link set veth$x up done ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \ nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2 ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256 ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048 echo "before replace, 3 routes" ip -net ns1 -6 r | grep -v '^fe80\|^ff00' echo ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \ nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2 echo "after replace, only 2 routes, metric 2048 is gone" ip -net ns1 -6 r | grep -v '^fe80\|^ff00' Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Change notifications for multipath delete to RTA_MULTIPATHDavid Ahern1-1/+2
If an entire multipath route is deleted using prefix and len (without any nexthops), send a single RTM_DELROUTE notification with the full route using RTA_MULTIPATH. This is done by generating the skb before the route delete when all of the sibling routes are still present but sending it after the route has been removed from the FIB. The skip_notify flag is used to tell the lower fib code not to send notifications for the individual nexthop routes. If a route is deleted using RTA_MULTIPATH for any nexthops or a single nexthop entry is deleted, then the nexthops are deleted one at a time with notifications sent as each hop is deleted. This is necessary given that IPv6 allows individual hops within a route to be deleted. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Change notifications for multipath add to RTA_MULTIPATHDavid Ahern1-2/+4
Change ip6_route_multipath_add to send one notifciation with the full route encoded with RTA_MULTIPATH instead of a series of individual routes. This is done by adding a skip_notify flag to the nl_info struct. The flag is used to skip sending of the notification in the fib code that actually inserts the route. Once the full route has been added, a notification is generated with all nexthops. ip6_route_multipath_add handles 3 use cases: new routes, route replace, and route append. The multipath notification generated needs to be consistent with the order of the nexthops and it should be consistent with the order in a FIB dump which means the route with the first nexthop needs to be used as the route reference. For the first 2 cases (new and replace), a reference to the route used to send the notification is obtained by saving the first route added. For the append case, the last route added is used to loop back to its first sibling route which is the first nexthop in the multipath route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attributeDavid Ahern1-0/+10
IPv6 returns multipath routes as a series of individual routes making their display and handling by userspace different and more complicated than IPv4, putting the burden on the user to see that a route is part of a multipath route and internally creating a multipath route if desired (e.g., libnl does this as of commit 29b71371e764). This patch addresses this difference, allowing multipath routes to be returned using the RTA_MULTIPATH attribute. The end result is that IPv6 multipath routes can be treated and displayed in a format similar to IPv4: $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 metric 1024 nexthop via 2001:db8:1::2 dev eth1 weight 1 nexthop via 2001:db8:2::2 dev eth2 weight 1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01net: ipv6: add NLM_F_APPEND in notifications when applicableDavid Ahern1-0/+3
IPv6 does not set the NLM_F_APPEND flag in notifications to signal that a NEWROUTE is an append versus a new route or a replaced one. Add the flag if the request has it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09ipv6: report NLM_F_CREATE and NLM_F_EXCL flags in RTM_NEWROUTE eventsGuillaume Nault1-1/+5
Since commit 37a1d3611c12 ("ipv6: include NLM_F_REPLACE in route replace notifications"), RTM_NEWROUTE notifications have their NLM_F_REPLACE flag set if the new route replaced a preexisting one. However, other flags aren't set. This patch reports the missing NLM_F_CREATE and NLM_F_EXCL flag bits. NLM_F_APPEND is not reported, because in ipv6 a NLM_F_CREATE request is interpreted as an append request (contrary to ipv4, "prepend" is not supported, so if NLM_F_EXCL is not set then NLM_F_APPEND is implicit). As a result, the possible flag combination can now be reported (iproute2's terminology into parentheses): * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation ("add"). * NLM_F_CREATE: route did already exist, new route added after preexisting ones ("append"). * NLM_F_REPLACE: route did already exist, new route replaced the first preexisting one ("change"). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05ipv6: Fix mem leak in rt6i_pcpuMartin KaFai Lau1-0/+1
It was first reported and reproduced by Petr (thanks!) in https://bugzilla.kernel.org/show_bug.cgi?id=119581 free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy(). However, after fixing a deadlock bug in commit 9c7370a166b4 ("ipv6: Fix a potential deadlock when creating pcpu rt"), free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL. It is worth to note that rt6i_pcpu is protected by table->tb6_lock. kmemleak somehow did not report it. We nailed it down by observing the pcpu entries in /proc/vmallocinfo (first suggested by Hannes, thanks!). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Fixes: 9c7370a166b4 ("ipv6: Fix a potential deadlock when creating pcpu rt") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Tested-by: Petr Novopashenniy <pety@rusnet.ru> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06net: vrf: Create FIB tables on link createDavid Ahern1-0/+1
Tables have to exist for VRFs to function. Ensure they exist when VRF device is created. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08ipv6: per netns FIB garbage collectionMichal Kubeček1-5/+4
One of our customers observed issues with FIB6 garbage collectors running in different network namespaces blocking each other, resulting in soft lockups (fib6_run_gc() initiated from timer runs always in forced mode). Now that FIB6 walkers are separated per namespace, there is no more need for instances of fib6_run_gc() in different namespaces blocking each other. There is still a call to icmp6_dst_gc() which operates on shared data but this function is protected by its own shared lock. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08ipv6: per netns fib6 walkersMichal Kubeček1-32/+36
The IPv6 FIB data structures are separated per network namespace but there is still only one global walkers list and one global walker list lock. This means changes in one namespace unnecessarily interfere with walkers in other namespaces. Replace the global list with per-netns lists (and give each its own lock). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08ipv6: replace global gc_args with local variableMichal Kubeček1-6/+8
Global variable gc_args is only used in fib6_run_gc() and functions called from it. As fib6_run_gc() makes sure there is at most one instance of fib6_clean_all() running at any moment, we can replace gc_args with a local variable which will be needed once multiple instances (per netns) of garbage collector are allowed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+11
Conflicts: net/ipv6/xfrm6_output.c net/openvswitch/flow_netlink.c net/openvswitch/vport-gre.c net/openvswitch/vport-vxlan.c net/openvswitch/vport.c net/openvswitch/vport.h The openvswitch conflicts were overlapping changes. One was the egress tunnel info fix in 'net' and the other was the vport ->send() op simplification in 'net-next'. The xfrm6_output.c conflicts was also a simplification overlapping a bug fix. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23ipv6: fix the incorrect return value of throw routelucien1-1/+11
The error condition -EAGAIN, which is signaled by throw routes, tells the rules framework to walk on searching for next matches. If the walk ends and we stop walking the rules with the result of a throw route we have to translate the error conditions to -ENETUNREACH. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13net: Export fib6_get_table and nd_tblDavid Ahern1-0/+1
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17ipv6: include NLM_F_REPLACE in route replace notificationsRoopa Prabhu1-3/+3
This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications. This makes nlm_flags in ipv6 replace notifications consistent with ipv4. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-15ipv6: Replace spinlock with seqlock and rcu in ip6_tunnelMartin KaFai Lau1-2/+7
This patch uses a seqlock to ensure consistency between idst->dst and idst->cookie. It also makes dst freeing from fib tree to undergo a rcu grace period. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-15ipv6: Avoid double dst_freeMartin KaFai Lau1-2/+9
It is a prep work to get dst freeing from fib tree undergo a rcu grace period. The following is a common paradigm: if (ip6_del_rt(rt)) dst_free(rt) which means, if rt cannot be deleted from the fib tree, dst_free(rt) now. 1. We don't know the ip6_del_rt(rt) failure is because it was not managed by fib tree (e.g. DST_NOCACHE) or it had already been removed from the fib tree. 2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means dst_free(rt) has been called already. A second dst_free(rt) is not always obviously safe. The rt may have been destroyed already. 3. If rt is a DST_NOCACHE, dst_free(rt) should not be called. 4. It is a stopper to make dst freeing from fib tree undergo a rcu grace period. This patch is to use a DST_NOCACHE flag to indicate a rt is not managed by the fib tree. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+2
Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20route: move lwtunnel state to dst_entryJiri Benc1-1/+0
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17ipv6: Fix a potential deadlock when creating pcpu rtMartin KaFai Lau1-0/+2
rt6_make_pcpu_route() is called under read_lock(&table->tb6_lock). rt6_make_pcpu_route() calls ip6_rt_pcpu_alloc(rt) which then calls dst_alloc(). dst_alloc() _may_ call ip6_dst_gc() which takes the write_lock(&tabl->tb6_lock). A visualized version: read_lock(&table->tb6_lock); rt6_make_pcpu_route(); => ip6_rt_pcpu_alloc(); => dst_alloc(); => ip6_dst_gc(); => write_lock(&table->tb6_lock); /* oops */ The fix is to do a read_unlock first before calling ip6_rt_pcpu_alloc(). A reported stack: [141625.537638] INFO: rcu_sched self-detected stall on CPU { 27} (t=60000 jiffies g=4159086 c=4159085 q=2139) [141625.547469] Task dump for CPU 27: [141625.550881] mtr R running task 0 22121 22081 0x00000008 [141625.558069] 0000000000000000 ffff88103f363d98 ffffffff8106e488 000000000000001b [141625.565641] ffffffff81684900 ffff88103f363db8 ffffffff810702b0 0000000008000000 [141625.573220] ffffffff81684900 ffff88103f363de8 ffffffff8108df9f ffff88103f375a00 [141625.580803] Call Trace: [141625.583345] <IRQ> [<ffffffff8106e488>] sched_show_task+0xc1/0xc6 [141625.589650] [<ffffffff810702b0>] dump_cpu_task+0x35/0x39 [141625.595144] [<ffffffff8108df9f>] rcu_dump_cpu_stacks+0x6a/0x8c [141625.601320] [<ffffffff81090606>] rcu_check_callbacks+0x1f6/0x5d4 [141625.607669] [<ffffffff810940c8>] update_process_times+0x2a/0x4f [141625.613925] [<ffffffff8109fbee>] tick_sched_handle+0x32/0x3e [141625.619923] [<ffffffff8109fc2f>] tick_sched_timer+0x35/0x5c [141625.625830] [<ffffffff81094a1f>] __hrtimer_run_queues+0x8f/0x18d [141625.632171] [<ffffffff81094c9e>] hrtimer_interrupt+0xa0/0x166 [141625.638258] [<ffffffff8102bf2a>] local_apic_timer_interrupt+0x4e/0x52 [141625.645036] [<ffffffff8102c36f>] smp_apic_timer_interrupt+0x39/0x4a [141625.651643] [<ffffffff8140b9e8>] apic_timer_interrupt+0x68/0x70 [141625.657895] <EOI> [<ffffffff81346ee8>] ? dst_destroy+0x7c/0xb5 [141625.664188] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.670272] [<ffffffff81082b45>] ? queue_write_lock_slowpath+0x60/0x6f [141625.677140] [<ffffffff8140aa33>] _raw_write_lock_bh+0x23/0x25 [141625.683218] [<ffffffff813d4553>] __fib6_clean_all+0x40/0x82 [141625.689124] [<ffffffff813d45b5>] ? fib6_flush_trees+0x20/0x20 [141625.695207] [<ffffffff813d6058>] fib6_clean_all+0xe/0x10 [141625.700854] [<ffffffff813d60d3>] fib6_run_gc+0x79/0xc8 [141625.706329] [<ffffffff813d0510>] ip6_dst_gc+0x85/0xf9 [141625.711718] [<ffffffff81346d68>] dst_alloc+0x55/0x159 [141625.717105] [<ffffffff813d09b5>] __ip6_dst_alloc.isra.32+0x19/0x63 [141625.723620] [<ffffffff813d1830>] ip6_pol_route+0x36a/0x3e8 [141625.729441] [<ffffffff813d18d6>] ip6_pol_route_output+0x11/0x13 [141625.735700] [<ffffffff813f02c8>] fib6_rule_action+0xa7/0x1bf [141625.741698] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.748043] [<ffffffff81357c48>] fib_rules_lookup+0xb5/0x12a [141625.754050] [<ffffffff81141628>] ? poll_select_copy_remaining+0xf9/0xf9 [141625.761002] [<ffffffff813f0535>] fib6_rule_lookup+0x37/0x5c [141625.766914] [<ffffffff813d18c5>] ? ip6_pol_route_input+0x17/0x17 [141625.773260] [<ffffffff813d008c>] ip6_route_output+0x7a/0x82 [141625.779177] [<ffffffff813c44c8>] ip6_dst_lookup_tail+0x53/0x112 [141625.785437] [<ffffffff813c45c3>] ip6_dst_lookup_flow+0x2a/0x6b [141625.791604] [<ffffffff813ddaab>] rawv6_sendmsg+0x407/0x9b6 [141625.797423] [<ffffffff813d7914>] ? do_ipv6_setsockopt.isra.8+0xd87/0xde2 [141625.804464] [<ffffffff8139d4b4>] inet_sendmsg+0x57/0x8e [141625.810028] [<ffffffff81329ba3>] sock_sendmsg+0x2e/0x3c [141625.815588] [<ffffffff8132be57>] SyS_sendto+0xfe/0x143 [141625.821063] [<ffffffff813dd551>] ? rawv6_setsockopt+0x5e/0x67 [141625.827146] [<ffffffff8132c9f8>] ? sock_common_setsockopt+0xf/0x11 [141625.833660] [<ffffffff8132c08c>] ? SyS_setsockopt+0x81/0xa2 [141625.839565] [<ffffffff8140ac17>] entry_SYSCALL_64_fastpath+0x12/0x6a Fixes: d52d3997f843 ("pv6: Create percpu rt6_info") Signed-off-by: Martin KaFai Lau <kafai@fb.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27lwtunnel: change prototype of lwtunnel_state_get()Nicolas Dichtel1-1/+1
It saves some lines and simplify a bit the code when the state is returning by this function. It's also useful to handle a NULL entry. To avoid too long lines, I've also renamed lwtunnel_state_get() and lwtunnel_state_put() to lwtstate_get() and lwtstate_put(). CC: Thomas Graf <tgraf@suug.ch> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ipv6: support for fib route lwtunnel encap attributesRoopa Prabhu1-0/+2
This patch adds support in ipv6 fib functions to parse Netlink RTA encap attributes and attach encap state data to rt6_info. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25ipv6: Create percpu rt6_infoMartin KaFai Lau1-1/+23
After the patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception', we need to compensate the performance hit (bouncing dst->__refcnt). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25ipv6: Only create RTF_CACHE routes after encountering pmtu exceptionMartin KaFai Lau1-0/+1
This patch creates a RTF_CACHE routes only after encountering a pmtu exception. After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6 tree, the rt->rt6i_node->fn_sernum is bumped which will fail the ip6_dst_check() and trigger a relookup. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20ipv6: fix ECMP route replacementMichal Kubeček1-2/+37
When replacing an IPv6 multipath route with "ip route replace", i.e. NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first matching route without fixing its siblings, resulting in corrupted siblings linked list; removing one of the siblings can then end in an infinite loop. IPv6 ECMP implementation is a bit different from IPv4 so that route replacement cannot work in exactly the same way. This should be a reasonable approximation: 1. If the new route is ECMP-able and there is a matching ECMP-able one already, replace it and all its siblings (if any). 2. If the new route is ECMP-able and no matching ECMP-able route exists, replace first matching non-ECMP-able (if any) or just add the new one. 3. If the new route is not ECMP-able, replace first matching non-ECMP-able route (if any) or add the new route. We also need to remove the NLM_F_REPLACE flag after replacing old route(s) by first nexthop of an ECMP route so that each subsequent nexthop does not replace previous one. Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31ipv6: coding style: comparison for inequality with NULLIan Morris1-2/+2
The ipv6 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-19/+26
Conflicts: arch/arm/boot/dts/imx6sx-sdb.dts net/sched/cls_bpf.c Two simple sets of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos tooHannes Frederic Sowa1-19/+26
Lubomir Rintel reported that during replacing a route the interface reference counter isn't correctly decremented. To quote bug <https://bugzilla.kernel.org/show_bug.cgi?id=91941>: | [root@rhel7-5 lkundrak]# sh -x lal | + ip link add dev0 type dummy | + ip link set dev0 up | + ip link add dev1 type dummy | + ip link set dev1 up | + ip addr add 2001:db8:8086::2/64 dev dev0 | + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20 | + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10 | + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20 | + ip link del dev0 type dummy | Message from syslogd@rhel7-5 at Jan 23 10:54:41 ... | kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2 | | Message from syslogd@rhel7-5 at Jan 23 10:54:51 ... | kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2 During replacement of a rt6_info we must walk all parent nodes and check if the to be replaced rt6_info got propagated. If so, replace it with an alive one. Fixes: 4a287eba2de3957 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag") Reported-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18netlink: make nlmsg_end() and genlmsg_end() voidJohannes Berg1-1/+0
Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05net: fib6: convert cfg metric to u32 outside of table write lockFlorian Westphal1-33/+36
Do the nla validation earlier, outside the write lock. This is needed by followup patch which needs to be able to call request_module (which can sleep) if needed. Joint work with Daniel Borkmann. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05net: fib6: fib6_commit_metrics: fix potential NULL pointer dereferenceDaniel Borkmann1-7/+6
When IPv6 host routes with metrics attached are being added, we fetch the metrics store from the dst via COW through dst_metrics_write_ptr(), added through commit e5fd387ad5b3. One remaining problem here is that we actually call into inet_getpeer() and may end up allocating/creating a new peer from the kmemcache, which may fail. Example trace from perf probe (inet_getpeer:41) where create is 1: ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293) 85e294 inet_getpeer.part.7 (<- kmem_cache_alloc()) 85e578 inet_getpeer 8eb333 ipv6_cow_metrics 8f10ff fib6_commit_metrics Therefore, a check for NULL on the return of dst_metrics_write_ptr() is necessary here. Joint work with Florian Westphal. Fixes: e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely") Cc: Michal Kubeček <mkubecek@suse.cz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07ipv6: don't walk node's leaf during serial number updateHannes Frederic Sowa1-17/+30
Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07ipv6: make fib6 serial number per namespaceHannes Frederic Sowa1-7/+6
Try to reduce number of possible fn_sernum mutation by constraining them to their namespace. Also remove rt_genid which I forgot to remove in 705f1c869d577c ("ipv6: remove rt6i_genid"). Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07ipv6: only generate one new serial number per fib mutationHannes Frederic Sowa1-6/+6
Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>