aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4/fib_trie.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-01-19Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-0/+6
2020-01-14ipv4: Add "offload" and "trap" indications to routesIdo Schimmel1-0/+52
When performing L3 offload, routes and nexthops are usually programmed into two different tables in the underlying device. Therefore, the fact that a nexthop resides in hardware does not necessarily mean that all the associated routes also reside in hardware and vice-versa. While the kernel can signal to user space the presence of a nexthop in hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag for routes. In addition, the fact that a route resides in hardware does not necessarily mean that the traffic is offloaded. For example, unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap packets to the CPU so that the kernel will be able to generate the appropriate ICMP error packet. This patch adds an "offload" and "trap" indications to IPv4 routes, so that users will have better visibility into the offload process. 'struct fib_alias' is extended with two new fields that indicate if the route resides in hardware or not and if it is offloading traffic from the kernel or trapping packets to it. Note that the new fields are added in the 6 bytes hole and therefore the struct still fits in a single cache line [1]. Capable drivers are expected to invoke fib_alias_hw_flags_set() with the route's key in order to set the flags. The indications are dumped to user space via a new flags (i.e., 'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the ancillary header. v2: * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set() [1] struct fib_alias { struct hlist_node fa_list; /* 0 16 */ struct fib_info * fa_info; /* 16 8 */ u8 fa_tos; /* 24 1 */ u8 fa_type; /* 25 1 */ u8 fa_state; /* 26 1 */ u8 fa_slen; /* 27 1 */ u32 tb_id; /* 28 4 */ s16 fa_default; /* 32 2 */ u8 offload:1; /* 34: 0 1 */ u8 trap:1; /* 34: 1 1 */ u8 unused:6; /* 34: 2 1 */ /* XXX 5 bytes hole, try to pack */ struct callback_head rcu __attribute__((__aligned__(8))); /* 40 16 */ /* size: 56, cachelines: 1, members: 12 */ /* sum members: 50, holes: 1, sum holes: 5 */ /* sum bitfield members: 8 bits (1 bytes) */ /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */ /* last cacheline: 56 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14ipv4: Encapsulate function arguments in a structIdo Schimmel1-5/+9
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages using the passed arguments. Currently, the function takes 11 arguments, 6 of which are attributes of the route being dumped (e.g., prefix, TOS). The next patch will need the function to also dump to user space an indication if the route is present in hardware or not. Instead of passing yet another argument, change the function to take a struct containing the different route attributes. v2: * Name last argument of fib_dump_info() * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could later be passed to fib_alias_hw_flags_set() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14ipv4: Replace route in list before notifyingIdo Schimmel1-4/+7
Subsequent patches will add an offload / trap indication to routes which will signal if the route is present in hardware or not. After programming the route to the hardware, drivers will have to ask the IPv4 code to set the flags by passing the route's key. In the case of route replace, the new route is notified before it is actually inserted into the FIB alias list. This can prevent simple drivers (e.g., netdevsim) that program the route to the hardware in the same context it is notified in from being able to set the flag. Solve this by first inserting the new route to the list and rollback the operation in case the route was vetoed. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10ipv4: Detect rollover in specific fib table dumpDavid Ahern1-0/+6
Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has been added to a table. The looping is caused by the key rolling over from FFFFFFFF to 0. When dumping a specific table only, we need a means to detect when the table dump is done. The key and count saved to cb args are both 0 only at the start of the table dump. If key is 0 and count > 0, then we are in the rollover case. Detect and return to avoid looping. This only affects dumps of a specific table; for dumps of all tables (the case prior to the change in the Fixes tag) inet_dump_fib moved the entry counter to the next table and reset the cb args used by fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back to 0 did not cause looping with the dumps. Fixes: effe67926624 ("net: Enable kernel side filtering of route dumps") Reported-by: Sven-Haegar Koch <haegar@sdinet.de> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Remove old route notifications and convert listenersIdo Schimmel1-31/+7
Unlike mlxsw, the other listeners to the FIB notification chain do not require any special modifications as they never considered multiple identical routes. This patch removes the old route notifications and converts all the listeners to use the new replace / delete notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Only Replay routes of interest to new listenersIdo Schimmel1-0/+11
When a new listener is registered to the FIB notification chain it receives a dump of all the available routes in the system. Instead, make sure to only replay the IPv4 routes that are actually used in the data path and are of any interest to the new listener. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Handle route deletion notification during flushIdo Schimmel1-0/+2
In a similar fashion to previous patch, when a route is deleted as part of table flushing, promote the next route in the list, if exists. Otherwise, simply emit a delete notification. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Handle route deletion notificationIdo Schimmel1-0/+31
When a route is deleted we potentially need to promote the next route in the FIB alias list (e.g., with an higher metric). In case we find such a route, a replace notification is emitted. Otherwise, a delete notification for the deleted route. v2: * Convert to use fib_find_alias() instead of fib_find_first_alias() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Notify newly added route if should be offloadedIdo Schimmel1-0/+10
When a route is added, it should only be notified in case it is the first route in the FIB alias list with the given {prefix, prefix length, table ID}. Otherwise, it is not used in the data path and should not be considered by switch drivers. v2: * Convert to use fib_find_alias() instead of fib_find_first_alias() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Notify route if replacing currently offloaded oneIdo Schimmel1-0/+11
When replacing a route, its replacement should only be notified in case the replaced route is of any interest to listeners. In other words, if the replaced route is currently used in the data path, which means it is the first route in the FIB alias list with the given {prefix, prefix length, table ID}. v2: * Convert to use fib_find_alias() instead of fib_find_first_alias() Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Extend FIB alias find functionIdo Schimmel1-3/+8
Extend the function with another argument, 'find_first'. When set, the function returns the first FIB alias with the matching {prefix, prefix length, table ID}. The TOS and priority parameters are ignored. Current callers are converted to pass 'false' in order to maintain existing behavior. This will be used by subsequent patches in the series. v2: * New patch Signed-off-by: Ido Schimmel <idosch@mellanox.com> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16ipv4: Notify route after insertion to the routing tableIdo Schimmel1-15/+14
Currently, a new route is notified in the FIB notification chain before it is inserted to the FIB alias list. Subsequent patches will use the placement of the new route in the ordered FIB alias list in order to determine if the route should be notified or not. As a preparatory step, change the order so that the route is first inserted into the FIB alias list and only then notified. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04net: fib_notifier: propagate extack down to the notifier block callbackJiri Pirko1-7/+13
Since errors are propagated all the way up to the caller, propagate possible extack of the caller all the way down to the notifier block callback. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04net: fib_notifier: propagate possible error during fib notifier registrationJiri Pirko1-9/+22
Unlike events for registered notifier, during the registration, the errors that happened for the block being registered are not propagated up to the caller. Make sure the error is propagated for FIB rules and entries. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04net: fib_notifier: make FIB notifier per-netnsJiri Pirko1-9/+8
Currently all users of FIB notifier only cares about events in init_net. Later in this patchset, users get interested in other namespaces too. However, for every registered block user is interested only about one namespace. Make the FIB notifier registration per-netns and avoid unnecessary calls of notifier block for other namespaces. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net: route dump netlink NLM_F_MULTI flag missingJohn Fastabend1-1/+1
An excerpt from netlink(7) man page, In multipart messages (multiple nlmsghdr headers with associated payload in one byte stream) the first and all following headers have the NLM_F_MULTI flag set, except for the last header which has the type NLMSG_DONE. but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a FIB dump. The result is user space applications following above man page excerpt may get confused and may stop parsing msg believing something went wrong. In the golang netlink lib [0] the library logic stops parsing believing the message is not a multipart message. Found this running Cilium[1] against net-next while adding a feature to auto-detect routes. I noticed with multiple route tables we no longer could detect the default routes on net tree kernels because the library logic was not returning them. Fix this by handling the fib_dump_info_fnhe() case the same way the fib_dump_info() handles it by passing the flags argument through the call chain and adding a flags argument to rt_fill_info(). Tested with Cilium stack and auto-detection of routes works again. Also annotated libs to dump netlink msgs and inspected NLM_F_MULTI and NLMSG_DONE flags look correct after this. Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same as is done for fib_dump_info() so this looks correct to me. [0] https://github.com/vishvananda/netlink/ [1] https://github.com/cilium/ Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02ipv4: Fix off-by-one in route dump counter without netlink strict checkingStefano Brivio1-8/+14
In commit ee28906fd7a1 ("ipv4: Dump route exceptions if requested") I added a counter of per-node dumped routes (including actual routes and exceptions), analogous to the existing counter for dumped nodes. Dumping exceptions means we need to also keep track of how many routes are dumped for each node: this would be just one route per node, without exceptions. When netlink strict checking is not enabled, we dump both routes and exceptions at the same time: the RTM_F_CLONED flag is not used as a filter. In this case, the per-node counter 'i_fa' is incremented by one to track the single dumped route, then also incremented by one for each exception dumped, and then stored as netlink callback argument as skip counter, 's_fa', to be used when a partial dump operation restarts. The per-node counter needs to be increased by one also when we skip a route (exception) due to a previous non-zero skip counter, because it needs to match the existing skip counter, if we are dumping both routes and exceptions. I missed this, and only incremented the counter, for regular routes, if the previous skip counter was zero. This means that, in case of a mixed dump, partial dump operations after the first one will start with a mismatching skip counter value, one less than expected. This means in turn that the first exception for a given node is skipped every time a partial dump operation restarts, if netlink strict checking is not enabled (iproute < 5.0). It turns out I didn't repeat the test in its final version, commit de755a85130e ("selftests: pmtu: Introduce list_flush_ipv4_exception test case"), which also counts the number of route exceptions returned, with iproute2 versions < 5.0 -- I was instead using the equivalent of the IPv6 test as it was before commit b964641e9925 ("selftests: pmtu: Make list_flush_ipv6_exception test more demanding"). Always increment the per-node counter by one if we previously dumped a regular route, so that it matches the current skip counter. Fixes: ee28906fd7a1 ("ipv4: Dump route exceptions if requested") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24ipv4: Dump route exceptions if requestedStefano Brivio1-13/+31
Since commit 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions."), cached exception routes are stored as a separate entity, so they are not dumped on a FIB dump, even if the RTM_F_CLONED flag is passed. This implies that the command 'ip route list cache' doesn't return any result anymore. If the RTM_F_CLONED is passed, and strict checking requested, retrieve nexthop exception routes and dump them. If no strict checking is requested, filtering can't be performed consistently: dump everything in that case. With this, we need to add an argument to the netlink callback in order to track how many entries were already dumped for the last leaf included in a partial netlink dump. A single additional argument is sufficient, even if we traverse logically nested structures (nexthop objects, hash table buckets, bucket chains): it doesn't matter if we stop in the middle of any of those, because they are always traversed the same way. As an example, s_i values in [], s_fa values in (): node (fa) #1 [1] nexthop #1 bucket #1 -> #0 in chain (1) bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4) bucket #3 -> #0 in chain (5) -> #1 in chain (6) nexthop #2 bucket #1 -> #0 in chain (7) -> #1 in chain (8) bucket #2 -> #0 in chain (9) -- node (fa) #2 [2] nexthop #1 bucket #1 -> #0 in chain (1) -> #1 in chain (2) bucket #2 -> #0 in chain (3) it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2) for "node #2": walking flattens all that. It would even be possible to drop the distinction between the in-tree (s_i) and in-node (s_fa) counter, but a further improvement might advise against this. This is only as accurate as the existing tracking mechanism for leaves: if a partial dump is restarted after exceptions are removed or expired, we might skip some non-dumped entries. To improve this, we could attach a 'sernum' attribute (similar to the one used for IPv6) to nexthop entities, and bump this counter whenever exceptions change: having a distinction between the two counters would make this more convenient. Listing of exception routes (modified routes pre-3.5) was tested against these versions of kernel and iproute2: iproute2 kernel 4.14.0 4.15.0 4.19.0 5.0.0 5.1.0 3.5-rc4 + + + + + 4.4 4.9 4.14 4.15 4.19 5.0 5.1 fixed + + + + + v7: - Move loop over nexthop objects to route.c, and pass struct fib_info and table ID to it, not a struct fib_alias (suggested by David Ahern) - While at it, note that the NULL check on fa->fa_info is redundant, and the check on RTNH_F_DEAD is also not consistent with what's done with regular route listing: just keep it for nhc_flags - Rename entry point function for dumping exceptions to fib_dump_info_fnhe(), and rearrange arguments for consistency with fib_dump_info() - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle one bucket at a time - Expand commit message to describe why we can have a single "skip" counter for all exceptions stored in bucket chains in nexthop objects (suggested by David Ahern) v6: - Rebased onto net-next - Loop over nexthop paths too. Move loop over fnhe buckets to route.c, avoids need to export rt_fill_info() and to touch exceptions from fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that (suggested by David Ahern) Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net/ipv4: fib_trie: Avoid cryptic ternary expressionsMatthias Kaehlcke1-2/+8
empty_child_inc/dec() use the ternary operator for conditional operations. The conditions involve the post/pre in/decrement operator and the operation is only performed when the condition is *not* true. This is hard to parse for humans, use a regular 'if' construct instead and perform the in/decrement separately. This also fixes two warnings that are emitted about the value of the ternary expression being unused, when building the kernel with clang + "kbuild: Remove unnecessary -Wno-unused-value" (https://lore.kernel.org/patchwork/patch/1089869/): CC net/ipv4/fib_trie.o net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value] ++tn_info(n)->empty_children ? : ++tn_info(n)->full_children; Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-13/+1
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Plumb support for nexthop object in a fib_infoDavid Ahern1-0/+7
Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the fib_info side of the nexthop <-> fib_info relationship. Add fi_list list_head to 'struct nexthop' to track fib_info entries using a nexthop instance. Add __remove_nexthop_fib and add it to __remove_nexthop to walk the new list_head and mark those fib entries as dead when the nexthop is deleted. Add a few nexthop helpers for use when a nexthop is added to fib_info: - nexthop_cmp to determine if 2 nexthops are the same - nexthop_path_fib_result to select a path for a multipath 'struct nexthop' - nexthop_fib_nhc to select a specific fib_nh_common within a multipath 'struct nexthop' Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop case. Update the fib_info functions to check for fi->nh and take a different path as needed: - free_fib_info_rcu - put the nexthop object reference - fib_release_info - remove the fib_info from the nexthop's fi_list - nh_comp - use nexthop_cmp when either fib_info references a nexthop object - fib_info_hashfn - use the nexthop id for the hashing vs the oif of each fib_nh in a fib_info - fib_nlmsg_size - add space for the RTA_NH_ID attribute - fib_create_info - verify nexthop reference can be taken, verify nexthop spec is valid for fib entry, and add fib_info to fi_list for a nexthop - fib_select_multipath - use the new nexthop_path_fib_result to select a path when nexthop objects are used - fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat it the same as a fib entry using 'blackhole' The bulk of the changes are in fib_semantics.c and most of that is moving the existing change_nexthops into an else branch. Update the nexthop code to walk fi_list on a nexthop deleted to remove fib entries referencing it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Prepare for fib6_nh from a nexthop objectDavid Ahern1-6/+9
Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes to use a fib6_nh based nexthop. In the end, only code not using a nexthop object in a fib_info should directly access fib_nh in a fib_info without checking the famiy and going through fib_nh_common. Those functions will be marked when it is not directly evident. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Use accessors for fib_info nexthop dataDavid Ahern1-10/+16
Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the fib_dev macro which is an alias for the first nexthop. Replacements: fi->fib_dev --> fib_info_nh(fi, 0)->fib_nh_dev fi->fib_nh --> fib_info_nh(fi, 0) fi->fib_nh[i] --> fib_info_nh(fi, i) fi->fib_nhs --> fib_info_num_path(fi) where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path returns fi->fib_nhs. Move the existing fib_info_nhc to nexthop.h and define the new ones there. A later patch adds a check if a fib_info uses a nexthop object, and defining the helpers in nexthop.h avoid circular header dependencies. After this all remaining open coded references to fi->fib_nhs and fi->fib_nh are in: - fib_create_info and helpers used to lookup an existing fib_info entry, and - the netdev event functions fib_sync_down_dev and fib_sync_up. The latter two will not be reused for nexthops, and the fib_create_info will be updated to handle a nexthop in a fib_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner1-13/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-26ipv4: remove redundant assignment to nColin Ian King1-1/+0
The pointer n is being assigned a value however this value is never read in the code block and the end of the code block continues to the next loop iteration. Clean up the code by removing the redundant assignment. Fixes: 1bff1a0c9bbda ("ipv4: Add function to send route updates") Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22ipv4: Add function to send route updatesDavid Ahern1-0/+72
Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE notifications with NLM_F_REPLACE set for entries linked to a fib_info that have nh_updated flag set. This helper will be used by the nexthop code to notify userspace of routes that are impacted when a nexthop config is updated via replace. The new function and its helper are similar to how fib_flush and fib_table_flush work for address delete and link down events. This notification is needed for legacy apps that do not understand the new nexthop object. Apps that are nexthop aware can use the RTA_NH_ID attribute in the route notification to just ignore it. In the future this should be wrapped in a sysctl to allow OS'es that are fully updated to avoid the notificaton storm. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Add fib_nh_common to fib_resultDavid Ahern1-6/+7
Most of the ipv4 code only needs data from fib_nh_common. Add fib_nh_common selection to fib_result and update users to use it. Right now, fib_nh_common in fib_result will point to a fib_nh struct that is embedded within a fib_info: fib_info --> fib_nh fib_nh ... fib_nh ^ fib_result->nhc ----+ Later, nhc can point to a fib_nh within a nexthop struct: fib_info --> nexthop --> fib_nh ^ fib_result->nhc ---------------+ or for a nexthop group: fib_info --> nexthop --> nexthop --> fib_nh nexthop --> fib_nh ... nexthop --> fib_nh ^ fib_result->nhc ---------------------------+ In all cases nhsel within fib_result will point to which leg in the multipath route is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Update fib_table_lookup tracepoint to take common nexthopDavid Ahern1-1/+1
Update fib_table_lookup tracepoint to take a fib_nh_common struct and dump the v6 gateway address if the nexthop uses it. Over the years saddr has not proven useful and the output of the tracepoint produces very long lines. Since saddr is not part of fib_nh_common, drop it. If it needs to be added later, fib_nh which contains saddr can be obtained from a fib_nh_common via container_of. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv4: Rename fib_nh entriesDavid Ahern1-6/+6
Rename fib_nh entries that will be moved to a fib_nh_common struct. Specifically, the device, oif, gateway, flags, scope, lwtstate, nh_weight and nh_upper_bound are common with all nexthop definitions. In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really long lines. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv4: Move IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN to helperDavid Ahern1-3/+1
in_dev lookup followed by IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN check is called in several places, some with the rcu lock and others with the rtnl held. Move the check to a helper similar to what IPv6 has. Since the helper can be invoked from either context use rcu_dereference_rtnl to dereference ip_ptr. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21ipv4: Allow amount of dirty memory from fib resizing to be controllableDavid Ahern1-6/+8
fib_trie implementation calls synchronize_rcu when a certain amount of pages are dirty from freed entries. The number of pages was determined experimentally in 2009 (commit c3059477fce2d). At the current setting, synchronize_rcu is called often -- 51 times in a second in one test with an average of an 8 msec delay adding a fib entry. The total impact is a lot of slow down modifying the fib. This is seen in the output of 'time' - the difference between real time and sys+user. For example, using 720,022 single path routes and 'ip -batch'[1]: $ time ./ip -batch ipv4/routes-1-hops real 0m14.214s user 0m2.513s sys 0m6.783s So roughly 35% of the actual time to install the routes is from the ip command getting scheduled out, most notably due to synchronize_rcu (this is observed using 'perf sched timehist'). This patch makes the amount of dirty memory configurable between 64k where the synchronize_rcu is called often (small, low end systems that are memory sensitive) to 64M where synchronize_rcu is called rarely during a large FIB change (for high end systems with lots of memory). The default is 512kB which corresponds to the current setting of 128 pages with a 4kB page size. As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu in a second blocking for up to 30 msec in a single instance, and a total of almost 100 msec across the 4 calls in the second. The trade off is allowing FIB entries to consume more memory in a given time window but but with much better fib insertion rates (~30% increase in prefixes/sec). With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch file runs in: $ time ./ip -batch ipv4/routes-1-hops real 0m9.692s user 0m2.491s sys 0m6.769s So the dead time is reduced to about 1/2 second or <5% of the real time. [1] 'ip' modified to not request ACK messages which improves route insertion times by about 20% Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15net: ipv4: Fix memory leak in network namespace dismantleIdo Schimmel1-3/+12
IPv4 routing tables are flushed in two cases: 1. In response to events in the netdev and inetaddr notification chains 2. When a network namespace is being dismantled In both cases only routes associated with a dead nexthop group are flushed. However, a nexthop group will only be marked as dead in case it is populated with actual nexthops using a nexthop device. This is not the case when the route in question is an error route (e.g., 'blackhole', 'unreachable'). Therefore, when a network namespace is being dismantled such routes are not flushed and leaked [1]. To reproduce: # ip netns add blue # ip -n blue route add unreachable 192.0.2.0/24 # ip netns del blue Fix this by not skipping error routes that are not marked with RTNH_F_DEAD when flushing the routing tables. To prevent the flushing of such routes in case #1, add a parameter to fib_table_flush() that indicates if the table is flushed as part of namespace dismantle or not. Note that this problem does not exist in IPv6 since error routes are associated with the loopback device. [1] unreferenced object 0xffff888066650338 (size 56): comm "ip", pid 1206, jiffies 4294786063 (age 26.235s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba.... e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............ backtrace: [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220 [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20 [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380 [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690 [<0000000014f62875>] netlink_sendmsg+0x929/0xe10 [<00000000bac9d967>] sock_sendmsg+0xc8/0x110 [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0 [<000000002e94f880>] __sys_sendmsg+0xf7/0x250 [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610 [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<000000003a8b605b>] 0xffffffffffffffff unreferenced object 0xffff888061621c88 (size 48): comm "ip", pid 1206, jiffies 4294786063 (age 26.235s) hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_.... backtrace: [<00000000733609e3>] fib_table_insert+0x978/0x1500 [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220 [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20 [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380 [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690 [<0000000014f62875>] netlink_sendmsg+0x929/0xe10 [<00000000bac9d967>] sock_sendmsg+0xc8/0x110 [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0 [<000000002e94f880>] __sys_sendmsg+0xf7/0x250 [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610 [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<000000003a8b605b>] 0xffffffffffffffff Fixes: 8cced9eff1d4 ("[NETNS]: Enable routing configuration in non-initial namespace.") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16net/ipv4: Plumb support for filtering route dumpsDavid Ahern1-11/+26
Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-5/+9
Pull networking updates from David Miller: 1) Add Maglev hashing scheduler to IPVS, from Inju Song. 2) Lots of new TC subsystem tests from Roman Mashak. 3) Add TCP zero copy receive and fix delayed acks and autotuning with SO_RCVLOWAT, from Eric Dumazet. 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard Brouer. 5) Add ttl inherit support to vxlan, from Hangbin Liu. 6) Properly separate ipv6 routes into their logically independant components. fib6_info for the routing table, and fib6_nh for sets of nexthops, which thus can be shared. From David Ahern. 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP messages from XDP programs. From Nikita V. Shirokov. 8) Lots of long overdue cleanups to the r8169 driver, from Heiner Kallweit. 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau. 10) Add traffic condition monitoring to iwlwifi, from Luca Coelho. 11) Plumb extack down into fib_rules, from Roopa Prabhu. 12) Add Flower classifier offload support to igb, from Vinicius Costa Gomes. 13) Add UDP GSO support, from Willem de Bruijn. 14) Add documentation for eBPF helpers, from Quentin Monnet. 15) Add TLS tx offload to mlx5, from Ilya Lesokhin. 16) Allow applications to be given the number of bytes available to read on a socket via a control message returned from recvmsg(), from Soheil Hassas Yeganeh. 17) Add x86_32 eBPF JIT compiler, from Wang YanQing. 18) Add AF_XDP sockets, with zerocopy support infrastructure as well. From Björn Töpel. 19) Remove indirect load support from all of the BPF JITs and handle these operations in the verifier by translating them into native BPF instead. From Daniel Borkmann. 20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha. 21) Allow XDP programs to do lookups in the main kernel routing tables for forwarding. From David Ahern. 22) Allow drivers to store hardware state into an ELF section of kernel dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy. 23) Various RACK and loss detection improvements in TCP, from Yuchung Cheng. 24) Add TCP SACK compression, from Eric Dumazet. 25) Add User Mode Helper support and basic bpfilter infrastructure, from Alexei Starovoitov. 26) Support ports and protocol values in RTM_GETROUTE, from Roopa Prabhu. 27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard Brouer. 28) Add lots of forwarding selftests, from Petr Machata. 29) Add generic network device failover driver, from Sridhar Samudrala. * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits) strparser: Add __strp_unpause and use it in ktls. rxrpc: Fix terminal retransmission connection ID to include the channel net: hns3: Optimize PF CMDQ interrupt switching process net: hns3: Fix for VF mailbox receiving unknown message net: hns3: Fix for VF mailbox cannot receiving PF response bnx2x: use the right constant Revert "net: sched: cls: Fix offloading when ingress dev is vxlan" net: dsa: b53: Fix for brcm tag issue in Cygnus SoC enic: fix UDP rss bits netdev-FAQ: clarify DaveM's position for stable backports rtnetlink: validate attributes in do_setlink() mlxsw: Add extack messages for port_{un, }split failures netdevsim: Add extack error message for devlink reload devlink: Add extack to reload and port_{un, }split operations net: metrics: add proper netlink validation ipmr: fix error path when ipmr_new_table fails ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds net: hns3: remove unused hclgevf_cfg_func_mta_filter netfilter: provide udp*_lib_lookup for nf_tproxy qed*: Utilize FW 8.37.2.0 ...
2018-05-24net/ipv4: Udate fib_table_lookup tracepointDavid Ahern1-5/+9
Commit 4a2d73a4fb36 ("ipv4: fib_rules: support match on sport, dport and ip proto") added support for protocol and ports to FIB rules. Update the FIB lookup tracepoint to dump the parameters. In addition, make the IPv4 tracepoint similar to the IPv6 one where the lookup parameters and result are dumped in 1 event. It is much easier to use and understand the outcome of the lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16proc: introduce proc_create_net_singleChristoph Hellwig1-14/+2
Variant of proc_create_data that directly take a seq_file show callback and deals with network namespaces in ->open and ->release. All callers of proc_create + single_open_net converted over, and single_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16proc: introduce proc_create_net{,_data}Christoph Hellwig1-28/+4
Variants of proc_create{,_data} that directly take a struct seq_operations and deal with network namespaces in ->open and ->release. All callers of proc_create + seq_open_net converted over, and seq_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-03-29net/ipv4: Allow notifier to fail route replaceDavid Ahern1-2/+7
Add checking to call to call_fib_entry_notifiers for IPv4 route replace. Allows a notifier handler to fail the replace. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29net/ipv4: Move call_fib_entry_notifiers up for new routesDavid Ahern1-2/+16
Move call to call_fib_entry_notifiers for new IPv4 routes to right before the call to fib_insert_alias. At this point the only remaining failure path is memory allocations in fib_insert_node. Handle that very unlikely failure with a call to call_fib_entry_notifiers to tell drivers about it. At this point notifier handlers can decide the fate of the new route with a clean path to delete the potential new entry if the notifier returns non-0. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26net: Use octal not symbolic permissionsJoe Perches1-3/+3
Prefer the direct use of octal for permissions. Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace and some typing. Miscellanea: o Whitespace neatening around these conversions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26net: make kmem caches as __ro_after_initAlexey Dobriyan1-2/+3
All kmem caches aren't reallocated once set up. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16net: delete /proc THIS_MODULE referencesAlexey Dobriyan1-3/+0
/proc has been ignoring struct file_operations::owner field for 10 years. Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where inode->i_fop is initialized with proxy struct file_operations for regular files: - if (de->proc_fops) - inode->i_fop = de->proc_fops; + if (de->proc_fops) { + if (S_ISREG(inode->i_mode)) + inode->i_fop = &proc_reg_file_ops; + else + inode->i_fop = de->proc_fops; + } VFS stopped pinning module at this point. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01net: Add extack to fib_notifier_infoDavid Ahern1-5/+8
Add extack to fib_notifier_info and plumb through stack to call_fib_rule_notifiers, call_fib_entry_notifiers and call_fib6_entry_notifiers. This allows notifer handlers to return messages to user. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20net: ipv4: Change fib notifiers to take a fib_aliasDavid Ahern1-24/+15
All of the notifier data (fib_info, tos, type and table id) are contained in the fib_alias. Pass it to the notifier instead of each data separately shortening the argument list by 3. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23ipv4: do metrics match when looking up and deleting a routeXin Long1-1/+2
Now when ipv4 route inserts a fib_info, it memcmp fib_metrics. It means ipv4 route identifies one route also with metrics. But when removing a route, it tries to find the route without caring about the metrics. It will cause that the route with right metrics can't be removed. Thomas noticed this issue when doing the testing: 1. add: # ip route append 192.168.7.0/24 dev v window 1000 # ip route append 192.168.7.0/24 dev v window 1001 # ip route append 192.168.7.0/24 dev v window 1002 # ip route append 192.168.7.0/24 dev v window 1003 2. delete: # ip route delete 192.168.7.0/24 dev v window 1002 3. show: 192.168.7.0/24 proto boot scope link window 1001 192.168.7.0/24 proto boot scope link window 1002 192.168.7.0/24 proto boot scope link window 1003 The one with window 1002 wasn't deleted but the first one was. This patch is to do metrics match when looking up and deleting one route. Reported-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03net: core: Make the FIB notification chain genericIdo Schimmel1-2/+3
The FIB notification chain is currently soley used by IPv4 code. However, we're going to introduce IPv6 FIB offload support, which requires these notification as well. As explained in commit c3852ef7f2f8 ("ipv4: fib: Replay events when registering FIB notifier"), upon registration to the chain, the callee receives a full dump of the FIB tables and rules by traversing all the net namespaces. The integrity of the dump is ensured by a per-namespace sequence counter that is incremented whenever a change to the tables or rules occurs. In order to allow more address families to use the chain, each family is expected to register its fib_notifier_ops in its pernet init. These operations allow the common code to read the family's sequence counter as well as dump its tables and rules in the given net namespace. Additionally, a 'family' parameter is added to sent notifications, so that listeners could distinguish between the different families. Implement the common code that allows listeners to register to the chain and for address families to register their fib_notifier_ops. Subsequent patches will implement these operations in IPv6. In the future, ipmr and ip6mr will be extended to provide these notifications as well. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04net, ipv4: convert fib_info.fib_clntref from atomic_t to refcount_tReshetova, Elena1-1/+1
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: add extack arg to lwtunnel build stateDavid Ahern1-1/+1
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: ipv4: Add extack message for invalid prefix or lengthDavid Ahern1-6/+11
Add extack error message for invalid prefix length and invalid prefix. Example of the latter is a route spec containing 172.16.100.1/24, where the /24 mask means the lower 8-bits should be 0. Amazing how easy that one is to overlook when an EINVAL is returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>