aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-07-12tcp: use monotonic timestamps for PAWSArnd Bergmann3-6/+7
Using get_seconds() for timestamps is deprecated since it can lead to overflows on 32-bit systems. While the interface generally doesn't overflow until year 2106, the specific implementation of the TCP PAWS algorithm breaks in 2038 when the intermediate signed 32-bit timestamps overflow. A related problem is that the local timestamps in CLOCK_REALTIME form lead to unexpected behavior when settimeofday is called to set the system clock backwards or forwards by more than 24 days. While the first problem could be solved by using an overflow-safe method of comparing the timestamps, a nicer solution is to use a monotonic clocksource with ktime_get_seconds() that simply doesn't overflow (at least not until 136 years after boot) and that doesn't change during settimeofday(). To make 32-bit and 64-bit architectures behave the same way here, and also save a few bytes in the tcp_options_received structure, I'm changing the type to a 32-bit integer, which is now safe on all architectures. Finally, the ts_recent_stamp field also (confusingly) gets used to store a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow(). This is currently safe, but changing the type to 32-bit requires some small changes there to keep it working. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12net/tls: Use aead_request_alloc/free for request alloc/freeVakul Garg1-8/+4
Instead of kzalloc/free for aead_request allocation and free, use functions aead_request_alloc(), aead_request_free(). It ensures that any sensitive crypto material held in crypto transforms is securely erased from memory. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: check session number before accepting link protocol messagesJon Maloy3-22/+52
In some virtual environments we observe a significant higher number of packet reordering and delays than we have been used to traditionally. This makes it necessary with stricter checks on incoming link protocol messages' session number, which until now only has been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating if a link has a valid peer session number or not. This eliminates the mixing of 32- and 16-bit arithmethics we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: add sequence number check for link STATE messagesJon Maloy4-6/+32
Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller4-31/+173
Jeff Kirsher says: ==================== L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09 This patch series is meant to allow support for the L2 forward offload, aka MACVLAN offload without the need for using ndo_select_queue. The existing solution currently requires that we use ndo_select_queue in the transmit path if we want to associate specific Tx queues with a given MACVLAN interface. In order to get away from this we need to repurpose the tc_to_txq array and XPS pointer for the MACVLAN interface and use those as a means of accessing the queues on the lower device. As a result we cannot offload a device that is configured as multiqueue, however it doesn't really make sense to configure a macvlan interfaced as being multiqueue anyway since it doesn't really have a qdisc of its own in the first place. The big changes in this set are: Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN Disable XPS for single queue devices Replace accel_priv with sb_dev in ndo_select_queue Add sb_dev parameter to fallback function for ndo_select_queue Consolidated ndo_select_queue functions that appeared to be duplicates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tcp: expose both send and receive intervals for rate sampleDeepti Raghavan1-0/+4
Congestion control algorithms, which access the rate sample through the tcp_cong_control function, only have access to the maximum of the send and receive interval, for cases where the acknowledgment rate may be inaccurate due to ACK compression or decimation. Algorithms may want to use send rates and receive rates as separate signals. Signed-off-by: Deepti Raghavan <deeptir@mit.edu> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: sched: fix unprotected access to rcu cookie pointerVlad Buslov1-2/+7
Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net: sched: act_ife: fix memory leak in ife initVlad Buslov1-1/+3
Free params if tcf_idr_check_alloc() returned error. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devicesSabrina Dubroca1-0/+12
This aligns the addr_gen_mode sysctl with the expected behavior of the "all" variant. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODESabrina Dubroca1-1/+3
inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_devSabrina Dubroca1-2/+0
The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/ipv6: fix addrconf_sysctl_addr_gen_modeSabrina Dubroca1-13/+14
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occuring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11net/sched: flower: Fix null pointer dereference when run tc vlan commandJianbo Liu1-22/+26
Zahari issued tc vlan command without setting vlan_ethtype, which will crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before use it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Conditionally split GSO segmentsToke Høiland-Jørgensen1-26/+73
At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add overhead compensation support to the rate shaperToke Høiland-Jørgensen1-1/+123
This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add DiffServ handlingToke Høiland-Jørgensen1-16/+423
This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add NAT awareness to packet classifierToke Høiland-Jørgensen1-2/+49
When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10netfilter: Add nf_ct_get_tuple_skb global lookup functionToke Høiland-Jørgensen2-0/+51
This adds a global netfilter function to extract a conntrack tuple from an skb. The function uses a new function added to nf_ct_hook, which will try to get the tuple from skb->_nfct, and do a full lookup if that fails. This makes it possible to use the lookup function before the skb has passed through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is copied to the caller to avoid issues with reference counting. The function returns false if conntrack is not loaded, allowing it to be used without incurring a module dependency on conntrack. This is used by the NAT mode in sch_cake. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add optional ACK filterToke Høiland-Jørgensen1-2/+452
The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the *upstream* direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs *fewer* bytes than the new packet being enqueued, including any SACK options. This prevents duplicate ACKs from being filtered, to avoid interfering with retransmission logic. In addition, we check TCP header options and only drop those that are known to not interfere with sender state. In particular, packets with unknown option codes are never dropped. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue. This is the rationale for including the ACK filter in CAKE itself rather than as separate module (as the TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In *really* pathological cases, the effect can be a lot more; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sch_cake: Add ingress modeToke Høiland-Jørgensen1-4/+83
The ingress mode is meant to be enabled when CAKE runs downlink of the actual bottleneck (such as on an IFB device). The mode changes the shaper to also account dropped packets to the shaped rate, as these have already traversed the bottleneck. Enabling ingress mode will also tune the AQM to always keep at least two packets queued *for each flow*. This is done by scaling the minimum queue occupancy level that will disable the AQM by the number of active bulk flows. The rationale for this is that retransmits are more expensive in ingress mode, since dropped packets have to traverse the bottleneck again when they are retransmitted; thus, being more lenient and keeping a minimum number of packets queued will improve throughput in cases where the number of active flows are so large that they saturate the bottleneck even at their minimum window size. This commit also adds a separate switch to enable ingress mode rate autoscaling. If enabled, the autoscaling code will observe the actual traffic rate and adjust the shaper rate to match it. This can help avoid latency increases in the case where the actual bottleneck rate decreases below the shaped rate. The scaling filters out spikes by an EWMA filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10sched: Add Common Applications Kept Enhanced (cake) qdiscToke Høiland-Jørgensen3-0/+1879
sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * An deficit based shaper, that can also be used in an unlimited mode. * 8 way set associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring, loss, ecn markings, latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions baking have been available as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later. sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list. tc -s qdisc show dev eth2 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12) backlog 0b 0p requeues 12 memory used: 1053008b of 15140Kb capacity estimate: 970Mbit min/max network layer size: 28 / 1500 min/max overhead-adjusted size: 84 / 1538 average network hdr offset: 14 Bulk Best Effort Voice thresh 62500Kbit 1Gbit 250Mbit target 5.0ms 5.0ms 5.0ms interval 100.0ms 100.0ms 100.0ms pk_delay 5us 5us 6us av_delay 3us 2us 2us sp_delay 2us 1us 1us backlog 0b 0b 0b pkts 3164050 25030267 9530280 bytes 3227519915 35396974782 12879808898 way_inds 0 8 0 way_miss 21 366 25 way_cols 0 0 0 drops 5 0 1 marks 0 0 0 ack_drop 0 0 0 sp_flows 1 3 0 bk_flows 0 1 1 un_flows 0 0 0 max_len 68130 68130 68130 Tested-by: Pete Heist <peteheist@gmail.com> Tested-by: Georgios Amanakis <gamanakis@gmail.com> Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09tcp: remove SG-related comment in tcp_sendmsg()Julian Wiedmann1-3/+0
Since commit 74d4a8f8d378 ("tcp: remove sk_can_gso() use"), the code doesn't care whether the interface supports SG. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09net: core: fix use-after-free in __netif_receive_skb_list_coreEdward Cree1-2/+7
__netif_receive_skb_core can free the skb, so we have to use the dequeue- enqueue model when calling it from __netif_receive_skb_list_core. Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09net: core: fix uses-after-free in list processingEdward Cree1-8/+13
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(), do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus, we cannot wait until after they return to remove the skb from the list; instead, we remove it first and, in the pass case, add it to a sublist afterwards. In the case of enqueue_to_backlog() we have already decided not to pass when we call the function, so we do not need a sublist. Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09net: allow fallback function to pass netdevAlexander Duyck2-12/+7
For most of these calls we can just pass NULL through to the fallback function as the sb_dev. The only cases where we cannot are the cases where we might be dealing with either an upper device or a driver that would have configured things to support an sb_dev itself. The only driver that has any significant change in this patch set should be ixgbe as we can drop the redundant functionality that existed in both the ndo_select_queue function and the fallback function that was passed through to us. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09net: allow ndo_select_queue to pass netdevAlexander Duyck2-4/+6
This patch makes it so that instead of passing a void pointer as the accel_priv we instead pass a net_device pointer as sb_dev. Making this change allows us to pass the subordinate device through to the fallback function eventually so that we can keep the actual code in the ndo_select_queue call as focused on possible on the exception cases. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09net: Add generic ndo_select_queue functionsAlexander Duyck2-1/+15
This patch adds a generic version of the ndo_select_queue functions for either returning 0 or selecting a queue based on the processor ID. This is generally meant to just reduce the number of functions we have to change in the future when we have to deal with ndo_select_queue changes. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09net: Add support for subordinate traffic classes to netdev_pick_txAlexander Duyck1-23/+35
This change makes it so that we can support the concept of subordinate device traffic classes to the core networking code. In doing this we can start pulling out the driver specific bits needed to support selecting a queue based on an upper device. The solution at is currently stands is only partially implemented. I have the start of some XPS bits in here, but I would still need to allow for configuration of the XPS maps on the queues reserved for the subordinate devices. For now I am using the reference to the sb_dev XPS map as just a way to skip the lookup of the lower device XPS map for now as that would result in the wrong queue being picked. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09net: Add support for subordinate device traffic classesAlexander Duyck2-1/+109
This patch is meant to provide the basic tools needed to allow us to create subordinate device traffic classes. The general idea here is to allow subdividing the queues of a device into queue groups accessible through an upper device such as a macvlan. The idea here is to enforce the idea that an upper device has to be a single queue device, ideally with IFF_NO_QUQUE set. With that being the case we can pretty much guarantee that the tc_to_txq mappings and XPS maps for the upper device are unused. As such we could reuse those in order to support subdividing the lower device and distributing those queues between the subordinate devices. In order to distinguish between a regular set of traffic classes and if a device is carrying subordinate traffic classes I changed num_tc from a u8 to a s16 value and use the negative values to represent the subordinate pool values. So starting at -1 and running to -32768 we can encode those as pool values, and the existing values of 0 to 15 can be maintained. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09net-sysfs: Drop support for XPS and traffic_class on single queue deviceAlexander Duyck1-2/+13
This patch makes it so that we do not report the traffic class or allow XPS configuration on single queue devices. This is mostly to avoid unnecessary complexity with changes I have planned that will allow us to reuse the unused tc_to_txq and XPS configuration on a single queue device to allow it to make use of a subset of queues on an underlying device. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-08tcp: remove redundant SOCK_DONE checksEric Dumazet1-9/+5
In both tcp_splice_read() and tcp_recvmsg(), we already test sock_flag(sk, SOCK_DONE) right before evaluating sk->sk_state, so "!sock_flag(sk, SOCK_DONE)" is always true. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: Fix warnings from xchg() on RCU'd cookie pointer.David S. Miller1-1/+1
The kbuild test robot reports: >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@ expected struct tc_cookie [noderef] <asn:4>*__ret @@ got [noderef] <asn:4>*__ret @@ net/sched/act_api.c:71:15: expected struct tc_cookie [noderef] <asn:4>*__ret net/sched/act_api.c:71:15: got struct tc_cookie *new_cookie >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@ expected struct tc_cookie *old @@ got struct tc_cookie [noderef] <struct tc_cookie *old @@ net/sched/act_api.c:71:13: expected struct tc_cookie *old net/sched/act_api.c:71:13: got struct tc_cookie [noderef] <asn:4>*[assigned] __ret >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression Handle this in the usual way by force casting away the __rcu annotation when we are using xchg() on it. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: change action API to use array of pointers to actionsVlad Buslov2-54/+56
Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: atomically check-allocate actionVlad Buslov17-59/+213
Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: use reference counting action initVlad Buslov1-18/+17
Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: don't release reference on action overwriteVlad Buslov17-58/+50
Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of existing index(and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement reference counted action releaseVlad Buslov2-23/+62
Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: add 'delete' function to action opsVlad Buslov16-0/+136
Extend action ops with 'delete' function. Each action type to implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement action API that deletes action by indexVlad Buslov1-0/+39
Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: always take reference to actionVlad Buslov1-26/+20
Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it. (it can be destroyed concurrently) Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function Reference taken by idr check is temporal and should not be accounted by userspace clients (both logically and to preserver current API behavior). Subtract temporal reference when dumping action to userspace using existing tca_get_fill function arguments. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement unlocked action init APIVlad Buslov18-27/+46
Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire if afterwards. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: change type of reference and bind countersVlad Buslov17-42/+54
Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: use rcu for action cookie updateVlad Buslov1-14/+30
Implement functions to atomically update and free action cookie using rcu mechanism. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08openvswitch: kernel datapath clone actionYifeng Sun2-0/+106
Add 'clone' action to kernel datapath by using existing functions. When actions within clone don't modify the current flow, the flow key is not cloned before executing clone actions. This is a follow up patch for this incomplete work: https://patchwork.ozlabs.org/patch/722096/ v1 -> v2: Refactor as advised by reviewer. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07tipc: extend link reset criteria for stale packet retransmissionJon Maloy1-19/+24
Currently a link is declared stale and reset if there has been 100 repeated attempts to retransmit the same packet. However, in certain infrastructures we see that packet (NACK) duplicates and delays may cause such retransmit attempts to occur at a high rate, so that the peer doesn't have a reasonable chance to acknowledge the reception before the 100-limit is hit. This may take much less than the stipulated link tolerance time, and despite that probe/probe replies otherwise go through as normal. We now extend the criteria for link reset to also being time based. I.e., we don't reset the link until the link tolerance time is passed AND we have made 100 retransmissions attempts. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Add supprt for matching on QinQ vlan headersJianbo Liu1-14/+51
As support dissecting of QinQ inner and outer vlan headers, user can add rules to match on QinQ vlan headers. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Dump the ethertype encapsulated in vlanJianbo Liu1-0/+4
Currently the encapsulated ethertype is not dumped as it's the same as TCA_FLOWER_KEY_ETH_TYPE keyvalue. But the dumping result is inconsistent with input, we add dumping it with TCA_FLOWER_KEY_VLAN_ETH_TYPE. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/flow_dissector: Add support for QinQ dissectionJianbo Liu1-15/+17
Dissect the QinQ packets to get both outer and inner vlan information, then store to the extended flow keys. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Add support for matching on vlan ethertypeJianbo Liu1-2/+5
As flow dissector stores vlan ethertype, tc flower now can match on that. It is to make preparation for supporting QinQ. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/flow_dissector: Save vlan ethertype from headersJianbo Liu1-0/+2
Change vlan dissector key to save vlan tpid to support both 802.1Q and 802.1AD ethertype. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>