aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-01-16xdp: Use bulking for non-map XDP_REDIRECT and consolidate code pathsToke Høiland-Jørgensen1-71/+19
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device, we can re-use the bulking for the non-map version of the bpf_redirect() helper. This is a simple matter of having xdp_do_redirect_slow() queue the frame on the bulk queue instead of sending it out with __bpf_tx_xdp(). Unfortunately we can't make the bpf_redirect() helper return an error if the ifindex doesn't exit (as bpf_redirect_map() does), because we don't have a reference to the network namespace of the ingress device at the time the helper is called. So we have to leave it as-is and keep the device lookup in xdp_do_redirect_slow(). Since this leaves less reason to have the non-map redirect code in a separate function, so we get rid of the xdp_do_redirect_slow() function entirely. This does lose us the tracepoint disambiguation, but fortunately the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint entry structures. This means both can contain a map index, so we can just amend the tracepoint definitions so we always emit the xdp_redirect(_err) tracepoints, but with the map ID only populated if a map is present. This means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep the definitions around in case someone is still listening for them. With this change, the performance of the xdp_redirect sample program goes from 5Mpps to 8.4Mpps (a 68% increase). Since the flush functions are no longer map-specific, rename the flush() functions to drop _map from their names. One of the renamed functions is the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To keep from having to update all drivers, use a #define to keep the old name working, and only update the virtual drivers in this patch. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16xdp: Move devmap bulk queue into struct net_deviceToke Høiland-Jørgensen1-0/+2
Commit 96360004b862 ("xdp: Make devmap flush_list common for all map instances"), changed devmap flushing to be a global operation instead of a per-map operation. However, the queue structure used for bulking was still allocated as part of the containing map. This patch moves the devmap bulk queue into struct net_device. The motivation for this is reusing it for the non-map variant of XDP_REDIRECT, which will be changed in a subsequent commit. To avoid other fields of struct net_device moving to different cache lines, we also move a couple of other members around. We defer the actual allocation of the bulk queue structure until the NETDEV_REGISTER notification devmap.c. This makes it possible to check for ndo_xdp_xmit support before allocating the structure, which is not possible at the time struct net_device is allocated. However, we keep the freeing in free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER. Because of this change, we lose the reference back to the map that originated the redirect, so change the tracepoint to always return 0 as the map ID and index. Otherwise no functional change is intended with this patch. After this patch, the relevant part of struct net_device looks like this, according to pahole: /* --- cacheline 14 boundary (896 bytes) --- */ struct netdev_queue * _tx __attribute__((__aligned__(64))); /* 896 8 */ unsigned int num_tx_queues; /* 904 4 */ unsigned int real_num_tx_queues; /* 908 4 */ struct Qdisc * qdisc; /* 912 8 */ unsigned int tx_queue_len; /* 920 4 */ spinlock_t tx_global_lock; /* 924 4 */ struct xdp_dev_bulk_queue * xdp_bulkq; /* 928 8 */ struct xps_dev_maps * xps_cpus_map; /* 936 8 */ struct xps_dev_maps * xps_rxqs_map; /* 944 8 */ struct mini_Qdisc * miniq_egress; /* 952 8 */ /* --- cacheline 15 boundary (960 bytes) --- */ struct hlist_head qdisc_hash[16]; /* 960 128 */ /* --- cacheline 17 boundary (1088 bytes) --- */ struct timer_list watchdog_timer; /* 1088 40 */ /* XXX last struct has 4 bytes of padding */ int watchdog_timeo; /* 1128 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head todo_list; /* 1136 16 */ /* --- cacheline 18 boundary (1152 bytes) --- */ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
2020-01-15xsk: Support allocations of large umemsMagnus Karlsson1-3/+4
When registering a umem area that is sufficiently large (>1G on an x86), kmalloc cannot be used to allocate one of the internal data structures, as the size requested gets too large. Use kvmalloc instead that falls back on vmalloc if the allocation is too large for kmalloc. Also add accounting for this structure as it is triggered by a user space action (the XDP_UMEM_REG setsockopt) and it is by far the largest structure of kernel allocated memory in xsk. Reported-by: Ryan Goodfellow <rgoodfel@isi.edu> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/1578995365-7050-1-git-send-email-magnus.karlsson@intel.com
2020-01-14bpf: Return -EBADRQC for invalid map type in __bpf_tx_xdp_mapLi RongQing1-1/+1
A negative value should be returned if map->map_type is invalid although that is impossible now, but if we run into such situation in future, then xdpbuff could be leaked. Daniel Borkmann suggested: -EBADRQC should be returned to stay consistent with generic XDP for the tracepoint output and not to be confused with -EOPNOTSUPP from other locations like dev_map_enqueue() when ndo_xdp_xmit is missing and such. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1578618277-18085-1-git-send-email-lirongqing@baidu.com
2020-01-09bpf: Add BPF_FUNC_tcp_send_ack helperMartin KaFai Lau1-1/+23
Add a helper to send out a tcp-ack. It will be used in the later bpf_dctcp implementation that requires to send out an ack when the CE state changed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
2020-01-09bpf: tcp: Support tcp_congestion_ops in bpfMartin KaFai Lau7-15/+251
This patch makes "struct tcp_congestion_ops" to be the first user of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops in bpf. The BPF implemented tcp_congestion_ops can be used like regular kernel tcp-cc through sysctl and setsockopt. e.g. [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic net.ipv4.tcp_congestion_control = bpf_cubic There has been attempt to move the TCP CC to the user space (e.g. CCP in TCP). The common arguments are faster turn around, get away from long-tail kernel versions in production...etc, which are legit points. BPF has been the continuous effort to join both kernel and userspace upsides together (e.g. XDP to gain the performance advantage without bypassing the kernel). The recent BPF advancements (in particular BTF-aware verifier, BPF trampoline, BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc) possible in BPF. It allows a faster turnaround for testing algorithm in the production while leveraging the existing (and continue growing) BPF feature/framework instead of building one specifically for userspace TCP CC. This patch allows write access to a few fields in tcp-sock (in bpf_tcp_ca_btf_struct_access()). The optional "get_info" is unsupported now. It can be added later. One possible way is to output the info with a btf-id to describe the content. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-05net: dsa: Pass pcs_poll flag from driver to PHYLINKVladimir Oltean1-0/+1
The DSA drivers that implement .phylink_mac_link_state should normally register an interrupt for the PCS, from which they should call phylink_mac_change(). However not all switches implement this, and those who don't should set this flag in dsa_switch in the .setup callback, so that PHYLINK will poll for a few ms until the in-band AN link timer expires and the PCS state settles. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05net: dsa: tag_sja1105: Slightly improve the Xmas tree in sja1105_xmitVladimir Oltean1-2/+1
This is a cosmetic patch that makes the dp, tx_vid, queue_mapping and pcp local variable definitions a bit closer in length, so they don't look like an eyesore as much. The 'ds' variable is not used otherwise, except for ds->dp. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05net: dsa: Make deferred_xmit private to sja1105Vladimir Oltean3-39/+15
There are 3 things that are wrong with the DSA deferred xmit mechanism: 1. Its introduction has made the DSA hotpath ever so slightly more inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs to be initialized to false for every transmitted frame, in order to figure out whether the driver requested deferral or not (a very rare occasion, rare even for the only driver that does use this mechanism: sja1105). That was necessary to avoid kfree_skb from freeing the skb. 2. Because L2 PTP is a link-local protocol like STP, it requires management routes and deferred xmit with this switch. But as opposed to STP, the deferred work mechanism needs to schedule the packet rather quickly for the TX timstamp to be collected in time and sent to user space. But there is no provision for controlling the scheduling priority of this deferred xmit workqueue. Too bad this is a rather specific requirement for a feature that nobody else uses (more below). 3. Perhaps most importantly, it makes the DSA core adhere a bit too much to the NXP company-wide policy "Innovate Where It Doesn't Matter". The sja1105 is probably the only DSA switch that requires some frames sent from the CPU to be routed to the slave port via an out-of-band configuration (register write) rather than in-band (DSA tag). And there are indeed very good reasons to not want to do that: if that out-of-band register is at the other end of a slow bus such as SPI, then you limit that Ethernet flow's throughput to effectively the throughput of the SPI bus. So hardware vendors should definitely not be encouraged to design this way. We do _not_ want more widespread use of this mechanism. Luckily we have a solution for each of the 3 issues: For 1, we can just remove that variable in the skb->cb and counteract the effect of kfree_skb with skb_get, much to the same effect. The advantage, of course, being that anybody who doesn't use deferred xmit doesn't need to do any extra operation in the hotpath. For 2, we can create a kernel thread for each port's deferred xmit work. If the user switch ports are named swp0, swp1, swp2, the kernel threads will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15 character length limit on kernel thread names). With this, the user can change the scheduling priority with chrt $(pidof swp2_xmit). For 3, we can actually move the entire implementation to the sja1105 driver. So this patch deletes the generic implementation from the DSA core and adds a new one, more adequate to the requirements of PTP TX timestamping, in sja1105_main.c. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03l2tp: Remove redundant BUG_ON() check in l2tp_pernetXu Wang1-2/+0
Passing NULL to l2tp_pernet causes a crash via BUG_ON. Dereferencing net in net_generic() also has the same effect. This patch removes the redundant BUG_ON check on the same parameter. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03net: Remove redundant BUG_ON() check in phonet_pernetXu Wang1-2/+0
Passing NULL to phonet_pernet causes a crash via BUG_ON. Dereferencing net in net_generic() also has the same effect. This patch removes the redundant BUG_ON check on the same parameter. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03net: remove the check argument from __skb_gro_checksum_convertLi RongQing3-3/+3
The argument is always ignored, so remove it. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03ethtool: remove set but not used variable 'lsettings'YueHaibing1-2/+0
Fixes gcc '-Wunused-but-set-variable' warning: net/ethtool/linkmodes.c: In function 'ethnl_set_linkmodes': net/ethtool/linkmodes.c:326:32: warning: variable 'lsettings' set but not used [-Wunused-but-set-variable] struct ethtool_link_settings *lsettings; ^ It is never used, so remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02tcp: use REXMIT_NEW instead of magic numberMao Wenan1-1/+1
REXMIT_NEW is a macro for "FRTO-style transmit of unsent/new packets", this patch makes it more readable. Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02net: Add device index to tcp_md5sigDavid Ahern2-1/+37
Add support for userspace to specify a device index to limit the scope of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad is renamed to tcpm_ifindex and the new field is only checked if the new TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index must point to an L3 master device (e.g., VRF). The API and error handling are setup to allow the constraint to be relaxed in the future to any device index. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02tcp: Add l3index to tcp_md5sig_key and md5 functionsDavid Ahern2-39/+101
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key. With the key now based on an l3index, add the new parameter to the lookup functions and consider the l3index when looking for a match. The l3index comes from the skb when processing ingress packets leveraging the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the v6 variants). When the sdif index is set it means the packet ingressed a device that is part of an L3 domain and inet_iif points to the VRF device. For egress, the L3 domain is determined from the socket binding and sk_bound_dev_if. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02ipv4/tcp: Pass dif and sdif to tcp_v4_inbound_md5_hashDavid Ahern1-5/+8
The original ingress device index is saved to the cb space of the skb and the cb is moved during tcp processing. Since tcp_v4_inbound_md5_hash can be called before and after the cb move, pass dif and sdif to it so the caller can save both prior to the cb move. Both are used by a later patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hashDavid Ahern1-6/+9
The original ingress device index is saved to the cb space of the skb and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash can be called before and after the cb move, pass dif and sdif to it so the caller can save both prior to the cb move. Both are used by a later patch. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02ipv4/tcp: Use local variable for tcp_md5_addrDavid Ahern1-17/+26
Extract the typecast to (union tcp_md5_addr *) to a local variable rather than the current long, inline declaration with function calls. No functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02page_pool: help compiler remove code in case CONFIG_NUMA=nJesper Dangaard Brouer1-0/+9
When kernel is compiled without NUMA support, then page_pool NUMA config setting (pool->p.nid) doesn't make any practical sense. The compiler cannot see that it can remove the code paths. This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA, in allocation and numa check code, which helps compiler to see the optimisation potential. It leaves update code intact to keep API the same. $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \ net/core/page_pool.o-numa-disabled add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113) Function old new delta page_pool_create 401 398 -3 __page_pool_alloc_pages_slow 439 426 -13 page_pool_refill_alloc_cache 425 328 -97 Total: Before=3611, After=3498, chg -3.13% Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02page_pool: handle page recycle for NUMA_NO_NODE conditionJesper Dangaard Brouer1-19/+61
The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE. The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't recycle non-reusable pages"), were to have RX-pages that belongs to the same NUMA node as the CPU processing RX-packet during softirq/NAPI. As illustrated by the performance measurements. This patch moves the NAPI checks out of fast-path, and at the same time solves the NUMA_NO_NODE issue. First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used as the the preferred nid. It is only in rare situations, where e.g. NUMA zone runs dry, that page gets doesn't get allocated from preferred nid. The page_pool API allows drivers to control the nid themselves via controlling pool->p.nid. This patch moves the NAPI check to when alloc cache is refilled, via dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing pages from remote NUMA into the ptr_ring, as the dequeue/consume step will check the NUMA node. All current drivers using page_pool will alloc/refill RX-ring from same CPU running softirq/NAPI process. Drivers that control the nid explicitly, also use page_pool_update_nid when changing nid runtime. To speed up transision to new nid the alloc cache is now flushed on nid changes. This force pages to come from ptr_ring, which does the appropate nid check. For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA node, we accept that transitioning the alloc cache doesn't happen immediately. The preferred nid change runtime via consulting numa_mem_id() based on the CPU processing RX-packets. Notice, to avoid stressing the page buddy allocator and avoid doing too much work under softirq with preempt disabled, the NUMA check at ptr_ring dequeue will break the refill cycle, when detecting a NUMA mismatch. This will cause a slower transition, but its done on purpose. Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages") Reported-by: Li RongQing <lirongqing@baidu.com> Reported-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller41-295/+330
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h handling. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()Taehee Yoo1-1/+2
hsr slave interfaces don't have debugfs directory. So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name is changed. Test commands: ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 ip link set dummy0 name ap Splat looks like: [21071.899367][T22666] ap: renamed from dummy0 [21071.914005][T22666] ================================================================== [21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666 [21071.926941][T22666] [21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240 [21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [21071.935094][T22666] Call Trace: [21071.935867][T22666] dump_stack+0x96/0xdb [21071.936687][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.937774][T22666] print_address_description.constprop.5+0x1be/0x360 [21071.939019][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.940081][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.940949][T22666] __kasan_report+0x12a/0x16f [21071.941758][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.942674][T22666] kasan_report+0xe/0x20 [21071.943325][T22666] hsr_debugfs_rename+0xaa/0xb0 [hsr] [21071.944187][T22666] hsr_netdev_notify+0x1fe/0x9b0 [hsr] [21071.945052][T22666] ? __module_text_address+0x13/0x140 [21071.945897][T22666] notifier_call_chain+0x90/0x160 [21071.946743][T22666] dev_change_name+0x419/0x840 [21071.947496][T22666] ? __read_once_size_nocheck.constprop.6+0x10/0x10 [21071.948600][T22666] ? netdev_adjacent_rename_links+0x280/0x280 [21071.949577][T22666] ? __read_once_size_nocheck.constprop.6+0x10/0x10 [21071.950672][T22666] ? lock_downgrade+0x6e0/0x6e0 [21071.951345][T22666] ? do_setlink+0x811/0x2ef0 [21071.951991][T22666] do_setlink+0x811/0x2ef0 [21071.952613][T22666] ? is_bpf_text_address+0x81/0xe0 [ ... ] Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com Fixes: 4c2d5e33dcd3 ("hsr: rename debugfs file when interface name is changed") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30net/sched: add delete_empty() to filters and use it in cls_flowerDavide Caratti3-51/+17
Revert "net/sched: cls_u32: fix refcount leak in the error path of u32_change()", and fix the u32 refcount leak in a more generic way that preserves the semantic of rule dumping. On tc filters that don't support lockless insertion/removal, there is no need to guard against concurrent insertion when a removal is in progress. Therefore, for most of them we can avoid a full walk() when deleting, and just decrease the refcount, like it was done on older Linux kernels. This fixes situations where walk() was wrongly detecting a non-empty filter, like it happened with cls_u32 in the error path of change(), thus leading to failures in the following tdc selftests: 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id On cls_flower, and on (future) lockless filters, this check is necessary: move all the check_empty() logic in a callback so that each filter can have its own implementation. For cls_flower, it's sufficient to check if no IDRs have been allocated. This reverts commit 275c44aa194b7159d1191817b20e076f55f0e620. Changes since v1: - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED is used, thanks to Vlad Buslov - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov - squash revert and new fix in a single patch, to be nice with bisect tests that run tdc on u32 filter, thanks to Dave Miller Fixes: 275c44aa194b ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()") Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Suggested-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30net/ncsi: Fix gma flag setting after responseVijay Khemka2-3/+6
gma_flag was set at the time of GMA command request but it should only be set after getting successful response. Movinng this flag setting in GMA response handler. This flag is used mainly for not repeating GMA command once received MAC address. Signed-off-by: Vijay Khemka <vijaykhemka@fb.com> Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30sctp: add enabled check for path tracepoint loop.Kevin Kou1-2/+3
sctp_outq_sack is the main function handles SACK, it is called very frequently. As the commit "move trace_sctp_probe_path into sctp_outq_sack" added below code to this function, sctp tracepoint is disabled most of time, but the loop of transport list will be always called even though the tracepoint is disabled, this is unnecessary. + /* SCTP path tracepoint for congestion control debugging. */ + list_for_each_entry(transport, transport_list, transports) { + trace_sctp_probe_path(transport, asoc); + } This patch is to add tracepoint enabled check at outside of the loop of transport list, and avoid traversing the loop when trace is disabled, it is a small optimization. Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30tcp: Fix highest_sack and highest_sack_seqCambda Zhu1-0/+3
>From commit 50895b9de1d3 ("tcp: highest_sack fix"), the logic about setting tp->highest_sack to the head of the send queue was removed. Of course the logic is error prone, but it is logical. Before we remove the pointer to the highest sack skb and use the seq instead, we need to set tp->highest_sack to NULL when there is no skb after the last sack, and then replace NULL with the real skb when new skb inserted into the rtx queue, because the NULL means the highest sack seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent, the next ACK with sack option will increase tp->reordering unexpectedly. This patch sets tp->highest_sack to the tail of the rtx queue if it's NULL and new data is sent. The patch keeps the rule that the highest_sack can only be maintained by sack processing, except for this only case. Fixes: 50895b9de1d3 ("tcp: highest_sack fix") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30tcp_cubic: refactor code to perform a divide only when neededEric Dumazet1-23/+28
Neal Cardwell suggested to not change ca->delay_min and apply the ack delay cushion only when Hystart ACK train is still under consideration. This should avoid a 64bit divide unless needed. Tested: 40Gbit(mlx4) testbed (with sch_fq as packet scheduler) $ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 14815 16280 15293 15563 11574 15145 14789 18548 16972 12520 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 1396 0.0 $ dmesg | tail -10 [ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80 [ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160 [ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130 [ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130 [ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160 [ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216 [ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108 [ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130 [ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130 [ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152 10Gbit(bnx2x) testbed (with sch_fq as packet scheduler) $ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart" 7050 7065 7100 6900 7202 7263 7189 6869 7463 7034 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 3199 0.0 $ dmesg | tail -10 [ 176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264 [ 179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444 [ 181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436 [ 183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326 [ 185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128 [ 187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367 [ 190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444 [ 192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152 [ 194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239 [ 196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399 Fixes: 42f3a8aaae66 ("tcp_cubic: tweak Hystart detection for short RTT flows") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Neal Cardwell <ncardwell@google.com> Link: https://www.spinics.net/lists/netdev/msg621886.html Link: https://www.spinics.net/lists/netdev/msg621797.html Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller5-144/+352
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner. 2) Document ingress hook in netdevice, also from Lukas. 3) Remove htons() in tunnel metadata port netlink attributes, from Xin Long. 4) Missing erspan netlink attribute validation also from Xin Long. 5) Missing erspan version in tunnel, from Xin Long. 6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN} Patch from Xin Long. 7) Missing nla_nest_cancel() in tunnel netlink dump path, from Xin Long. 8) Remove two exported conntrack symbols with no clients, from Florian Westphal. 9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian. 10) Add nft_meta_pkttype helper for loopback, also from Florian. 11) Add nft_meta_socket uid helper, from Florian Westphal. 12) Add nft_meta_cgroup helper, from Florian. 13) Add nft_meta_ifkind helper, from Florian. 14) Group all interface related meta selector, from Florian. 15) Add nft_prandom_u32() helper, from Florian. 16) Add nft_meta_rtclassid helper, from Florian. 17) Add support for matching on the slave device index, from Florian. This batch, among other things, contains updates for the netfilter tunnel netlink interface: This extension is still incomplete and lacking proper userspace support which is actually my fault, I did not find the time to go back and finish this. This update is breaking tunnel UAPI in some aspects to fix it but do it better sooner than never. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-28net: dsa: Deny PTP on master if switch supports itVladimir Oltean1-0/+30
It is possible to kill PTP on a DSA switch completely and absolutely, until a reboot, with a simple command: tcpdump -i eth2 -j adapter_unsynced where eth2 is the switch's DSA master. Why? Well, in short, the PTP API in place today is a bit rudimentary and relies on applications to retrieve the TX timestamps by polling the error queue and looking at the cmsg structure. But there is no timestamp identification of any sorts (except whether it's HW or SW), you don't know how many more timestamps are there to come, which one is this one, from whom it is, etc. In other words, the SO_TIMESTAMPING API is fundamentally limited in that you can get a single HW timestamp from the stack. And the "-j adapter_unsynced" flag of tcpdump enables hardware timestamping. So let's imagine what happens when the DSA master decides it wants to deliver TX timestamps to the skb's socket too: - The timestamp that the user space sees is taken by the DSA master. Whereas the RX timestamp will eventually be overwritten by the DSA switch. So the RX and TX timestamps will be in different time bases (aka garbage). - The user space applications have no way to deal with the second (real) TX timestamp finally delivered by the DSA switch, or even to know to wait for it. Take ptp4l from the linuxptp project, for example. This is its behavior after running tcpdump, before the patch: ptp4l[172]: [6469.594] Unexpected data on socket err queue: ptp4l[172]: [6469.693] rms 8 max 16 freq -21257 +/- 11 delay 748 +/- 0 ptp4l[172]: [6469.711] Unexpected data on socket err queue: ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 05 00 fd ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: [6469.721] Unexpected data on socket err queue: ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02 ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b1 00 fd ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: [6469.838] Unexpected data on socket err queue: ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02 ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 06 00 fd ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: [6469.848] Unexpected data on socket err queue: ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 13 02 ptp4l[172]: 0010 00 36 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 04 1a 45 05 7f ptp4l[172]: 0030 00 00 5e 05 41 32 27 c2 1a 68 00 04 9f ff fe 05 ptp4l[172]: 0040 de 06 00 01 ptp4l[172]: [6469.855] Unexpected data on socket err queue: ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02 ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b2 00 fd ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: [6469.974] Unexpected data on socket err queue: ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02 ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 07 00 fd ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00 The ptp4l program itself is heavily patched to show this (more details here [0]). Otherwise, by default it just hangs. On the other hand, with the DSA patch to disallow HW timestamping applied: tcpdump -i eth2 -j adapter_unsynced tcpdump: SIOCSHWTSTAMP failed: Device or resource busy So it is a fact of life that PTP timestamping on the DSA master is incompatible with timestamping on the switch MAC, at least with the current API. And if the switch supports PTP, taking the timestamps from the switch MAC is highly preferable anyway, due to the fact that those don't contain the queuing latencies of the switch. So just disallow PTP on the DSA master if there is any PTP-capable switch attached. [0]: https://sourceforge.net/p/linuxptp/mailman/message/36880648/ Fixes: 0336369d3a4d ("net: dsa: forward hardware timestamping ioctls to switch driver") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide link state with LINKSTATE_GET requestMichal Kubecek7-5/+100
Implement LINKSTATE_GET netlink request to get link state information. At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command is returned. LINKSTATE_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add LINKMODES_NTF notificationMichal Kubecek3-2/+10
Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands. The notification message has the same format as reply to LINKMODES_GET request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the notification if there is a change but the ioctl command handlers do not check if there is an actual change and trigger the notification whenever the commands are executed. As all work is done by ethnl_default_notify() handler and callback functions introduced to handle LINKMODES_GET requests, all that remains is adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where needed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: set link modes related data with LINKMODES_SET requestMichal Kubecek3-0/+241
Implement LINKMODES_SET netlink request to set advertised linkmodes and related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do. The request allows setting autonegotiation flag, speed, duplex and advertised link modes. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide link mode information with LINKMODES_GET requestMichal Kubecek4-1/+150
Implement LINKMODES_GET netlink request to get link modes related information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides supported, advertised and peer advertised link modes, autonegotiation flag, speed and duplex. LINKMODES_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add LINKINFO_NTF notificationMichal Kubecek3-2/+14
Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands. The notification message has the same format as reply to LINKINFO_GET request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the notification if there is a change but the ioctl command handlers do not check if there is an actual change and trigger the notification whenever the commands are executed. As all work is done by ethnl_default_notify() handler and callback functions introduced to handle LINKINFO_GET requests, all that remains is adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where needed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add default notification handlerMichal Kubecek2-1/+92
The ethtool netlink notifications have the same format as related GET replies so that if generic GET handling framework is used to process GET requests, its callbacks and instance of struct get_request_ops can be also used to compose corresponding notification message. Provide function ethnl_std_notify() to be used as notification handler in ethnl_notify_handlers table. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: set link settings with LINKINFO_SET requestMichal Kubecek3-0/+78
Implement LINKINFO_SET netlink request to set link settings queried by LINKINFO_GET message. Only physical port, phy MDIO address and MDI(-X) control can be set, attempt to modify MDI(-X) status and transceiver is rejected. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide link settings with LINKINFO_GET requestMichal Kubecek7-49/+156
Implement LINKINFO_GET netlink request to get basic link settings provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides settings not directly related to autonegotiation and link mode selection: physical port, phy MDIO address, MDI(-X) status, MDI(-X) control and transceiver. LINKINFO_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide string sets with STRSET_GET requestMichal Kubecek4-1/+438
Requests a contents of one or more string sets, i.e. indexed arrays of strings; this information is provided by ETHTOOL_GSSET_INFO and ETHTOOL_GSTRINGS commands of ioctl interface. Unlike ioctl interface, all information can be retrieved with one request and mulitple string sets can be requested at once. There are three types of requests: - no NLM_F_DUMP, no device: get "global" stringsets - no NLM_F_DUMP, with device: get string sets related to the device - NLM_F_DUMP, no device: get device related string sets for all devices Client can request either all string sets of given type (global or device related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only set sizes (numbers of strings) are returned. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: default handlers for GET requestsMichal Kubecek2-1/+436
Significant part of GET request processing is common for most request types but unfortunately it cannot be easily separated from type specific code as we need to alternate between common actions (parsing common request header, allocating message and filling netlink/genetlink headers etc.) and specific actions (querying the device, composing the reply). The processing also happens in three different situations: "do" request, "dump" request and notification, each doing things in slightly different way. The request specific code is implemented in four or five callbacks defined in an instance of struct get_request_ops: parse_request() - parse incoming message prepare_data() - retrieve data from driver or NIC reply_size() - estimate reply message size fill_reply() - compose reply message cleanup_data() - (optional) clean up additional data Other members of struct get_request_ops describe the data structure holding information from client request and data used to compose the message. The default handlers ethnl_default_doit(), ethnl_default_dumpit(), ethnl_default_start() and ethnl_default_done() can be then used in genl_ops handler. Notification handler will be introduced in a later patch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: support for netlink notificationsMichal Kubecek1-0/+32
Add infrastructure for ethtool netlink notifications. There is only one multicast group "monitor" which is used to notify userspace about changes and actions performed. Notification messages (types using suffix _NTF) share the format with replies to GET requests. Notifications are supposed to be broadcasted on every configuration change, whether it is done using the netlink interface or ioctl one. Netlink SET requests only trigger a notification if some data is actually changed. To trigger an ethtool notification, both ethtool netlink and external code use ethtool_notify() helper. This helper requires RTNL to be held and may sleep. Handlers sending messages for specific notification message types are registered in ethnl_notify_handlers array. As notifications can be triggered from other code, ethnl_ok flag is used to prevent an attempt to send notification before genetlink family is registered. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: netlink bitset handlingMichal Kubecek3-1/+764
The ethtool netlink code uses common framework for passing arbitrary length bit sets to allow future extensions. A bitset can be a list (only one bitmap) or can consist of value and mask pair (used e.g. when client want to modify only some bits). A bitset can use one of two formats: verbose (bit by bit) or compact. Verbose format consists of bitset size (number of bits), list flag and an array of bit nests, telling which bits are part of the list or which bits are in the mask and which of them are to be set. In requests, bits can be identified by index (position) or by name. In replies, kernel provides both index and name. Verbose format is suitable for "one shot" applications like standard ethtool command as it avoids the need to either keep bit names (e.g. link modes) in sync with kernel or having to add an extra roundtrip for string set request (e.g. for private flags). Compact format uses one (list) or two (value/mask) arrays of 32-bit words to store the bitmap(s). It is more suitable for long running applications (ethtool in monitor mode or network management daemons) which can retrieve the names once and then pass only compact bitmaps to save space. Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS flag in request header tells kernel which format to use in reply. Notifications always use compact format. As some code uses arrays of unsigned long for internal representation and some arrays of u32 (or even a single u32), two sets of parse/compose helpers are introduced. To avoid code duplication, helpers for unsigned long arrays are implemented as wrappers around helpers for u32 arrays. There are two reasons for this choice: (1) u32 arrays are more frequent in ethtool code and (2) unsigned long array can be always interpreted as an u32 array on little endian 64-bit and all 32-bit architectures while we would need special handling for odd number of u32 words in the opposite direction. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: helper functions for netlink interfaceMichal Kubecek2-0/+371
Add common request/reply header definition and helpers to parse request header and fill reply header. Provide ethnl_update_* helpers to update structure members from request attributes (to be used for *_SET requests). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: introduce ethtool netlink interfaceMichal Kubecek4-1/+56
Basic genetlink and init infrastructure for the netlink interface, register genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to make the build optional. Add initial overall interface description into Documentation/networking/ethtool-netlink.rst, further patches will add more detailed information. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit deviceShmulik Ladkani1-10/+12
There's no skb_pull performed when a mirred action is set at egress of a mac device, with a target device/action that expects skb->data to point at the network header. As a result, either the target device is errornously given an skb with data pointing to the mac (egress case), or the net stack receives the skb with data pointing to the mac (ingress case). E.g: # tc qdisc add dev eth9 root handle 1: prio # tc filter add dev eth9 parent 1: prio 9 protocol ip handle 9 basic \ action mirred egress redirect dev tun0 (tun0 is a tun device. result: tun0 errornously gets the eth header instead of the iph) Revise the push/pull logic of tcf_mirred_act() to not rely on the skb_at_tc_ingress() vs tcf_mirred_act_wants_ingress() comparison, as it does not cover all "pull" cases. Instead, calculate whether the required action on the target device requires the data to point at the network header, and compare this to whether skb->data points to network header - and make the push/pull adjustments as necessary. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Shmulik Ladkani <sladkani@proofpoint.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27sctp: do trace_sctp_probe after SACK validation and checkKevin Kou1-9/+9
The function sctp_sf_eat_sack_6_2 now performs the Verification Tag validation, Chunk length validation, Bogu check, and also the detection of out-of-order SACK based on the RFC2960 Section 6.2 at the beginning, and finally performs the further processing of SACK. The trace_sctp_probe now triggered before the above necessary validation and check. this patch is to do the trace_sctp_probe after the chunk sanity tests, but keep doing trace if the SACK received is out of order, for the out-of-order SACK is valuable to congestion control debugging. v1->v2: - keep doing SCTP trace if the SACK is out of order as Marcelo's suggestion. v2->v3: - regenerate the patch as v2 generated on top of v1, and add 'net-next' tag to the new one as Marcelo's comments. Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: make Hystart aware of pacingEric Dumazet1-1/+12
For years we disabled Hystart ACK train detection at Google because it was fooled by TCP pacing. ACK train detection uses a simple heuristic, detecting if we receive ACK past half the RTT, to exit slow start before hitting the bottleneck and experience massive drops. But pacing by design might delay packets up to RTT/2, so we need to tweak the Hystart logic to be aware of this extra delay. Tested: Added a 100 usec delay at receiver. Before: nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 9117 7057 9553 8300 7030 6849 9533 10126 6876 8473 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 1230 0.0 After : nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 9845 10103 10866 11096 11936 11487 11773 12188 11066 11894 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 6462 0.0 Disabling Hystart ACK Train detection gives similar numbers echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 11173 10954 12455 10627 11578 11583 11222 10880 10665 11366 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: tweak Hystart detection for short RTT flowsEric Dumazet1-2/+21
After switching ca->delay_min to usec resolution, we exit slow start prematurely for very low RTT flows, setting snd_ssthresh to 20. The reason is that delay_min is fed with RTT of small packet trains. Then as cwnd is increased, TCP sends bigger TSO packets. LRO/GRO aggregation and/or interrupt mitigation strategies on receiver tend to inflate RTT samples. Fix this by adding to delay_min the expected delay of two TSO packets, given current pacing rate. Tested: Sender uses pfifo_fast qdisc Before : $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 11348 11707 11562 11428 11773 11534 9878 11693 10597 10968 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 200 0.0 After : $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 14877 14517 15797 18466 17376 14833 17558 17933 16039 18059 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 1670 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: switch bictcp_clock() to usec resolutionEric Dumazet1-21/+14
Current 1ms clock feeds ca->round_start, ca->delay_min, ca->last_ack. This is quite problematic for data-center flows, where delay_min is way below 1 ms. This means Hystart Train detection triggers every time jiffies value is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)" expression becomes true. This kind of random behavior can be solved by reusing the existing usec timestamp that TCP keeps in tp->tcp_mstamp Note that a followup patch will tweak things a bit, because during slow start, GRO aggregation on receivers naturally increases the RTT as TSO packets gradually come to ~64KB size. To recap, right after this patch CUBIC Hystart train detection is more aggressive, since short RTT flows might exit slow start at cwnd = 20, instead of being possibly unbounded. Following patch will address this problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: remove one conditional from hystart_update()Eric Dumazet1-2/+2
If we initialize ca->curr_rtt to ~0U, we do not need to test for zero value in hystart_update() We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have been processed, and thus ca->curr_rtt will have a sane value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>