aboutsummaryrefslogtreecommitdiffstats
path: root/net/core (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-03-07ethtool: reduce stack usage with clangArnd Bergmann1-7/+9
clang inlines the dev_ethtool() more aggressively than gcc does, leading to a larger amount of used stack space: net/core/ethtool.c:2536:24: error: stack frame size of 1216 bytes in function 'dev_ethtool' [-Werror,-Wframe-larger-than=] Marking the sub-functions that require the most stack space as noinline_for_stack gives us reasonable behavior on all compilers. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-2/+3
2019-03-04devlink: Add support for direct reporter health state updateEran Ben Elisha1-5/+17
It is possible that a reporter state will be updated due to a recover flow which is not triggered by a devlink health related operation, but as a side effect of some other operation in the system. Expose devlink health API for a direct update of a reporter status. Move devlink_health_reporter_state enum definition to devlink.h so it could be used from drivers as a parameter of devlink_health_reporter_state_update. In addition, add trace_devlink_health_reporter_state_update to provide user notification for reporter state change. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04devlink: Update reporter state to error even if recover abortedEran Ben Elisha1-1/+4
If devlink_health_report() aborted the recover flow due to grace period checker, it left the reporter status as DEVLINK_HEALTH_REPORTER_STATE_HEALTHY, which is a bug. Fix that by always setting the reporter state to DEVLINK_HEALTH_REPORTER_STATE_ERROR prior to running the checker mentioned above. In addition, save the previous health_state in a temporary variable, then use it in the abort check comparison instead of using reporter->health_state which might be already changed. Fixes: c8e1da0bf923 ("devlink: Add health report functionality") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04net-sysfs: Switch to bitmap_zalloc()Andy Shevchenko1-7/+5
Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-3/+41
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-03-04 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add AF_XDP support to libbpf. Rationale is to facilitate writing AF_XDP applications by offering higher-level APIs that hide many of the details of the AF_XDP uapi. Sample programs are converted over to this new interface as well, from Magnus. 2) Introduce a new cant_sleep() macro for annotation of functions that cannot sleep and use it in BPF_PROG_RUN() to assert that BPF programs run under preemption disabled context, from Peter. 3) Introduce per BPF prog stats in order to monitor the usage of BPF; this is controlled by kernel.bpf_stats_enabled sysctl knob where monitoring tools can make use of this to efficiently determine the average cost of programs, from Alexei. 4) Split up BPF selftest's test_progs similarly as we already did with test_verifier. This allows to further reduce merge conflicts in future and to get more structure into our quickly growing BPF selftest suite, from Stanislav. 5) Fix a bug in BTF's dedup algorithm which can cause an infinite loop in some circumstances; also various BPF doc fixes and improvements, from Andrii. 6) Various BPF sample cleanups and migration to libbpf in order to further isolate the old sample loader code (so we can get rid of it at some point), from Jakub. 7) Add a new BPF helper for BPF cgroup skb progs that allows to set ECN CE code point and a Host Bandwidth Manager (HBM) sample program for limiting the bandwidth used by v2 cgroups, from Lawrence. 8) Enable write access to skb->queue_mapping from tc BPF egress programs in order to let BPF pick TX queue, from Jesper. 9) Fix a bug in BPF spinlock handling for map-in-map which did not propagate spin_lock_off to the meta map, from Yonghong. 10) Fix a bug in the new per-CPU BPF prog counters to properly initialize stats for each CPU, from Eric. 11) Add various BPF helper prototypes to selftest's bpf_helpers.h, from Willem. 12) Fix various BPF samples bugs in XDP and tracing progs, from Toke, Daniel and Yonghong. 13) Silence preemption splat in test_bpf after BPF_PROG_RUN() enforces it now everywhere, from Anders. 14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to get error handling working, from Dan. 15) Fix bpftool documentation and auto-completion with regards to stream_{verdict,parser} attach types, from Alban. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-03net-sysfs: Fix mem leak in netdev_register_kobjectYueHaibing1-0/+3
syzkaller report this: BUG: memory leak unreferenced object 0xffff88837a71a500 (size 256): comm "syz-executor.2", pid 9770, jiffies 4297825125 (age 17.843s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 20 c0 ef 86 ff ff ff ff ........ ....... backtrace: [<00000000db12624b>] netdev_register_kobject+0x124/0x2e0 net/core/net-sysfs.c:1751 [<00000000dc49a994>] register_netdevice+0xcc1/0x1270 net/core/dev.c:8516 [<00000000e5f3fea0>] tun_set_iff drivers/net/tun.c:2649 [inline] [<00000000e5f3fea0>] __tun_chr_ioctl+0x2218/0x3d20 drivers/net/tun.c:2883 [<000000001b8ac127>] vfs_ioctl fs/ioctl.c:46 [inline] [<000000001b8ac127>] do_vfs_ioctl+0x1a5/0x10e0 fs/ioctl.c:690 [<0000000079b269f8>] ksys_ioctl+0x89/0xa0 fs/ioctl.c:705 [<00000000de649beb>] __do_sys_ioctl fs/ioctl.c:712 [inline] [<00000000de649beb>] __se_sys_ioctl fs/ioctl.c:710 [inline] [<00000000de649beb>] __x64_sys_ioctl+0x74/0xb0 fs/ioctl.c:710 [<000000007ebded1e>] do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290 [<00000000db315d36>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<00000000115be9bb>] 0xffffffffffffffff It should call kset_unregister to free 'dev->queues_kset' in error path of register_queue_kobjects, otherwise will cause a mem leak. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 1d24eb4815d1 ("xps: Transmit Packet Steering") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02net: sched: put back q.qlen into a single locationEric Dumazet1-2/+0
In the series fc8b81a5981f ("Merge branch 'lockless-qdisc-series'") John made the assumption that the data path had no need to read the qdisc qlen (number of packets in the qdisc). It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO children. But pfifo_fast can be used as leaf in class full qdiscs, and existing logic needs to access the child qlen in an efficient way. HTB breaks badly, since it uses cl->leaf.q->q.qlen in : htb_activate() -> WARN_ON() htb_dequeue_tree() to decide if a class can be htb_deactivated when it has no more packets. HFSC, DRR, CBQ, QFQ have similar issues, and some calls to qdisc_tree_reduce_backlog() also read q.qlen directly. Using qdisc_qlen_sum() (which iterates over all possible cpus) in the data path is a non starter. It seems we have to put back qlen in a central location, at least for stable kernels. For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock, so the existing q.qlen{++|--} are correct. For 'lockless' qdisc (pfifo_fast so far), we need to use atomic_{inc|dec}() because the spinlock might be not held (for example from pfifo_fast_enqueue() and pfifo_fast_dequeue()) This patch adds atomic_qlen (in the same location than qlen) and renames the following helpers, since we want to express they can be used without qdisc lock, and that qlen is no longer percpu. - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec() - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc() Later (net-next) we might revert this patch by tracking all these qlen uses and replace them by a more efficient method (not having to access a precise qlen, but an empty/non_empty status that might be less expensive to maintain/track). Another possibility is to have a legacy pfifo_fast version that would be used when used a a child qdisc, since the parent qdisc needs a spinlock anyway. But then, future lockless qdiscs would also have the same problem. Fixes: 7e66016f2c65 ("net: sched: helpers to sum qlen and qlen for per cpu logic") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02bpf: add bpf helper bpf_skb_ecn_set_cebrakmo1-0/+28
This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can be attached to the ingress and egress path. The helper is needed because his type of bpf_prog cannot modify the skb directly. This helper is used to set the ECN field of ECN capable IP packets to ce (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be used by a bpf_prog to manage egress or ingress network bandwdith limit per cgroupv2 by inducing an ECN response in the TCP sender. This works best when using DCTCP. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-01net: support 64bit rates for getsockopt(SO_MAX_PACING_RATE)Eric Dumazet1-2/+8
For legacy applications using 32bit variable, SO_MAX_PACING_RATE has to cap the returned value to 0xFFFFFFFF, meaning that rates above 34.35 Gbit are capped. This patch allows applications to read socket pacing rate at full resolution, if they provide a 64bit variable to store it, and the kernel is 64bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01net: support 64bit values for setsockopt(SO_MAX_PACING_RATE)Eric Dumazet1-5/+13
64bit kernels now support 64bit pacing rates. This commit changes setsockopt() to accept 64bit values provided by applications. Old applications providing 32bit value are still supported, but limited to the old 34Gbit limitation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01devlink: fix kdocJakub Kicinski1-7/+5
devlink suffers from a few kdoc warnings: net/core/devlink.c:5292: warning: Function parameter or member 'dev' not described in 'devlink_register' net/core/devlink.c:5351: warning: Function parameter or member 'port_index' not described in 'devlink_port_register' net/core/devlink.c:5753: warning: Function parameter or member 'parent_resource_id' not described in 'devlink_resource_register' net/core/devlink.c:5753: warning: Function parameter or member 'size_params' not described in 'devlink_resource_register' net/core/devlink.c:5753: warning: Excess function parameter 'top_hierarchy' description in 'devlink_resource_register' net/core/devlink.c:5753: warning: Excess function parameter 'reload_required' description in 'devlink_resource_register' net/core/devlink.c:5753: warning: Excess function parameter 'parent_reosurce_id' description in 'devlink_resource_register' net/core/devlink.c:6451: warning: Function parameter or member 'region' not described in 'devlink_region_snapshot_create' net/core/devlink.c:6451: warning: Excess function parameter 'devlink_region' description in 'devlink_region_snapshot_create' Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27ethtool: Use explicit designated initializers for .cmdLi RongQing1-2/+2
Initialize the .cmd member by using a designated struct initializer. This fixes warning of missing field initializers, and makes code a little easier to read. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26devlink: require non-NULL ops for devlink instancesJakub Kicinski1-26/+22
Commit 76726ccb7f46 ("devlink: add flash update command") and commit 2d8dc5bbf4e7 ("devlink: Add support for reload") access devlink ops without NULL-checking. There is, however, no driver which would pass in NULL ops, so let's just make that a requirement. Remove the now unnecessary NULL-checking. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26devlink: hold a reference to the netdevice around ethtool compatJakub Kicinski2-11/+15
When ethtool is calling into devlink compat code make sure we have a reference on the netdevice on which the operation was invoked. v3: move the hold/lock logic into devlink_compat_* functions (Florian) Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26devlink: create a special NDO for getting the devlink instanceJakub Kicinski1-39/+17
Instead of iterating over all devlink ports add a NDO which will return the devlink instance from the driver. v2: add the netdev_to_devlink() helper (Michal) v3: check that devlink has ops (Florian) v4: hold devlink_mutex (Jiri) Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26net: devlink: turn devlink into a built-inJakub Kicinski1-13/+2
Being able to build devlink as a module causes growing pains. First all drivers had to add a meta dependency to make sure they are not built in when devlink is built as a module. Now we are struggling to invoke ethtool compat code reliably. Make devlink code built-in, users can still not build it at all but the dynamically loadable module option is removed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24net: fix double-free in bpf_lwt_xmit_reroutePeter Oskolkov1-1/+1
dst_output() frees skb when it fails (see, for example, ip_finish_output2), so it must not be freed in this case. Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c") Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24ip_tunnel: Add dst_cache support in lwtunnel_state of ip tunnelwenxu1-8/+8
The lwtunnel_state is not init the dst_cache Which make the ip_md_tunnel_xmit can't use the dst_cache. It will lookup route table every packets. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24net: dev: add generic protodown handlerAndy Roulin1-0/+19
Introduce dev_change_proto_down_generic, a generic ndo_change_proto_down implementation, which sets the netdev carrier state according to proto_down. This adds the ability to set protodown on vxlan and macvlan devices in a generic way for use by control protocols like VRRPD. Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24net: Skip GSO length estimation if transport header is not setMaxim Mikityanskiy1-1/+1
qdisc_pkt_len_init expects transport_header to be set for GSO packets. Patch [1] skips transport_header validation for GSO packets that don't have network_header set at the moment of calling virtio_net_hdr_to_skb, and allows them to pass into the stack. After patch [2] no placeholder value is assigned to transport_header if dissection fails, so this patch adds a check to the place where the value of transport_header is used. [1] https://patchwork.ozlabs.org/patch/1044429/ [2] https://patchwork.ozlabs.org/patch/1046122/ Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24net: Use RCU_INIT_POINTER() to set sk_wqLi RongQing1-3/+3
This pointer is RCU protected, so proper primitives should be used. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21devlink: Modify reply of DEVLINK_CMD_HEALTH_REPORTER_GETAya Levin1-2/+4
Avoid sending attributes related to recovery: DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD and DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER in reply to DEVLINK_CMD_HEALTH_REPORTER_GET for a reporter which didn't register a recover operation. These parameters can't be configured on a reporter that did not provide a recover operation, thus not needed to return them. Fixes: 7afe335a8bed ("devlink: Add health get command") Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21devlink: Rename devlink health attributesAya Levin1-2/+2
Rename devlink health attributes for better reflect the attributes use. Add COUNT prefix on error counter attribute and recovery counter attribute. Fixes: 7afe335a8bed ("devlink: Add health get command") Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-10/+10
Two easily resolvable overlapping change conflicts, one in TCP and one in the eBPF verifier. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-19bpf: add skb->queue_mapping write access from tc clsactJesper Dangaard Brouer1-3/+13
The skb->queue_mapping already have read access, via __sk_buff->queue_mapping. This patch allow BPF tc qdisc clsact write access to the queue_mapping via tc_cls_act_is_valid_access. Also handle that the value NO_QUEUE_MAPPING is not allowed. It is already possible to change this via TC filter action skbedit tc-skbedit(8). Due to the lack of TC examples, lets show one: # tc qdisc add dev ixgbe1 clsact # tc filter add dev ixgbe1 ingress matchall action skbedit queue_mapping 5 # tc filter list dev ixgbe1 ingress The most common mistake is that XPS (Transmit Packet Steering) takes precedence over setting skb->queue_mapping. XPS is configured per DEVICE via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To disable set mask=00. The purpose of changing skb->queue_mapping is to influence the selection of the net_device "txq" (struct netdev_queue), which influence selection of the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When using the MQ qdisc the txq->qdisc points to different qdiscs and associated locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability. Due to lack of TC examples, lets show howto attach clsact BPF programs: # tc qdisc add dev ixgbe2 clsact # tc filter add dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu # tc filter list dev ixgbe2 egress Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-17net: Do not allocate page fragments that are not skb alignedAlexander Duyck1-0/+4
This patch addresses the fact that there are drivers, specifically tun, that will call into the network page fragment allocators with buffer sizes that are not cache aligned. Doing this could result in data alignment and DMA performance issues as these fragment pools are also shared with the skb allocator and any other devices that will use napi_alloc_frags or netdev_alloc_frags. Fixes: ffde7328a36d ("net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17ethtool: add compat for flash updateJakub Kicinski2-3/+39
If driver does not support ethtool flash update operation call into devlink. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17devlink: add flash update commandJakub Kicinski1-0/+30
Add devlink flash update command. Advanced NICs have firmware stored in flash and often cryptographically secured. Updating that flash is handled by management firmware. Ethtool has a flash update command which served us well, however, it has two shortcomings: - it takes rtnl_lock unnecessarily - really flash update has nothing to do with networking, so using a networking device as a handle is suboptimal, which leads us to the second one: - it requires a functioning netdev - in case device enters an error state and can't spawn a netdev (e.g. communication with the device fails) there is no netdev to use as a handle for flashing. Devlink already has the ability to report the firmware versions, now with the ability to update the firmware/flash we will be able to recover devices in bad state. To enable updates of sub-components of the FW allow passing component name. This name should correspond to one of the versions reported in devlink info. v1: - replace target id with component name (Jiri). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17neigh: hook tracepoints in neigh update codeRoopa Prabhu1-0/+11
hook tracepoints at the end of functions that update a neigh entry. neigh_update gets an additional tracepoint to trace the update flags and old and new neigh states. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17trace: events: add a few neigh tracepointsRoopa Prabhu1-0/+8
The goal here is to trace neigh state changes covering all possible neigh update paths. Plus have a specific trace point in neigh_update to cover flags sent to neigh_update. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2-183/+626
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-02-16 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong. 2) test all bpf progs in alu32 mode, from Jiong. 3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin. 4) support for IP encap in lwt bpf progs, from Peter. 5) remove XDP_QUERY_XSK_UMEM dead code, from Jan. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-8/+4
Alexei Starovoitov says: ==================== pull-request: bpf 2019-02-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix lockdep false positive in bpf_get_stackid(), from Alexei. 2) several AF_XDP fixes, from Bjorn, Magnus, Davidlohr. 3) fix narrow load from struct bpf_sock, from Martin. 4) mips JIT fixes, from Paul. 5) gso handling fix in bpf helpers, from Willem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF valuesGuillaume Nault1-0/+20
SO_SNDBUF and SO_RCVBUF (and their *BUFFORCE version) may overflow or underflow their input value. This patch aims at providing explicit handling of these extreme cases, to get a clear behaviour even with values bigger than INT_MAX / 2 or lower than INT_MIN / 2. For simplicity, only SO_SNDBUF and SO_SNDBUFFORCE are described here, but the same explanation and fix apply to SO_RCVBUF and SO_RCVBUFFORCE (with 'SNDBUF' replaced by 'RCVBUF' and 'wmem_max' by 'rmem_max'). Overflow of positive values =========================== When handling SO_SNDBUF or SO_SNDBUFFORCE, if 'val' exceeds INT_MAX / 2, the buffer size is set to its minimum value because 'val * 2' overflows, and max_t() considers that it's smaller than SOCK_MIN_SNDBUF. For SO_SNDBUF, this can only happen with net.core.wmem_max > INT_MAX / 2. SO_SNDBUF and SO_SNDBUFFORCE are actually designed to let users probe for the maximum buffer size by setting an arbitrary large number that gets capped to the maximum allowed/possible size. Having the upper half of the positive integer space to potentially reduce the buffer size to its minimum value defeats this purpose. This patch caps the base value to INT_MAX / 2, so that bigger values don't overflow and keep setting the buffer size to its maximum. Underflow of negative values ============================ For negative numbers, SO_SNDBUF always considers them bigger than net.core.wmem_max, which is bounded by [SOCK_MIN_SNDBUF, INT_MAX]. Therefore such values are set to net.core.wmem_max and we're back to the behaviour of positive integers described above (return maximum buffer size if wmem_max <= INT_MAX / 2, return SOCK_MIN_SNDBUF otherwise). However, SO_SNDBUFFORCE behaves differently. The user value is directly multiplied by two and compared with SOCK_MIN_SNDBUF. If 'val * 2' doesn't underflow or if it underflows to a value smaller than SOCK_MIN_SNDBUF then buffer size is set to its minimum value. Otherwise the buffer size is set to the underflowed value. This patch treats negative values passed to SO_SNDBUFFORCE as null, to prevent underflows. Therefore negative values now always set the buffer size to its minimum value. Even though SO_SNDBUF behaves inconsistently by setting buffer size to the maximum value when passed a negative number, no attempt is made to modify this behaviour. There may exist some programs that rely on using negative numbers to set the maximum buffer size. Avoiding overflows because of extreme net.core.wmem_max values is the most we can do here. Summary of altered behaviours ============================= val : user-space value passed to setsockopt() val_uf : the underflowed value resulting from doubling val when val < INT_MIN / 2 wmem_max : short for net.core.wmem_max val_cap : min(val, wmem_max) min_len : minimal buffer length (that is, SOCK_MIN_SNDBUF) max_len : maximal possible buffer length, regardless of wmem_max (that is, INT_MAX - 1) ^^^^ : altered behaviour SO_SNDBUF: +-------------------------+-------------+------------+----------------+ | CONDITION | OLD RESULT | NEW RESULT | COMMENT | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | No overflow, | | wmem_max <= INT_MAX/2 | wmem_max*2 | wmem_max*2 | keep original | | | | | behaviour | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | Cap wmem_max | | INT_MAX/2 < wmem_max | min_len | max_len | to prevent | | | | ^^^^^^^ | overflow | +-------------------------+-------------+------------+----------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +-------------------------+-------------+------------+----------------+ | min_len/2 < val && | val_cap*2 | val_cap*2 | Ordinary case | | val_cap <= INT_MAX/2 | | | | +-------------------------+-------------+------------+----------------+ | min_len < val && | | | Cap val_cap | | INT_MAX/2 < val_cap | min_len | max_len | again to | | (implies that | | ^^^^^^^ | prevent | | INT_MAX/2 < wmem_max) | | | overflow | +-------------------------+-------------+------------+----------------+ SO_SNDBUFFORCE: +------------------------------+---------+---------+------------------+ | CONDITION | BEFORE | AFTER | COMMENT | | | PATCH | PATCH | | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | min_len | min_len | Underflow with | | val_uf <= min_len | | | no consequence | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | val_uf | min_len | Set val to 0 to | | val_uf > min_len | | ^^^^^^^ | avoid underflow | +------------------------------+---------+---------+------------------+ | INT_MIN/2 <= val < 0 | min_len | min_len | No underflow | +------------------------------+---------+---------+------------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +------------------------------+---------+---------+------------------+ | min_len/2 < val <= INT_MAX/2 | val*2 | val*2 | Ordinary case | +------------------------------+---------+---------+------------------+ | INT_MAX/2 < val | min_len | max_len | Cap val to | | | | ^^^^^^^ | prevent overflow | +------------------------------+---------+---------+------------------+ Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15net: Fix for_each_netdev_feature on Big endianHauke Mehrtens1-2/+2
The features attribute is of type u64 and stored in the native endianes on the system. The for_each_set_bit() macro takes a pointer to a 32 bit array and goes over the bits in this area. On little Endian systems this also works with an u64 as the most significant bit is on the highest address, but on big endian the words are swapped. When we expect bit 15 here we get bit 47 (15 + 32). This patch converts it more or less to its own for_each_set_bit() implementation which works on 64 bit integers directly. This is then completely in host endianness and should work like expected. Fixes: fd867d51f ("net/core: generic support for disabling netdev features down stack") Signed-off-by: Hauke Mehrtens <hauke.mehrtens@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
The netfilter conflicts were rather simple overlapping changes. However, the cls_tcindex.c stuff was a bit more complex. On the 'net' side, Cong is fixing several races and memory leaks. Whilst on the 'net-next' side we have Vlad adding the rtnl-ness support. What I've decided to do, in order to resolve this, is revert the conversion over to using a workqueue that Cong did, bringing us back to pure RCU. I did it this way because I believe that either Cong's races don't apply with have Vlad did things, or Cong will have to implement the race fix slightly differently. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-14bpf: fix memory leak in bpf_lwt_xmit_reroutePeter Oskolkov1-9/+20
On error the skb should be freed. Tested with diff/steps provided by David Ahern. v2: surface routing errors to the user instead of a generic EINVAL, as suggested by David Ahern. Reported-by: David Ahern <dsahern@gmail.com> Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c") Signed-off-by: Peter Oskolkov <posk@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-14devlink: Fix list access without lock while reading regionParav Pandit1-2/+5
While finding the devlink device during region reading, devlink device list is accessed and devlink device is returned without holding a lock. This could lead to use-after-free accesses. While at it, add lockdep assert to ensure that all future callers hold the lock when calling devlink_get_from_attrs(). Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command") Signed-off-by: Parav Pandit <parav@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-14devlink: Return right error code in case of errors for region readParav Pandit1-7/+19
devlink_nl_cmd_region_read_dumpit() misses to return right error code on most error conditions. Return the right error code on such errors. Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappingsJesper Dangaard Brouer1-5/+6
As pointed out by Alexander Duyck, the DMA mapping done in page_pool needs to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC. As the principle behind page_pool keeping the pages mapped is that the driver takes over the DMA-sync steps. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: page_pool: don't use page->private to store dma_addr_tIlias Apalodimas1-4/+9
As pointed out by David Miller the current page_pool implementation stores dma_addr_t in page->private. This won't work on 32-bit platforms with 64-bit DMA addresses since the page->private is an unsigned long and the dma_addr_t a u64. A previous patch is adding dma_addr_t on struct page to accommodate this. This patch adapts the page_pool related functions to use the newly added struct for storing and retrieving DMA addresses from network drivers. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: fix possible overflow in __sk_mem_raise_allocated()Eric Dumazet1-1/+1
With many active TCP sockets, fat TCP sockets could fool __sk_mem_raise_allocated() thanks to an overflow. They would increase their share of the memory, instead of decreasing it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.cPeter Oskolkov1-2/+124
This patch builds on top of the previous patch in the patchset, which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the encapping can result in the skb needing to go via a different interface/route/dst, bpf programs can indicate this by returning BPF_LWT_REROUTE, which triggers a new route lookup for the skb. v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N, call IPV6 routing functions only if they are built-in. v9 changes: - fixed a kbuild test robot compiler warning; - call IPV6 routing functions via ipv6_stub. v10 changes: removed unnecessary IS_ENABLED and pr_warn_once. v11 changes: fixed a potential dst leak. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: handle GSO in bpf_lwt_push_encapPeter Oskolkov1-2/+65
This patch adds handling of GSO packets in bpf_lwt_push_ip_encap() (called from bpf_lwt_push_encap): * IPIP, GRE, and UDP encapsulation types are deduced by looking into iphdr->protocol or ipv6hdr->next_header; * SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6 and similar do); * UDP_L4 GSO packets are also not supported (although they are not blocked in bpf_skb_proto_4_to_6 and similar), as skb_decrease_gso_size() will break it; * SKB_GSO_DODGY bit is set. Note: it may be possible to support SCTP and UDP_L4 gso packets; but as these cases seem to be not well handled by other tunneling/encapping code paths, the solution should be generic enough to apply to all tunneling/encapping code. v8 changes: - make sure that if GRE or UDP encap is detected, there is enough of pushed bytes to cover both IP[v6] + GRE|UDP headers; - do not reject double-encapped packets; - whitelist TCP GSO packets rather than block SCTP GSO and UDP GSO. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encapPeter Oskolkov2-1/+67
Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers to packets (e.g. IP/GRE, GUE, IPIP). This is useful when thousands of different short-lived flows should be encapped, each with different and dynamically determined destination. Although lwtunnels can be used in some of these scenarios, the ability to dynamically generate encap headers adds more flexibility, e.g. when routing depends on the state of the host (reflected in global bpf maps). v7 changes: - added a call skb_clear_hash(); - removed calls to skb_set_transport_header(); - refuse to encap GSO-enabled packets. v8 changes: - fix build errors when LWT is not enabled. Note: the next patch in the patchset with deal with GSO-enabled packets, which are currently rejected at encapping attempt. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encapPeter Oskolkov1-5/+43
This patch adds all needed plumbing in preparation to allowing bpf programs to do IP encapping via bpf_lwt_push_encap. Actual implementation is added in the next patch in the patchset. Of note: - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT prog types in addition to BPF_PROG_TYPE_LWT_IN; - if the skb being encapped has GSO set, encapsulation is limited to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6); - as route lookups are different for ingress vs egress, the single external bpf_lwt_push_encap BPF helper is routed internally to either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs, depending on prog type. v8 changes: fixed a typo. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-12devlink: use direct return of genlmsg_replyLi RongQing1-4/+1
This can remove redundant check Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12Revert "devlink: Add a generic wake_on_lan port parameter"Vasundhara Volam1-5/+0
This reverts commit b639583f9e36d044ac1b13090ae812266992cbac. As per discussion with Jakub Kicinski and Michal Kubecek, this will be better addressed by soon-too-come ethtool netlink API with additional indication that given configuration request is supposed to be persisted. Also, remove the parameter support from bnxt_en driver. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Michal Kubecek <mkubecek@suse.cz> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11devlink: don't allocate attrs on the stackJakub Kicinski1-4/+10
Number of devlink attributes has grown over 128, causing the following warning: ../net/core/devlink.c: In function ‘devlink_nl_cmd_region_read_dumpit’: ../net/core/devlink.c:3740:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=] } ^ Since the number of attributes is only going to grow allocate the array dynamically. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11devlink: fix condition for compat device infoJakub Kicinski1-1/+1
We need the port to be both ethernet and have the rigth netdev, not one or the other. Fixes: ddb6e99e2db1 ("ethtool: add compat for devlink info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>