aboutsummaryrefslogtreecommitdiffstats
path: root/include/uapi (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-05-27RDMA/mlx5: Set ECE options during modify QPLeon Romanovsky1-1/+2
The most common way to set ECE option will be during modify QP command in INIT2RTR, RTR2RTS and RTS2RTS stages, so update mlx5 to support it. The new bit in the comp_mask is needed to mark that kernel supports ECE and can receive data instead of "reserved" field in the struct mlx5_ib_modify_qp. Link: https://lore.kernel.org/r/20200526115440.205922-8-leon@kernel.org Reviewed-by: Mark Zhang <markz@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/mlx5: Set ECE options during QP createLeon Romanovsky1-0/+2
Allow users to ask creation of QPs with specific ECE options. Such early set even before RDMA-CM connection is established is useful if user knows exactly which option he needs. Link: https://lore.kernel.org/r/20200526115440.205922-4-leon@kernel.org Reviewed-by: Mark Zhang <markz@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/mlx5: Get ECE options from FW during create QPLeon Romanovsky1-1/+1
Supported ECE options are returned from FW in the create_qp phase and zero means that field is not valid. Such default value allows us to reuse reserved field without worries about comp_mask. Update create QP API to return ECE options. Link: https://lore.kernel.org/r/20200526115440.205922-3-leon@kernel.org Reviewed-by: Mark Zhang <markz@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/cma: Provide ECE reject reasonLeon Romanovsky1-1/+2
IBTA declares "vendor option not supported" reject reason in REJ messages if passive side doesn't want to accept proposed ECE options. Due to the fact that ECE is managed by userspace, there is a need to let users to provide such rejected reason. Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/cma: Connect ECE to rdma_acceptLeon Romanovsky1-0/+1
The rdma_accept() is called by both passive and active sides of CMID connection to mark readiness to start data transfer. For passive side, this is called explicitly, for active side, it is called implicitly while receiving REP message. Provide ECE data to rdma_accept function needed for passive side to send that REP message. Link: https://lore.kernel.org/r/20200526103304.196371-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/ucma: Deliver ECE parameters through UCMA eventsLeon Romanovsky1-0/+1
Passive side of CMID connection receives ECE request through REQ message and needs to respond with relevant REP message which will be forwarded to active side. The UCMA events interface is responsible for such communication with the user space (librdmacm). Extend it to provide ECE wire data. Link: https://lore.kernel.org/r/20200526103304.196371-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27RDMA/ucma: Extend ucma_connect to receive ECE parametersLeon Romanovsky1-0/+6
Active side of CMID initiates connection through librdmacm's rdma_connect() and kernel's ucma_connect(). Extend UCMA interface to handle those new parameters. Link: https://lore.kernel.org/r/20200526103304.196371-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27bridge: mrp: Rework the MRP netlink interfaceHoratiu Vultur1-7/+57
This patch reworks the MRP netlink interface. Before, each attribute represented a binary structure which made it hard to be extended. Therefore update the MRP netlink interface such that each existing attribute to be a nested attribute which contains the fields of the binary structures. In this way the MRP netlink interface can be extended without breaking the backwards compatibility. It is also using strict checking for attributes under the MRP top attribute. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27/dev/mem: Revoke mappings when a driver claims the regionDan Williams1-0/+1
Close the hole of holding a mapping over kernel driver takeover event of a given address range. Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges") introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the kernel against scenarios where a /dev/mem user tramples memory that a kernel driver owns. However, this protection only prevents *new* read(), write() and mmap() requests. Established mappings prior to the driver calling request_mem_region() are left alone. Especially with persistent memory, and the core kernel metadata that is stored there, there are plentiful scenarios for a /dev/mem user to violate the expectations of the driver and cause amplified damage. Teach request_mem_region() to find and shoot down active /dev/mem mappings that it believes it has successfully claimed for the exclusive use of the driver. Effectively a driver call to request_mem_region() becomes a hole-punch on the /dev/mem device. The typical usage of unmap_mapping_range() is part of truncate_pagecache() to punch a hole in a file, but in this case the implementation is only doing the "first half" of a hole punch. Namely it is just evacuating current established mappings of the "hole", and it relies on the fact that /dev/mem establishes mappings in terms of absolute physical address offsets. Once existing mmap users are invalidated they can attempt to re-establish the mapping, or attempt to continue issuing read(2) / write(2) to the invalidated extent, but they will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can block those subsequent accesses. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27wireless: Use linux/stddef.h instead of stddef.hHauke Mehrtens1-1/+5
When compiling inside the kernel include linux/stddef.h instead of stddef.h. When I compile this header file in backports for power PC I run into a conflict with ptrdiff_t. I was unable to reproduce this in mainline kernel. I still would like to fix this problem in the kernel. Fixes: 6989310f5d43 ("wireless: Use offsetof instead of custom macro.") Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Link: https://lore.kernel.org/r/20200521201422.16493-1-hauke@hauke-m.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27nl80211: Add support to configure TID specific Tx rate configurationTamizh Chelvam1-0/+21
This patch adds support to configure per TID Tx Rate configuration through NL80211_TID_CONFIG_ATTR_TX_RATE* attributes. And it uses nl80211_parse_tx_bitrate_mask api to validate the Tx rate mask. Signed-off-by: Tamizh Chelvam <tamizhr@codeaurora.org> Link: https://lore.kernel.org/r/1589357504-10175-1-git-send-email-tamizhr@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27nl80211: add ability to report TX status for control port TXMarkus Theil1-0/+12
This adds the necessary capabilities in nl80211 to allow drivers to assign a cookie to control port TX frames (returned via extack in the netlink ACK message of the command) and then later report the frame's status. Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de> Link: https://lore.kernel.org/r/20200508144202.7678-2-markus.theil@tu-ilmenau.de [use extack cookie instead of explicit message, recombine patches] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27nl80211: support scan frequencies in KHzThomas Pedersen1-1/+12
If the driver advertises NL80211_EXT_FEATURE_SCAN_FREQ_KHZ userspace can omit NL80211_ATTR_SCAN_FREQUENCIES in favor of an NL80211_ATTR_SCAN_FREQ_KHZ. To get scan results in KHz userspace must also set the NL80211_SCAN_FLAG_FREQ_KHZ. This lets nl80211 remain compatible with older userspaces while not requring and sending redundant (and potentially incorrect) scan frequency sets. Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20200430172554.18383-4-thomas@adapt-ip.com [use just nla_nest_start() (not _noflag) for NL80211_ATTR_SCAN_FREQ_KHZ] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27nl80211: add KHz frequency offset for most wifi commandsThomas Pedersen1-17/+33
cfg80211 recently gained the ability to understand a frequency offset component in KHz. Expose this in nl80211 through the new attributes NL80211_ATTR_WIPHY_FREQ_OFFSET, NL80211_FREQUENCY_ATTR_OFFSET, NL80211_ATTR_CENTER_FREQ1_OFFSET, and NL80211_BSS_FREQUENCY_OFFSET. These add support to send and receive a KHz offset component with the following NL80211 commands: - NL80211_CMD_FRAME - NL80211_CMD_GET_SCAN - NL80211_CMD_AUTHENTICATE - NL80211_CMD_ASSOCIATE - NL80211_CMD_CONNECT Along with any other command which takes a chandef, ie: - NL80211_CMD_SET_CHANNEL - NL80211_CMD_SET_WIPHY - NL80211_CMD_START_AP - NL80211_CMD_RADAR_DETECT - NL80211_CMD_NOTIFY_RADAR - NL80211_CMD_CHANNEL_SWITCH - NL80211_JOIN_IBSS - NL80211_CMD_REMAIN_ON_CHANNEL - NL80211_CMD_JOIN_OCB - NL80211_CMD_JOIN_MESH - NL80211_CMD_TDLS_CHANNEL_SWITCH If the driver advertises a band containing channels with frequency offset, it must also verify support for frequency offset channels in its cfg80211 ops, or return an error. Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20200430172554.18383-3-thomas@adapt-ip.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27nl80211: simplify peer specific TID configurationSergey Matyukevich1-6/+4
Current rule for applying TID configuration for specific peer looks overly complicated. No need to reject new TID configuration when override flag is specified. Another call with the same TID configuration, but without override flag, allows to apply new configuration anyway. Use the same approach as for the 'all peers' case: if override flag is specified, then reset existing TID configuration and immediately apply a new one. Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com> Link: https://lore.kernel.org/r/20200424112905.26770-5-sergey.matyukevich.os@quantenna.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27cfg80211: add support for TID specific AMSDU configurationSergey Matyukevich1-3/+7
This patch adds support to control per TID MSDU aggregation using the NL80211_TID_CONFIG_ATTR_AMSDU_CTRL attribute. Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com> Link: https://lore.kernel.org/r/20200424112905.26770-4-sergey.matyukevich.os@quantenna.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-26net: ethtool: Allow PHY cable test TDR data to configuredAndrew Lunn1-0/+13
Allow the user to configure where on the cable the TDR data should be retrieved, in terms of first and last sample, and the step between samples. Also add the ability to ask for TDR data for just one pair. If this configuration is not provided, it defaults to 1-150m at 1m intervals for all pairs. Signed-off-by: Andrew Lunn <andrew@lunn.ch> v3: Move the TDR configuration into a structure Add a range check on step Use NL_SET_ERR_MSG_ATTR() when appropriate Move TDR configuration into a nest Document attributes in the request Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26net: ethtool: Add attributes for cable test TDR dataAndrew Lunn1-0/+63
Some Ethernet PHYs can return the raw time domain reflectromatry data. Add the attributes to allow this data to be requested and returned via netlink ethtool. Signed-off-by: Andrew Lunn <andrew@lunn.ch> v2: m -> cm Report what the PHY actually used for start/stop/step. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-nextDavid S. Miller1-0/+23
Johannes Berg says: ==================== One batch of changes, containing: * hwsim improvements from Jouni and myself, to be able to test more scenarios easily * some more HE (802.11ax) support * some initial S1G (sub 1 GHz) work for fractional MHz channels * some (action) frame registration updates to help DPP support * along with other various improvements/fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26cls_flower: Support filtering on multiple MPLS Label Stack EntriesGuillaume Nault1-0/+23
With struct flow_dissector_key_mpls now recording the first FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of these LSEs independently. In order to avoid creating new netlink attributes for every possible depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute that contains the list of LSEs to match. Each LSE is represented by another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains the attributes representing the depth and the MPLS fields to match at this depth (label, TTL, etc.). For each MPLS field, the mask is always set to all-ones, as this is what the original API did. We could allow user configurable masks in the future if there is demand for more flexibility. The new API also allows to only specify an LSE depth. In that case, Flower only verifies that the MPLS label stack depth is greater or equal to the provided depth (that is, an LSE exists at this depth). Filters that only match on one (or more) fields of the first LSE are dumped using the old netlink attributes, to avoid confusing user space programs that don't understand the new API. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25Merge tag 'tee-subsys-for-5.8' of git://git.linaro.org/people/jens.wiklander/linux-tee into arm/driversArnd Bergmann1-0/+9
TEE subsystem work - Reserve GlobalPlatform implementation defined logon method range - Add support to register kernel memory with TEE to allow TEE bus drivers to register memory references. * tag 'tee-subsys-for-5.8' of git://git.linaro.org/people/jens.wiklander/linux-tee: tee: add private login method for kernel clients tee: enable support to register kernel memory Link: https://lore.kernel.org/r/20200504181049.GA10860@jade Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-05-25btrfs: remove more obsolete v0 extent ref declarationsDavid Sterba1-9/+0
The extent references v0 have been superseded long time go, there are some unused declarations of access helpers. We can safely remove them now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct btrfs_extent_item_v0 is still part of a backward compatibility check in relocation.c and thus not removed. Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-13/+95
The MSCC bug fix in 'net' had to be slightly adjusted because the register accesses are done slightly differently in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-0/+4
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-05-23 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 8 day(s) which contain a total of 109 files changed, 2776 insertions(+), 2887 deletions(-). The main changes are: 1) Add a new AF_XDP buffer allocation API to the core in order to help lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe as well as mlx5 have been moved over to the new API and also gained a small improvement in performance, from Björn Töpel and Magnus Karlsson. 2) Add getpeername()/getsockname() attach types for BPF sock_addr programs in order to allow for e.g. reverse translation of load-balancer backend to service address/port tuple from a connected peer, from Daniel Borkmann. 3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers being non-NULL, e.g. if after an initial test another non-NULL test on that pointer follows in a given path, then it can be pruned right away, from John Fastabend. 4) Larger rework of BPF sockmap selftests to make output easier to understand and to reduce overall runtime as well as adding new BPF kTLS selftests that run in combination with sockmap, also from John Fastabend. 5) Batch of misc updates to BPF selftests including fixing up test_align to match verifier output again and moving it under test_progs, allowing bpf_iter selftest to compile on machines with older vmlinux.h, and updating config options for lirc and v6 segment routing helpers, from Stanislav Fomichev, Andrii Nakryiko and Alan Maguire. 6) Conversion of BPF tracing samples outdated internal BPF loader to use libbpf API instead, from Daniel T. Lee. 7) Follow-up to BPF kernel test infrastructure in order to fix a flake in the XDP selftests, from Jesper Dangaard Brouer. 8) Minor improvements to libbpf's internal hashmap implementation, from Ian Rogers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22vxlan: ecmp support for mac fdb entriesRoopa Prabhu1-0/+1
Todays vxlan mac fdb entries can point to multiple remote ips (rdsts) with the sole purpose of replicating broadcast-multicast and unknown unicast packets to those remote ips. E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (E-VPN multihoming is analogous to multi-homed LAG implementations, but with the inter-switch peerlink replaced with a vxlan tunnel). In other words it needs support for mac ecmp. Furthermore, for faster convergence, E-VPN multihoming needs the ability to update fdb ecmp nexthops independent of the fdb entries. New route nexthop API is perfect for this usecase. This patch extends the vxlan fdb code to take a nexthop id pointing to an ecmp nexthop group. Changes include: - New NDA_NH_ID attribute for fdbs - Use the newly added fdb nexthop groups - makes vxlan rdsts and nexthop handling code mutually exclusive - since this is a new use-case and the requirement is for ecmp nexthop groups, the fdb add and update path checks that the nexthop is really an ecmp nexthop group. This check can be relaxed in the future, if we want to introduce replication fdb nexthop groups and allow its use in lieu of current rdst lists. - fdb update requests with nexthop id's only allowed for existing fdb's that have nexthop id's - learning will not override an existing fdb entry with nexthop group - I have wrapped the switchdev offload code around the presence of rdst [1] E-VPN RFC https://tools.ietf.org/html/rfc7432 [2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365 [3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf Includes a null check fix in vxlan_xmit from Nikolay v2 - Fixed build issue: Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22nexthop: support for fdb ecmp nexthopsRoopa Prabhu1-0/+3
This patch introduces ecmp nexthops and nexthop groups for mac fdb entries. In subsequent patches this is used by the vxlan driver fdb entries. The use case is E-VPN multihoming [1,2,3] which requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (This is analogous to a multi-homed LAG but over vxlan). Changes include new nexthop flag NHA_FDB for nexthops referenced by fdb entries. These nexthops only have ip. This patch includes appropriate checks to avoid routes referencing such nexthops. example: $ip nexthop add id 12 via 172.16.1.2 fdb $ip nexthop add id 13 via 172.16.1.3 fdb $ip nexthop add id 102 group 12/13 fdb $bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self [1] E-VPN https://tools.ietf.org/html/rfc7432 [2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365 [3] LPC talk with mention of nexthop groups for L2 ecmp http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf v4 - fixed uninitialized variable reported by kernel test robot Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22drm: Generalized NV Block Linear DRM format modJames Jones1-8/+114
Builds upon the existing NVIDIA 16Bx2 block linear format modifiers by adding more "fields" to the existing parameterized DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK format modifier macro that allow fully defining a unique-across- all-NVIDIA-hardware bit layout using a minimal set of fields and values. The new modifier macro DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D is effectively backwards compatible with the existing macro, introducing a superset of the previously definable format modifiers. Backwards compatibility has two quirks. First, the zero value for the "kind" field, which is implied by the DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK macro, must be special cased in drivers and assumed to map to the pre-Turing generic kind of 0xfe, since a kind of "zero" is reserved for linear buffer layouts on all GPUs. Second, it is assumed backwards compatibility is only needed when running on Tegra GPUs, and specifically Tegra GPUs prior to Xavier. This is based on two assertions: -Tegra GPUs prior to Xavier used a slightly different raw bit layout than desktop GPUs, making it impossible to directly share block linear buffers between the two. -Support for the existing block linear modifiers was incomplete, making them useful only for exporting buffers created by nouveau and importing them to Tegra DRM as framebuffers for scan out. There was no support for adding framebuffers using format modifiers in nouveau, nor importing dma-buf/PRIME GEM objects into nouveau userspace drivers with modifiers in Mesa. Hence it is assumed the prior modifiers were not intended for use on desktop GPUs, and as a corollary, were not intended to support sharing block linear buffers across two different NVIDIA GPUs. v2: - Added canonicalize helper function v3: - Added additional bit to compression field to support Tesla (NV5x,G8x,G9x,GT1xx,GT2xx) class chips. Signed-off-by: James Jones <jajones@nvidia.com> Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2020-05-21ethtool: provide UAPI for PHY Signal Quality Index (SQI)Oleksij Rempel1-0/+2
Signal Quality Index is a mandatory value required by "OPEN Alliance SIG" for the 100Base-T1 PHYs [1]. This indicator can be used for cable integrity diagnostic and investigating other noise sources and implement by at least two vendors: NXP[2] and TI[3]. [1] http://www.opensig.org/download/document/218/Advanced_PHY_features_for_automotive_Ethernet_V1.0.pdf [2] https://www.nxp.com/docs/en/data-sheet/TJA1100.pdf [3] https://www.ti.com/product/DP83TC811R-Q1 Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21net: psample: Add tunnel supportChris Mi1-0/+22
Currently, psample can only send the packet bits after decapsulation. The tunnel information is lost. Add the tunnel support. If the sampled packet has no tunnel info, the behavior is the same as before. If it has, add a nested metadata field named PSAMPLE_ATTR_TUNNEL and include the tunnel subfields if applicable. Increase the metadata length for sampled packet with the tunnel info. If new subfields of tunnel info should be included, update the metadata length accordingly. Signed-off-by: Chris Mi <chrism@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21IB/uverbs: Introduce create/destroy QP commands over ioctlYishai Hadas2-0/+37
Introduce create/destroy QP commands over the ioctl interface to let it be extended to get an asynchronous event FD. Link: https://lore.kernel.org/r/20200519072711.257271-8-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21IB/uverbs: Introduce create/destroy WQ commands over ioctlYishai Hadas1-0/+25
Introduce create/destroy WQ commands over the ioctl interface to let it be extended to get an asynchronous event FD. Link: https://lore.kernel.org/r/20200519072711.257271-7-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21IB/uverbs: Introduce create/destroy SRQ commands over ioctlYishai Hadas1-0/+27
Introduce create/destroy SRQ commands over the ioctl interface to let it be extended to get an asynchronous event FD. Link: https://lore.kernel.org/r/20200519072711.257271-6-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21IB/uverbs: Move QP, SRQ, WQ type and flags to UAPIYishai Hadas1-0/+34
These constants are going to be used in the ioctl interface in coming patches so they are part of the UAPI, place them in the correct header for clarity. Link: https://lore.kernel.org/r/20200519072711.257271-5-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21IB/uverbs: Extend CQ to get its own asynchronous event FDYishai Hadas1-0/+1
Extend CQ to get its own asynchronous event FD. The event FD is an optional attribute, in case wasn't given the ufile event FD will be used. Link: https://lore.kernel.org/r/20200519072711.257271-4-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21Merge tag 'v5.7-rc6' into rdma.git for-nextJason Gunthorpe14-33/+133
Linux 5.7-rc6 Conflict in drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c resolved by deleting dr_cq_event, matching how netdev resolved it. Required for dependencies in the following patches. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21IB/hfi1: Add accelerated IP capability bitKaike Wan1-1/+2
The accelerated IP capability bit is added to allow users to control which feature is enabled and disabled. Link: https://lore.kernel.org/r/20200511160541.173205.96870.stgit@awfm-01.aw.intel.com Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21loop: Add LOOP_CONFIGURE ioctlMartijn Coenen1-0/+21
This allows userspace to completely setup a loop device with a single ioctl, removing the in-between state where the device can be partially configured - eg the loop device has a backing file associated with it, but is reading from the wrong offset. Besides removing the intermediate state, another big benefit of this ioctl is that LOOP_SET_STATUS can be slow; the main reason for this slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to freeze the associated queue; this requires waiting for RCU synchronization, which I've measured can take about 15-20ms on this device on average. In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can also be used to: - Set the correct block size immediately by setting loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE) - Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO) - Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY in loop_config.info.lo_flags Here's setting up ~70 regular loop devices with an offset on an x86 Android device, using LOOP_SET_FD and LOOP_SET_STATUS: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m03.40s real 0m00.02s user 0m00.03s system Here's configuring ~70 devices in the same way, but using a modified losetup that uses the new LOOP_CONFIGURE ioctl: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m01.94s real 0m00.01s user 0m00.01s system Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Clean up LOOP_SET_STATUS lo_flags handlingMartijn Coenen1-2/+8
LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this explicit by updating the UAPI to include the flags that can be set/cleared using this ioctl. The implementation can then blindly take over the passed in flags, and use the previous flags for those flags that can't be set / cleared using LOOP_SET_STATUS. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEADPaolo Bonzini12-17/+34
2020-05-20Merge tag 'amd-drm-next-5.8-2020-05-19' of git://people.freedesktop.org/~agd5f/linux into drm-nextDave Airlie1-0/+4
amd-drm-next-5.8-2020-05-19: amdgpu: - Improved handling for CTF (Critical Thermal Fault) situations - Clarify AC/DC mode switches - SR-IOV fixes - XGMI fixes for RAS - Misc cleanups - Add autodump debugfs node to aid in GPU hang debugging UAPI: - Add a MEM_SYNC IB flag for handling proper acquire memory semantics if UMDs expect the kernel to handle this Used by AMDVLK: https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/os/amdgpu/amdgpuQueue.cpp#L1262 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200519202505.4126-1-alexander.deucher@amd.com
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann1-0/+4
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19fscrypt: add support for IV_INO_LBLK_32 policiesEric Biggers1-1/+2
The eMMC inline crypto standard will only specify 32 DUN bits (a.k.a. IV bits), unlike UFS's 64. IV_INO_LBLK_64 is therefore not applicable, but an encryption format which uses one key per policy and permits the moving of encrypted file contents (as f2fs's garbage collector requires) is still desirable. To support such hardware, add a new encryption format IV_INO_LBLK_32 that makes the best use of the 32 bits: the IV is set to 'SipHash-2-4(inode_number) + file_logical_block_number mod 2^32', where the SipHash key is derived from the fscrypt master key. We hash only the inode number and not also the block number, because we need to maintain contiguity of DUNs to merge bios. Unlike with IV_INO_LBLK_64, with this format IV reuse is possible; this is unavoidable given the size of the DUN. This means this format should only be used where the requirements of the first paragraph apply. However, the hash spreads out the IVs in the whole usable range, and the use of a keyed hash makes it difficult for an attacker to determine which files use which IVs. Besides the above differences, this flag works like IV_INO_LBLK_64 in that on ext4 it is only allowed if the stable_inodes feature has been enabled to prevent inode numbers and the filesystem UUID from changing. Link: https://lore.kernel.org/r/20200515204141.251098-1-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Paul Crowley <paulcrowley@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-19watch_queue: Add a key/keyring notification facilityDavid Howells2-1/+29
Add a key/keyring change notification facility whereby notifications about changes in key and keyring content and attributes can be received. Firstly, an event queue needs to be created: pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256); then a notification can be set up to report notifications via that queue: struct watch_notification_filter filter = { .nr_filters = 1, .filters = { [0] = { .type = WATCH_TYPE_KEY_NOTIFY, .subtype_filter[0] = UINT_MAX, }, }, }; ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); After that, records will be placed into the queue when events occur in which keys are changed in some way. Records are of the following format: struct key_notification { struct watch_notification watch; __u32 key_id; __u32 aux; } *n; Where: n->watch.type will be WATCH_TYPE_KEY_NOTIFY. n->watch.subtype will indicate the type of event, such as NOTIFY_KEY_REVOKED. n->watch.info & WATCH_INFO_LENGTH will indicate the length of the record. n->watch.info & WATCH_INFO_ID will be the second argument to keyctl_watch_key(), shifted. n->key will be the ID of the affected key. n->aux will hold subtype-dependent information, such as the key being linked into the keyring specified by n->key in the case of NOTIFY_KEY_LINKED. Note that it is permissible for event records to be of variable length - or, at least, the length may be dependent on the subtype. Note also that the queue can be shared between multiple notifications of various types. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com>
2020-05-19pipe: Add general notification queue supportDavid Howells1-0/+20
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect. The way the notification queue is used is: (1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once): pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); (2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE. Notification messages include a length in the header so that the caller can split them up. Each message has a header that describes it: struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; }; The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags. Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19pipe: Add O_NOTIFICATION_PIPEDavid Howells1-0/+3
Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate that the pipe being created is going to be used for notifications. This suppresses the use of splice(), vmsplice(), tee() and sendfile() on the pipe as calling iov_iter_revert() on a pipe when a kernel notification message has been inserted into the middle of a multi-buffer splice will be messy. The flag is given the same value as O_EXCL as it seems unlikely that this flag will ever be applicable to pipes and I don't want to use up another O_* bit unnecessarily. An alternative could be to add a pipe3() system call. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19uapi: General notification queue definitionsDavid Howells1-0/+55
Add UAPI definitions for the general notification queue, including the following pieces: (*) struct watch_notification. This is the metadata header for notification messages. It includes a type and subtype that indicate the source of the message (eg. WATCH_TYPE_MOUNT_NOTIFY) and the kind of the message (eg. NOTIFY_MOUNT_NEW_MOUNT). The header also contains an information field that conveys the following information: - WATCH_INFO_LENGTH. The size of the entry (entries are variable length). - WATCH_INFO_ID. The watch ID specified when the watchpoint was set. - WATCH_INFO_TYPE_INFO. (Sub)type-specific information. - WATCH_INFO_FLAG_*. Flag bits overlain on the type-specific information. For use by the type. All the information in the header can be used in filtering messages at the point of writing into the buffer. (*) struct watch_notification_removal This is an extended watch-removal notification record that includes an 'id' field that can indicate the identifier of the object being removed if available (for instance, a keyring serial number). Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19uapi: habanalabs: add gaudi definesOded Gabbay1-2/+162
Add the new defines for GAUDI uapi interface. It includes the queue IDs, the engine IDs, SRAM reserved space and Sync Manager reserved resources. There is no new IOCTL or additional operations in existing IOCTLs. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-05-19habanalabs: get card type, location from F/WOmer Shpigelman1-1/+2
For Gaudi the driver gets two new additional properties from the F/W: 1. The card's type - PCI or PMC 2. The card's location in the Gaudi's box (relevant only for PMC). The card's location is also passed to the user in the HW IP info structure as it needs this property for establishing communication between Gaudis. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-05-19uapi: habanalabs: add signal/wait operationsOmer Shpigelman1-16/+51
This is a pre-requisite to upstreaming GAUDI support. Signal/wait operations are done by the user to perform sync between two Primary Queues (PQs). The sync is done using the sync manager and it is usually resolved inside the device, but sometimes it can be resolved in the host, i.e. the user should be able to wait in the host until a signal has been completed. The mechanism to define signal and wait operations is done by the driver because it needs atomicity and serialization, which is already done in the driver when submitting work to the different queues. To implement this feature, the driver "takes" a couple of h/w resources, and this is reflected by the defines added to the uapi file. The signal/wait operations are done via the existing CS IOCTL, and they use the same data structure. There is a difference in the meaning of some of the parameters, and for that we added unions to make the code more readable. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-05-19habanalabs: leave space for 2xMSG_PROT in CBOded Gabbay1-1/+2
The user must leave space for 2xMSG_PROT in the external CB, so adjust the define of max size accordingly. The driver, however, can still create a CB with the maximum size of 2MB. Therefore, we need to add a check specifically for the user requested size. Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>