aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/infiniband/core (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-03RDMA: Introduce and use GID attr helper to read RoCE L2 fieldsParav Pandit1-0/+55
Instead of RoCE drivers figuring out vlan, smac fields while working on QP/AH, provide a helper routine to read the L2 fields such as vlan_id and source mac address. This moves logic from mlx5 driver to core for wider usage for RoCE ports. This is a preparation patch to allow detaching netdev in subsequent patch. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03IB/cm: Reduce dependency on gid attribute ndev checkParav Pandit1-2/+3
GID type to path record type conversion can be done directly based on port type and gid attribute type. There is no need to find out using indirect way by its GID attribute's ndev field. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03RDMA: Get rid of iw_cm_verbsKamal Heib2-17/+26
Integrate iw_cm_verbs data members into ib_device_ops and ib_device structs, this is done to achieve the following: 1) Avoid memory related bugs durring error unwind 2) Make the code more cleaner 3) Reduce code duplication Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03RDMA/core: Do not invoke init_port on compat devicesParav Pandit3-11/+10
The driver interface cannot manipulate the sysfs of the compat device, only of the full device so we must avoid calling the driver sysfs APIs on compat devices. This prevents an oops: Call Trace: dump_stack+0x5a/0x73 kobject_init+0x74/0x80 kobject_init_and_add+0x35/0xb0 hfi1_create_port_files+0x6e/0x3c0 [hfi1] ib_setup_port_attrs+0x43b/0x560 [ib_core] add_one_compat_dev+0x16a/0x230 [ib_core] rdma_dev_init_net+0x110/0x160 [ib_core] ops_init+0x38/0xf0 setup_net+0xcf/0x1e0 copy_net_ns+0xb7/0x130 create_new_namespaces+0x11a/0x1b0 unshare_nsproxy_namespaces+0x55/0xa0 ksys_unshare+0x1a7/0x340 __x64_sys_unshare+0xe/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 5417783eabb2 ("RDMA/core: Support core port attributes in non init_net") Reported-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03IB/core: Set qp->real_qp before it may be accessedArtemy Kovalyov3-2/+1
real_qp should be initialized before ib_destroy_qp() is called. ib_destroy_qp() may be called in the error flow if ib_create_qp_security() failed. Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03RDMA/uverbs: Initialize uverbs_attr_bundle ucontext in ib_uverbs_get_contextShamir Rabinovitch1-0/+2
ib_uverbs_get_context does not have a uobject so it does not call the rdma_lookup_get_uobject which is used to set up the uverbs_attr_bundle ucontext. For ib_uverbs_get_context we need to set up this manually before we send the uverbs_attr_bundle down to the driver layer. This completes the change that was done in commit 70f06b26f07e ("IB: ucontext should be set properly for all cmd & ioctl paths") Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-3/+50
Three trivial overlapping conflicts. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02RDMA/uverbs: Initialize udata struct on destroy flowsGal Pressman1-0/+2
Cited commit introduced the udata parameter to different destroy flows but the uapi method definition does not have udata (i.e has_udata flag is not set). As a result, an uninitialized udata struct is being passed down to the driver callbacks. Fix that by clearing the driver udata even in cases where has_udata flag is not set. Fixes: c4367a26357b ("IB: Pass uverbs_attr_bundle down ib_x destroy path") Cc: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Co-developed-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-02RDMA/umem: Handle page combining avoidance correctly in ib_umem_add_sg_table()Shiraz Saleem1-6/+10
The flag update_cur_sg tracks whether contiguous pages from a new set of page_list pages can be merged into the SGE passed into ib_umem_add_sg_table(). If this flag is true, but the total segment length exceeds the max_seg_size supported by HW, we avoid combining to this SGE and move to a new SGE (x) and merge 'len' pages to it. However, if i < npages, the next iteration can incorrectly merge 'len' contiguous pages into x instead of into a new SGE since update_cur_sg is still true. Reset update_cur_sg to false always after the check to merge pages into the first SGE passed in to ib_umem_add_sg_table(). Also, prevent a new SGE's segment length from ever exceeding HW max_seg_sz. There is a crash on hfi1 as result of this where-in max_seg_sz is defaulting to 64K. Due to above bug, unfolding SGE's in __ib_umem_release points to a bad page ptr. TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093 BUG: Bad page state in process ib_write_bw pfn:7ebca0 page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1 flags: 0x17ffffc0000000() raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000 raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 page dumped because: nonzero mapcount CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G B 5.1.0-rc4 #1 Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015 Call Trace: dump_stack+0x5a/0x73 bad_page+0xf5/0x10f free_pcppages_bulk+0x62c/0x680 free_unref_page+0x54/0x70 __ib_umem_release+0x148/0x1a0 [ib_uverbs] ib_umem_release+0x22/0x80 [ib_uverbs] rvt_dereg_mr+0x67/0xb0 [rdmavt] ib_dereg_mr_user+0x37/0x60 [ib_core] destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs] uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs] uobj_destroy+0x4d/0x60 [ib_uverbs] __uobj_get_destroy+0x33/0x50 [ib_uverbs] __uobj_perform_destroy+0xa/0x30 [ib_uverbs] ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs] ib_uverbs_write+0x3e1/0x500 [ib_uverbs] vfs_write+0xad/0x1b0 ksys_write+0x5a/0xd0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs") Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-01RDMA/core: Introduce RDMA subsystem ibdev_* print functionsGal Pressman1-0/+60
Similarly to dev/netdev/etc printk helpers, add standard printk helpers for the RDMA subsystem. Example output: efa 0000:00:06.0 efa_0: Hello World! efa_0: Hello World! (no parent device set) (NULL ib_device): Hello World! (ibdev is NULL) Cc: Jason Baron <jbaron@akamai.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Suggested-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-04-29rdma: fix build errors on s390 and MIPS due to bad ZERO_PAGE useLinus Torvalds1-1/+1
The parameter to ZERO_PAGE() was wrong, but since all architectures except for MIPS and s390 ignore it, it wasn't noticed until 0-day reported the build error. Fixes: 67f269b37f9b ("RDMA/ucontext: Fix regression with disassociate") Cc: stable@vger.kernel.org Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2-3/+50
Pull rdma fixes from Jason Gunthorpe: "One core bug fix and a few driver ones - FRWR memory registration for hfi1/qib didn't work with with some iovas causing a NFSoRDMA failure regression due to a fix in the NFS side - A command flow error in mlx5 allowed user space to send a corrupt command (and also smash the kernel stack we've since learned) - Fix a regression and some bugs with device hot unplug that was discovered while reviewing Andrea's patches - hns has a failure if the user asks for certain QP configurations" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/hns: Bugfix for mapping user db RDMA/ucontext: Fix regression with disassociate RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages RDMA/mlx5: Do not allow the user to write to the clock page IB/mlx5: Fix scatter to CQE in DCT QP creation IB/rdmavt: Fix frwr memory registration
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg4-28/+28
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek1-4/+5
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+3
Two easy cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25uverbs: Convert idr to XArrayMatthew Wilcox2-55/+25
The word 'idr' is scattered throughout the API, so I haven't changed it, but the 'idr' variable is now an XArray. Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24Merge branch 'rdma_mmap' into rdma.git for-nextJason Gunthorpe2-53/+60
Jason Gunthorpe says: ==================== Upon review it turns out there are some long standing problems in BAR mapping area: * BAR pages intended for read-only can be switched to writable via mprotect. * Missing use of rdma_user_mmap_io for the mlx5 clock BAR page. * Disassociate causes SIGBUS when touching the pages. * CPU pages are being mapped through to the process via remap_pfn_range instead of the more appropriate vm_insert_page, causing weird behaviors during disassociation. This series adds the missing VM_* flag manipulation, adds faulting a zero page for disassociation and revises the CPU page mappings to use vm_insert_page. ==================== For dependencies this branch is based on for-rc from git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git * branch 'rdma_mmap': RDMA: Remove rdma_user_mmap_page RDMA/mlx5: Use get_zeroed_page() for clock_info RDMA/ucontext: Fix regression with disassociate RDMA/mlx5: Use rdma_user_map_io for mapping BAR pages RDMA/mlx5: Do not allow the user to write to the clock page Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24RDMA: Remove rdma_user_mmap_pageJason Gunthorpe1-51/+11
Upon further research drivers that want this should simply call the core function vm_insert_page(). The VMA holds a reference on the page and it will be automatically freed when the last reference drops. No need for disassociate to sequence the cleanup. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24RDMA/ucontext: Fix regression with disassociateJason Gunthorpe2-3/+50
When this code was consolidated the intention was that the VMA would become backed by anonymous zero pages after the zap_vma_pte - however this very subtly relied on setting the vm_ops = NULL and clearing the VM_SHARED bits to transform the VMA into an anonymous VMA. Since the vm_ops was removed this broke. Now userspace gets a SIGBUS if it touches the vma after disassociation. Instead of converting the VMA to anonymous provide a fault handler that puts a zero'd page into the VMA when user-space touches it after disassociation. Cc: stable@vger.kernel.org Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Fixes: 5f9794dc94f5 ("RDMA/ucontext: Add a core API for mmaping driver IO memory") Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24RDMA/cma: Consider scope_id while binding to ipv6 ll addressParav Pandit1-6/+19
When two netdev have same link local addresses (such as vlan and non vlan), two rdma cm listen id should be able to bind to following different addresses. listener-1: addr=lla, scope_id=A, port=X listener-2: addr=lla, scope_id=B, port=X However while comparing the addresses only addr and port are considered, due to which 2nd listener fails to listen. In below example of two listeners, 2nd listener is failing with address in use error. $ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1 -p 4545& $ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1.200 -p 4545 rdma_bind_addr: Address already in use To overcome this, consider the scope_ids as well which forms the accurate IPv6 link local address. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-24IB/core: Allow vlan link local address based RoCE GIDsParav Pandit1-23/+4
IPv6 link local address for a VLAN netdevice has nothing to do with its resemblance with the default GID, because VLAN link local GID is in different layer 2 domain. Now that RoCE MAD packet processing and route resolution consider the right GID index, there is no need for an unnecessary check which prevents the addition of vlan based IPv6 link local GIDs. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22RDMA/core: Add a netlink command to change net namespace of rdma deviceParav Pandit3-6/+65
Provide an option to change the net namespace of a rdma device through a netlink command. When multiple rdma devices exists in a system, and when containers are used, this will limit rdma device visibility to a specified net namespace. An example command to change net namespace of mlx5_1 device to the previously created net namespace 'foo' is: $ ip netns add foo $ rdma dev set mlx5_1 netns foo Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22RDMA/core: Introduce a helper function to change net namespace of rdma deviceParav Pandit1-0/+77
Introduce a helper function that changes rdma device's net namespace which performs mini disable/enable sequence to have device visible only in assigned net namespace. Device unregistration, device rename and device change net namespace may be invoked concurrently. (a) device unregistration needs to wait if a device change (rename or net namespace change) operation is in progress. (b) device net namespace change should not proceed if the unregistration has started. (c) while one cpu is changing device net namespace, other cpu should not be able to rename or change net namespace. To address above concurrency, (a) Use unreg_mutex to synchronize between ib_unregister_device() and net namespace change operation (b) In cases where unregister_device() has started unregistration before change_netns got chance to acquire unreg_mutex, validate the refcount - if it dropped to zero, abort the net namespace change operation. Finally use the helper function to change net namespace of ib device to move the device back to init_net when such net is deleted. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-22RDMA/core: Avoid freeing netdevs in disable_device()Parav Pandit1-3/+4
So we can use the disable_device() helper while changing the net namespace of the rdma device in a subsequent patch, move free_netdevs() out of it. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-19coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumpingAndrea Arcangeli1-0/+3
The core dumping code has always run without holding the mmap_sem for writing, despite that is the only way to ensure that the entire vma layout will not change from under it. Only using some signal serialization on the processes belonging to the mm is not nearly enough. This was pointed out earlier. For example in Hugh's post from Jul 2017: https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils "Not strictly relevant here, but a related note: I was very surprised to discover, only quite recently, how handle_mm_fault() may be called without down_read(mmap_sem) - when core dumping. That seems a misguided optimization to me, which would also be nice to correct" In particular because the growsdown and growsup can move the vm_start/vm_end the various loops the core dump does around the vma will not be consistent if page faults can happen concurrently. Pretty much all users calling mmget_not_zero()/get_task_mm() and then taking the mmap_sem had the potential to introduce unexpected side effects in the core dumping code. Adding mmap_sem for writing around the ->core_dump invocation is a viable long term fix, but it requires removing all copy user and page faults and to replace them with get_dump_page() for all binary formats which is not suitable as a short term fix. For the time being this solution manually covers the places that can confuse the core dump either by altering the vma layout or the vma flags while it runs. Once ->core_dump runs under mmap_sem for writing the function mmget_still_valid() can be dropped. Allowing mmap_sem protected sections to run in parallel with the coredump provides some minor parallelism advantage to the swapoff code (which seems to be safe enough by never mangling any vma field and can keep doing swapins in parallel to the core dumping) and to some other corner case. In order to facilitate the backporting I added "Fixes: 86039bd3b4e6" however the side effect of this same race condition in /proc/pid/mem should be reproducible since before 2.6.12-rc2 so I couldn't add any other "Fixes:" because there's no hash beyond the git genesis commit. Because find_extend_vma() is the only location outside of the process context that could modify the "mm" structures under mmap_sem for reading, by adding the mmget_still_valid() check to it, all other cases that take the mmap_sem for reading don't need the new check after mmget_not_zero()/get_task_mm(). The expand_stack() in page fault context also doesn't need the new check, because all tasks under core dumping are frozen. Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Jann Horn <jannh@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-08ipv4: Prepare rtable for IPv6 gatewayDavid Ahern1-1/+1
To allow the gateway to be either an IPv4 or IPv6 address, remove rt_uses_gateway from rtable and replace with rt_gw_family. If rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway to rt_gw4 to represent the IPv4 version. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08RDMA/umem: Use correct value for SG entries in sg_copy_to_buffer()Shiraz Saleem1-2/+2
With page combining, the assumption that number of SG entries in umem SGL equal to number of system pages in umem no longer holds. umem->sg_nents tracks the SG entries in umem SGL. Use it in sg_pcopy_to_buffer() as opposed to ib_umem_num_pages(umem). Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs") Reported-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08RDMA: Handle SRQ allocations by IB/coreLeon Romanovsky3-43/+48
Convert SRQ allocation from drivers to be in the IB/core Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08RDMA: Handle AH allocations by IB/coreLeon Romanovsky2-17/+25
Simplify drivers by ensuring lifetime of ib_ah object. The changes in .create_ah() go hand in hand with relevant update in .destroy_ah(). We will use this opportunity and convert .destroy_ah() to don't fail, as it was suggested a long time ago, because there is nothing to do in case of failure during destroy. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08IB: When attrs.udata/ufile is available use that instead of uobjectJason Gunthorpe5-8/+8
The ucontext and ufile should not be accessed via the uobject, all these cases have an attrs so use that instead. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08RDMA/nldev: Return device protocolLeon Romanovsky1-1/+23
Add new RDMA_NLDEV_ATTR_DEV_PROTOCOL attribute to give ability for UDEV rules create IB device stable names based on link type protocol. The assumption that devices like mlx4 with duality in their link type under one IB device struct won't be allowed in the future. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08RDMA/cm: Move debug counters to be under relevant IB deviceLeon Romanovsky3-38/+57
The sysfs layout is created by CM incorrectly presented RDMA devices with InfiniBand link layer. Layout of such devices represents device tree of connections. By moving CM statistics to be under relevant port of IB device, we will fix the following issues: * Symlink name - It used device name instead of specific identifier. * Target location - It was supposed to point to PCI-ID/infiniband_cm/ instead of PCI-ID/infiniband/ * Target name - It created extra device file under already existing device folder, e.g. mlx5_0/mlx5_0 * Crash during boot with RDMA persistent naming patches. sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0' CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178 Call Trace: dump_stack+0xcc/0x180 sysfs_warn_dup.cold.3+0x17/0x2d sysfs_do_create_link_sd.isra.2+0xd0/0xf0 device_add+0x7cb/0x1450 device_create_groups_vargs+0x1ae/0x220 device_create+0x93/0xc0 cm_add_one+0x38f/0xf60 [ib_cm] add_client_context+0x167/0x210 [ib_core] enable_device_and_get+0x230/0x3f0 [ib_core] ib_register_device+0x823/0xbf0 [ib_core] __mlx5_ib_add+0x45/0x150 [mlx5_ib] mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib] mlx5_add_device+0x130/0x3a0 [mlx5_core] mlx5_register_interface+0x1a9/0x270 [mlx5_core] do_one_initcall+0x14f/0x5de do_init_module+0x247/0x7c0 load_module+0x4c2f/0x60d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe After this change: [leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/ drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_duplicates drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_msgs drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_msgs drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_retries Fixes: 110cf374a809 ("infiniband: make cm_device use a struct device and not a kobject.") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-08RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEsShiraz Saleem3-22/+86
Combine contiguous regions of PAGE_SIZE pages into single scatter list entry while building the scatter table for a umem. This minimizes the number of the entries in the scatter list and reduces the DMA mapping overhead, particularly with the IOMMU. Set default max_seg_size in core for IB devices to 2G and do not combine if we exceed this limit. Also, purge npages in struct ib_umem as we now DMA map the umem SGL with sg_nents and npage computation is not needed. Drivers should now be using ib_umem_num_pages(), so fix the last stragglers. Move npages tracking to ib_umem_odp as ODP drivers still need it. Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Tested-by: Gal Pressman <galpress@amazon.com> Tested-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-04RDMA/cm: Remove useless zeroing of static global variableLeon Romanovsky1-1/+0
Static global variables are initialized to zero by C standard, there is no need to zero them again. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-03RDMA/cma: Set proper port number as indexLeon Romanovsky1-1/+1
Conversion from IDR to XArray missed the fact that idr_alloc() returned index as a return value, this index was saved in port variable and used as query index later on. This caused to the following error. BUG: KASAN: use-after-free in cma_check_port+0x86a/0xa20 [rdma_cm] Read of size 8 at addr ffff888069fde998 by task ucmatose/387 CPU: 3 PID: 387 Comm: ucmatose Not tainted 5.1.0-rc2+ #253 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x7c/0xc0 print_address_description+0x6c/0x23c ? cma_check_port+0x86a/0xa20 [rdma_cm] kasan_report.cold.3+0x1c/0x35 ? cma_check_port+0x86a/0xa20 [rdma_cm] ? cma_check_port+0x86a/0xa20 [rdma_cm] cma_check_port+0x86a/0xa20 [rdma_cm] rdma_bind_addr+0x11bc/0x1b00 [rdma_cm] ? find_held_lock+0x33/0x1c0 ? cma_ndev_work_handler+0x180/0x180 [rdma_cm] ? wait_for_completion+0x3d0/0x3d0 ucma_bind+0x120/0x160 [rdma_ucm] ? ucma_resolve_addr+0x1a0/0x1a0 [rdma_ucm] ucma_write+0x1f8/0x2b0 [rdma_ucm] ? ucma_open+0x260/0x260 [rdma_ucm] vfs_write+0x157/0x460 ksys_write+0xb8/0x170 ? __ia32_sys_read+0xb0/0xb0 ? trace_hardirqs_off_caller+0x5b/0x160 ? do_syscall_64+0x18/0x3c0 do_syscall_64+0x95/0x3c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Allocated by task 381: __kasan_kmalloc.constprop.5+0xc1/0xd0 cma_alloc_port+0x4d/0x160 [rdma_cm] rdma_bind_addr+0x14e7/0x1b00 [rdma_cm] ucma_bind+0x120/0x160 [rdma_ucm] ucma_write+0x1f8/0x2b0 [rdma_ucm] vfs_write+0x157/0x460 ksys_write+0xb8/0x170 do_syscall_64+0x95/0x3c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 381: __kasan_slab_free+0x12e/0x180 kfree+0xed/0x290 rdma_destroy_id+0x6b6/0x9e0 [rdma_cm] ucma_close+0x110/0x300 [rdma_ucm] __fput+0x25a/0x740 task_work_run+0x10e/0x190 do_exit+0x85e/0x29e0 do_group_exit+0xf0/0x2e0 get_signal+0x2e0/0x17e0 do_signal+0x94/0x1570 exit_to_usermode_loop+0xfa/0x130 do_syscall_64+0x327/0x3c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: <syzbot+2e3e485d5697ea610460@syzkaller.appspotmail.com> Reported-by: Ran Rozenstein <ranro@mellanox.com> Fixes: 638267537ad9 ("cma: Convert portspace IDRs to XArray") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01IB: Pass only ib_udata in function prototypesShamir Rabinovitch4-11/+8
Now when ib_udata is passed to all the driver's object create/destroy APIs the ib_udata will carry the ib_ucontext for every user command. There is no need to also pass the ib_ucontext via the functions prototypes. Make ib_udata the only argument psssed. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01IB: Remove 'uobject->context' dependency in object destroy APIsShamir Rabinovitch4-11/+13
Now that we have the udata passed to all the ib_xxx object destroy APIs and the additional macro 'rdma_udata_to_drv_context' to get the ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally start to remove the dependency of the drivers in the ib_xxx->uobject->context. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01IB: Pass uverbs_attr_bundle down ib_x destroy pathShamir Rabinovitch8-64/+73
The uverbs_attr_bundle with the ucontext is sent down to the drivers ib_x destroy path as ib_udata. The next patch will use the ib_udata to free the drivers destroy path from the dependency in 'uobject->context' as we already did for the create path. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01IB: Pass uverbs_attr_bundle down uobject destroy pathShamir Rabinovitch10-67/+96
Pass uverbs_attr_bundle down the uobject destroy path. The next patch will use this to eliminate the dependecy of the drivers in ib_x->uobject pointers. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-04-01IB: ucontext should be set properly for all cmd & ioctl pathsShamir Rabinovitch4-61/+28
the Attempt to use the below commit to initialize the ucontext for the uobject destroy path has shown that the below commit is incomplete. Parts were reverted and the ucontext set up in the uverbs_attr_bundle was moved to rdma_lookup_get_uobject which is called from the uobj_get_XXX macros and rdma_alloc_begin_uobject which is called when uobject is created. Fixes: 3d9dfd060391 ("IB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows") Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-29ipv6: Move ipv6 stubs to a separate header fileDavid Ahern1-1/+1
The number of stubs is growing and has nothing to do with addrconf. Move the definition of the stubs to a separate header file and update users. In the move, drop the vxlan specific comment before ipv6_stub. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28RDMA/core: Add command to set ib_core device net namspace sharing modeParav Pandit3-0/+114
Add netlink command that enables/disables sharing rdma device among multiple net namespaces. Using rdma tool, $rdma sys set netns shared (default mode) When rdma subsystem netns mode is set to shared mode, rdma devices will be accessible in all net namespaces. Using rdma tool, $rdma sys set netns exclusive When rdma subsystem netns mode is set to exclusive mode, devices will be accessible in only one net namespace at any given point of time. If there are any net namespaces other than default init_net exists, while executing this command, it will fail and mode cannot be changed. To change this mode, netlink command is used instead of sysctl, because netlink command allows to auto load a module. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Add interface to read device namespace sharing modeParav Pandit3-1/+34
Add an interface via netlink command to query whether rdma devices are shared among multiple net namespaces or not. When using RDMAtool, it can be queried as, $rdma system show netns netns shared Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Extend ib_device_get_by_index for net namespaceParav Pandit3-11/+21
Extend ib_device_get_by_index() API to check device access for net namespace for serving netlink commands. Also enforce net ns check on dumpit commands which iterate over all registered rdma devices and which don't call ib_device_get_by_index(). Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA: Check net namespace access for uverbs, umad, cma and nldevParav Pandit4-0/+38
Introduce an API rdma_dev_access_netns() to check whether a rdma device can be accessed from the specified net namespace or not. Use rdma_dev_access_netns() while opening character uverbs, umad network device and also check while rdma cm_id binds to rdma device. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Add module param to disable device sharing among net nsParav Pandit1-0/+7
Add module parameter to change a sharing mode of ib_core early in the boot process. This parameter helps to those systems where modern up to date rdma tool (iproute2) package may not be available during kernel upgrade cycle. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Support core port attributes in non init_netParav Pandit3-7/+18
Now that sysfs compatibility layer for non init_net exists, add core port attributes such as pkey and gid table to non init_net ns. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Implement compat device/sysfs tree in net namespaceParav Pandit1-4/+257
Implement compatibility layer sysfs entries of ib_core so that non init_net net namespaces can also discover rdma devices. Each non init_net net namespace has ib_core_device created in it. Such ib_core_device sysfs tree resembles rdma devices found in init_net namespace. This allows discovering rdma devices in multiple non init_net net namespaces via sysfs entries and helpful to rdma-core userspace. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Restrict sysfs entries view to init_netParav Pandit1-1/+11
This is a preparation patch to provide isolation of rdma device in a network namespace. As first step, make rdma device visible only in init net namespace. Subsequent patch will enable rdma device visibility back in multiple net namespaces using compat ib_core_device device/sysfs tree. Given that the IB subsystem depends on net stack, it needs to be initialized after netdev and since it support devices, it needs to be initialized before the device subsystem; therefore, change initcall sequence to fs_initcall, so that when ib_core is compiled in the kernel image, the right init sequence is followed. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-03-28RDMA/core: Introduce ib_core_device to hold deviceParav Pandit2-17/+36
In order to support sysfs entries in multiple net namespaces for a rdma device, introduce a ib_core_device whose scope is limited to hold core device and per port sysfs related entries. This is preparation patch so that multiple ib_core_devices in each net namespace can be created in subsequent patch who all can share ib_device. (a) Move sysfs specific fields to ib_core_device. (b) Make sysfs and device life cycle related routines to work on ib_core_device. (c) Introduce and use rdma_init_coredev() helper to initialize coredev fields. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>