path: root/drivers/infiniband/ulp/ipoib
Age | Commit message | Author | Files | Lines
2016-02-04 | IB/IPoIB: Do not set skb truesize since using one linearskb | Carol L Soto | 1 | -2/+0
We are seeing this warning: at net/core/skbuff.c:4174 and before commit a44878d10063 ("IB/ipoib: Use one linear skb in RX flow") skb truesize was not being set when ipoib was using just one skb. Removing this line avoids the warning when running tcp tests like iperf. Fixes: a44878d10063 ("IB/ipoib: Use one linear skb in RX flow") Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 | IB/IPoIB: Fix kernel panic on multicast flow | Erez Shitrit | 3 | -7/+6
ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the parameter mcast->dev. That mcast is a temporary (used as an iterator) variable that may be uninitialized. There is no need to send the variable dev to the function, as each mcast has its dev as a member in the mcast struct. This causes the next panic: RIP: 0010: ipoib_mcast_leave+0x6d/0xf0 [ib_ipoib] RSP: 0018: EFLAGS: 00010246 RAX: f0201 RBX: 24e00 RCX: 00000 .... .... Stack: Call Trace: ipoib_mcast_remove_list+0x3a/0x70 [ib_ipoib] ipoib_mcast_restart_task+0x3bb/0x520 [ib_ipoib] process_one_work+0x164/0x470 worker_thread+0x11d/0x420 ... Fixes: 5a0e81f6f483 ('IB/IPoIB: factor out common multicast list removal code') Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reported-by: Doron Tsur <doront@mellanox.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-23 | IB/IPoIB: Move multicast specific code out of ipoib_main.c | Christoph Lameter | 3 | -14/+23
Code cleanup to move multicast specific code that checks for a sendonly join to ipoib_multicast.c. This allows the removal of the export of __ipoib_mcast_find(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-23 | IB/IPoIB: factor out common multicast list removal code | Christoph Lameter | 3 | -17/+17
Code cleanup to remove multicast specific code from ipoib_main.c The removal of a list of multicast groups occurs in three places. Create a new function ipoib_mcast_remove_list(). Use this new function in ipoib_main.c too. That in turn allows the dropping of two functions that were exported from ipoib_multicast.c for expiration of mc groups. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22 | Merge branches '4.5/Or-cleanup' and '4.5/rdma-cq' into k.o/for-4.5 | Doug Ledford | 1 | -1/+1
Signed-off-by: Doug Ledford <dledford@redhat.com> Conflicts: drivers/infiniband/ulp/iser/iser_verbs.c
2015-12-22 | IB/ulps: Avoid calling ib_query_device | Or Gerlitz | 3 | -42/+12
Instead, use the cached copy of the attributes present on the device. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-11 | IB: add a proper completion queue abstraction | Christoph Hellwig | 1 | -1/+1
This adds an abstraction that allows ULPs to simply pass a completion object and completion callback with each submitted WR and let the RDMA core handle the nitty gritty details of how to handle completion interrupts and poll the CQ. In detail there is a new ib_cqe structure which just contains the completion callback, and which can be used to get at the containing object using container_of. It is pointed to by the WR and WC as an alternative to the wr_id field, similar to how many ULPs already use the field to store a pointer using casts. A driver using the new completion callbacks allocates its CQs using the new ib_create_cq API, which in addition to the number of CQEs and the completion vectors also takes a mode on how we poll for CQEs. Three modes are available: direct for drivers that never take CQ interrupts and just poll for them, softirq to poll from softirq context using the to be renamed blk-iopoll infrastructure which takes care of rearming and budgeting, or a workqueue for consumers who want to be called from user context. Thanks a lot to Sagi Grimberg who helped reviewing the API, wrote the current version of the workqueue code because my two previous attempts sucked too much and converted the iSER initiator to the new API. Signed-off-by: Christoph Hellwig <hch@lst.de>
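The completion-entry pattern described in this commit message (a small structure holding only a callback, embedded in a larger request and recovered with container_of) can be sketched in userspace C. This is a minimal illustration, not the kernel's code: `demo_cqe`, `demo_wc`, `demo_request` and the helper functions are hypothetical stand-ins, and the `container_of` macro below is a simplified userspace equivalent of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_wc;

/* Hypothetical analogue of ib_cqe: it carries only the callback. */
struct demo_cqe {
	void (*done)(struct demo_cqe *cqe, struct demo_wc *wc);
};

/* Hypothetical work completion that points back at the completion entry. */
struct demo_wc {
	struct demo_cqe *wr_cqe;
	int status;
};

/* A ULP request embeds the completion entry next to its own state. */
struct demo_request {
	int tag;
	int completed_status;
	struct demo_cqe cqe;
};

/* Callback: recover the containing request from the embedded entry. */
static void demo_request_done(struct demo_cqe *cqe, struct demo_wc *wc)
{
	struct demo_request *req = container_of(cqe, struct demo_request, cqe);

	req->completed_status = wc->status;
}

/* What a polling core would do for each completion it reaps. */
static void demo_dispatch(struct demo_wc *wc)
{
	assert(wc->wr_cqe != NULL && wc->wr_cqe->done != NULL);
	wc->wr_cqe->done(wc->wr_cqe, wc);
}

/* Completes one request and returns the status its callback observed. */
static int demo_roundtrip(int tag, int status)
{
	struct demo_request req = { .tag = tag, .completed_status = -1 };
	struct demo_wc wc = { .wr_cqe = &req.cqe, .status = status };

	req.cqe.done = demo_request_done;
	demo_dispatch(&wc);
	return req.completed_status;
}
```

The point of the design is that the core never needs to know the request type: it only calls `done`, and each consumer recovers its own object without casting an integer id back to a pointer.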
2015-10-28 | Merge branch 'wr-cleanup' into k.o/for-4.4 | Doug Ledford | 6 | -20/+20
2015-10-21 | IB/core: Add netdev and gid attributes paramteres to cache | Matan Barak | 2 | -2/+2
Adding an ability to query the IB cache by a netdev and get the attributes of a GID. These parameters are necessary in order to successfully resolve the required GID (when the netdevice is known) and get the Ethernet L2 attributes from a GID. Signed-off-by: Matan Barak <matanb@mellanox.com> Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-13 | IB/ipoib: For sendonly join free the multicast group on leave | Christoph Lameter | 3 | -2/+5
When we leave the multicast group on expiration of a neighbor we do not free the mcast structure. This results in a memory leak that causes ib_dealloc_pd to fail and print a WARN_ON message and backtrace. Fixes: bd99b2e05c4d (IB/ipoib: Expire sendonly multicast joins) Signed-off-by: Christoph Lameter <cl@linux.com> Tested-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-08 | IB: split struct ib_send_wr | Christoph Hellwig | 6 | -20/+20
This patch splits up struct ib_send_wr so that all non-trivial verbs use their own structure which embeds struct ib_send_wr. This dramatically shrinks the size of a WR for most common operations:
sizeof(struct ib_send_wr) (old): 96
sizeof(struct ib_send_wr): 48
sizeof(struct ib_rdma_wr): 64
sizeof(struct ib_atomic_wr): 96
sizeof(struct ib_ud_wr): 88
sizeof(struct ib_fast_reg_wr): 88
sizeof(struct ib_bind_mw_wr): 96
sizeof(struct ib_sig_handover_wr): 80
And with Sagi's pending MR rework the fast registration WR will also be down to a reasonable size:
sizeof(struct ib_fastreg_wr): 64
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt] Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc] Tested-by: Haggai Eran <haggaie@mellanox.com> Tested-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Steve Wise <swise@opengridcomputing.com>
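The embedding idea in this commit message, a small base work request wrapped by verb-specific structures, might look like the following in plain C. It is only a sketch: the `demo_*` types, fields, and helpers are hypothetical, and their sizes are not the kernel sizes quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_opcode { DEMO_OP_SEND, DEMO_OP_RDMA_WRITE };

/* Fields that every work request needs (hypothetical layout). */
struct demo_send_wr {
	struct demo_send_wr *next;
	uint64_t wr_id;
	enum demo_opcode opcode;
};

/* RDMA-only fields live in a wrapper that embeds the base WR. */
struct demo_rdma_wr {
	struct demo_send_wr wr;
	uint64_t remote_addr;
	uint32_t rkey;
};

/* Recover the wrapper from a pointer to its embedded base WR. */
static struct demo_rdma_wr *demo_to_rdma_wr(struct demo_send_wr *wr)
{
	assert(wr->opcode == DEMO_OP_RDMA_WRITE);
	return (struct demo_rdma_wr *)((char *)wr -
				       offsetof(struct demo_rdma_wr, wr));
}

/* A provider receives only the base pointer and unwraps it on demand. */
static uint64_t demo_post(struct demo_send_wr *wr)
{
	if (wr->opcode == DEMO_OP_RDMA_WRITE)
		return demo_to_rdma_wr(wr)->remote_addr;
	return 0;
}

/* Builds an RDMA write request and posts it through the base pointer. */
static uint64_t demo_post_rdma_write(uint64_t remote_addr)
{
	struct demo_rdma_wr rdma = {
		.wr = { .next = NULL, .wr_id = 1, .opcode = DEMO_OP_RDMA_WRITE },
		.remote_addr = remote_addr,
		.rkey = 0x1234,
	};

	return demo_post(&rdma.wr);
}
```

A plain send only pays for `struct demo_send_wr`; the extra fields are allocated only by callers that actually issue the verb needing them.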
2015-09-25 | IB/ipoib: increase the max mcast backlog queue | Doug Ledford | 1 | -1/+1
When performing sendonly joins, we queue the packets that trigger the join until the join completes. This may take on the order of hundreds of milliseconds. It is easy to have many more than three packets come in during that time. Expand the maximum queue depth in order to try and prevent dropped packets during the time it takes to join the multicast group. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 | IB/ipoib: Make sendonly multicast joins create the mcast group | Doug Ledford | 1 | -10/+12
Since IPoIB should, as much as possible, emulate how multicast sends work on Ethernet for regular TCP/IP apps, there should be no requirement to subscribe to a multicast group before your sends are properly sent. However, due to the difference in how multicast is handled on InfiniBand, we must join the appropriate multicast group before we can send to it. Previously we tried not to trigger the auto-create feature of the subnet manager when doing this because we didn't have tracking of these sendonly groups and the auto-creation might never get undone. The previous patch added timing to these sendonly joins and allows us to leave them after a reasonable idle expiration time. So supply all of the information needed to auto-create group. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 | IB/ipoib: Expire sendonly multicast joins | Christoph Lameter | 3 | -2/+22
On neighbor expiration, check to see if the neighbor was actually a sendonly multicast join, and if so, leave the multicast group as we expire the neighbor. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-03 | IB/ipoib: Suppress warning for send only join failures | Jason Gunthorpe | 1 | -2/+10
We expect send only joins to fail, it just means there are no listeners for the group. The correct thing to do is silently drop the packet at source. Eg avahi will full join 224.0.0.251 which causes a send only IGMP packet to 224.0.0.22, and then a warning level kmessage like this: ib0: sendonly multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 If there is no IP router listening to IGMP. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-03 | IB/ipoib: Clean up send-only multicast joins | Doug Ledford | 1 | -11/+27
Even though we don't expect the group to be created by the SM we still need to provide all the parameters to force the SM to validate they are correct. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | IB/core: Make ib_dealloc_pd return void | Jason Gunthorpe | 1 | -3/+1
The majority of callers never check the return value, and even if they did, they can't do anything about a failure. All possible failure cases represent a bug in the caller, so just WARN_ON inside the function instead. This fixes a few random errors: net/rds/iw.c infinite loops while it fails. (racing with EBUSY?) This also lays the groundwork to get rid of error return from the drivers. Most drivers do not error, the few that do are broken since it cannot be handled. Since uverbs can legitimately make use of EBUSY, open code the check. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | IB/ipoib: Remove ib_get_dma_mr calls | Jason Gunthorpe | 3 | -17/+4
The pd now has a local_dma_lkey member which completely replaces ib_get_dma_mr, use it instead. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | IB/cm: Remove compare_data checks | Haggai Eran | 1 | -1/+1
Now that there are no ib_cm clients using the compare_data feature for matching IB CM requests' private data, remove the compare_data parameter of ib_cm_listen and remove the code implementing the feature. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | IB/ipoib: Return IPoIB devices matching connection parameters | Guy Shapiro | 1 | -1/+228
Implement the get_net_device_by_port_pkey_ip callback that returns network device to ib_core according to connection parameters. Check the ipoib device and iterate over all child devices to look for a match. For each IPoIB device we iterate through all upper devices when searching for a matching IP, in order to support bonding. Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | IB/core: lock client data with lists_rwsem | Haggai Eran | 1 | -4/+3
An ib_client callback that is called with the lists_rwsem locked only for read is protected from changes to the IB client lists, but not from ib_unregister_device() freeing its client data. This is because ib_unregister_device() will remove the device from the device list with lists_rwsem locked for write, but perform the rest of the cleanup, including the call to remove() without that lock. Mark client data that is undergoing de-registration with a new going_down flag in the client data context. Lock the client data list with lists_rwsem for write in addition to using the spinlock, so that functions calling the callback would be able to lock only lists_rwsem for read and let callbacks sleep. Since ib_unregister_client() now marks the client data context, no need for remove() to search the context again, so pass the client data directly to remove() callbacks. Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-24 | IB/ipoib: Fix CONFIG_INFINIBAND_IPOIB_CM | Jason Gunthorpe | 1 | -1/+2
If the above is turned off then ipoib_cm_dev_init unconditionally returns ENOSYS, and the newly added error handling in 0b3957 prevents ipoib from coming up at all: kernel: mlx4_0: ipoib_transport_dev_init failed kernel: mlx4_0: failed to initialize port 1 (ret = -12) Fixes: 0b39578bcde4 (IB/ipoib: Use dedicated workqueues per interface) Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 | IB/ipoib: Set MTU to max allowed by mode when mode changes | Erez Shitrit | 1 | -0/+1
When switching between modes (datagram / connected), change the MTU accordingly: datagram mode allows up to 4K, connected mode up to (64K - 0x10). Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 | IB/ipoib: Scatter-Gather support in connected mode | Yuval Shaia | 4 | -46/+54
By default, the IPoIB-CM driver uses a 64k MTU. A larger MTU gives better performance. This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5), which have to be contiguous. When system memory is under pressure, it was observed that allocating 128k of contiguous physical memory is difficult and causes serious errors (such as the system becoming unusable). This enhancement resolves the issue by removing the physically contiguous memory requirement, using the Scatter/Gather feature that exists in the Linux stack. With this fix Scatter-Gather will also be supported in connected mode. This change reverts some of the changes made in commit e112373fd6aa ("IPoIB/cm: Reduce connected mode TX object size"). The ability to use SG in IPoIB CM is possible because the coupling between NETIF_F_SG and NETIF_F_CSUM was removed in commit ec5f06156423 ("net: Kill link between CSUM and SG features.") Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Christian Marie <christian@ponies.io> Signed-off-by: Doug Ledford <dledford@redhat.com>
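The "order 5" figure in this message follows from power-of-two page allocation: a 64k buffer fits exactly in 16 pages (order 4), so any per-packet overhead pushes the request to the next power of two, 32 pages or 128k. The helper below is a userspace sketch of that rounding; `demo_get_order` and `DEMO_PAGE_SIZE` are hypothetical names, and the 4k page size and the 64-byte overhead used in the usage note are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u /* assumed 4k pages, as in the commit message */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
static unsigned int demo_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

For example, `demo_get_order(65536)` yields 4, while `demo_get_order(65536 + 64)` yields 5, which is the 32-page contiguous allocation that scatter/gather avoids.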
2015-07-14 | IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush | Haggai Eran | 1 | -6/+7
__ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep complains about locking vlan_rwsem twice (see below). Use down_read_nested instead of down_read to prevent the warning.
=============================================
[ INFO: possible recursive locking detected ]
4.1.0-rc4+ #36 Tainted: G O
---------------------------------------------
kworker/u20:2/261 is trying to acquire lock:
 (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
but task is already holding lock:
 (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
other info that might help us debug this:
 Possible unsafe locking scenario:
 CPU0
 ----
 lock(&priv->vlan_rwsem);
 lock(&priv->vlan_rwsem);
 *** DEADLOCK ***
 May be due to missing lock nesting notation
3 locks held by kworker/u20:2/261:
 #0: ("%s""ipoib_flush"){.+.+..}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
 #1: ((&priv->flush_heavy)){+.+...}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
 #2: (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
stack backtrace:
CPU: 3 PID: 261 Comm: kworker/u20:2 Tainted: G O 4.1.0-rc4+ #36
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Workqueue: ipoib_flush ipoib_ib_dev_flush_heavy [ib_ipoib]
 ffff8801c6c54790 ffff8801c9927af8 ffffffff81665238 0000000000000001
 ffffffff825b5b30 ffff8801c9927bd8 ffffffff810bba51 ffff880100000000
 ffffffff00000001 ffff880100000001 ffff8801c6c55428 ffff8801c6c54790
Call Trace:
 [<ffffffff81665238>] dump_stack+0x4f/0x6f
 [<ffffffff810bba51>] __lock_acquire+0x741/0x1820
 [<ffffffff810bcbf8>] lock_acquire+0xc8/0x240
 [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
 [<ffffffff81669d2c>] down_read+0x4c/0x70
 [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
 [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
 [<ffffffffa0791e4a>] __ipoib_ib_dev_flush+0x5a/0x2b0 [ib_ipoib]
 [<ffffffffa07920ba>] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib]
 [<ffffffff81082871>] process_one_work+0x201/0x760
 [<ffffffff810827cc>] ? process_one_work+0x15c/0x760
 [<ffffffff81082ef0>] worker_thread+0x120/0x4d0
 [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
 [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
 [<ffffffff81088b7e>] kthread+0xfe/0x120
 [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8166c6e2>] ret_from_fork+0x42/0x70
 [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 | IB/IPoIB: Fix bad error flow in ipoib_add_port() | Amir Vadai | 1 | -2/+4
Error values of ib_query_port() and ib_query_device() weren't propagated correctly. Because of that, ipoib_add_port() could return NULL value, which escaped the IS_ERR() check in ipoib_add_one() and we crashed. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
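The failure mode in this message, a NULL return slipping past an IS_ERR() check, can be reproduced in userspace with simplified stand-ins for the kernel's error-pointer helpers. Everything named `demo_*` below is hypothetical, and the way the buggy variant loses the error code (returning an error pointer built from a result variable that was never assigned) is an assumed illustration, not a copy of the original driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Simplified userspace stand-ins for ERR_PTR() and IS_ERR(). */
static void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_object; /* stands in for a successfully created device */

/* Assumed buggy flow: the query error is detected but never stored. */
static void *demo_add_port_buggy(int query_ret)
{
	int result = 0;

	if (query_ret)
		return demo_err_ptr(result); /* an error pointer of 0 is NULL */
	return &demo_object;
}

/* Fixed flow: the query error is propagated to the caller. */
static void *demo_add_port_fixed(int query_ret)
{
	if (query_ret)
		return demo_err_ptr(query_ret);
	return &demo_object;
}

/* A caller that only checks for error pointers; 1 means "would use it". */
static int demo_caller_would_use(const void *port)
{
	return !demo_is_err(port);
}
```

With the buggy variant, a failed query yields NULL, the caller's check passes, and the next dereference crashes; with the fixed variant the same failure is caught by the check.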
2015-07-14 | IB: Add rdma_cap_ib_switch helper and use where appropriate | Hal Rosenstock | 1 | -10/+2
Pursuant to Liran's comments on node_type on linux-rdma mailing list: In an effort to reform the RDMA core and ULPs to minimize use of node_type in struct ib_device, an additional bit is added to struct ib_device for is_switch (IB switch). This must be initialized by any IB switch device driver. This is a NEW requirement on such device drivers which are all "out of tree". In addition, an ib_switch helper was added to ib_verbs.h based on the is_switch device bit rather than node_type (although those should be consistent). The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs) as well as (IPoIB and SRP) ULPs are updated where appropriate to use this new helper. In some cases, the helper is now used under the covers of using rdma_[start end]_port rather than the open coding previously used. Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Hal Rosenstock <hal@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-06-12 | IB/core: Change ib_create_cq to use struct ib_cq_init_attr | Matan Barak | 1 | -2/+6
Currently, ib_create_cq uses cqe and comp_vector instead of the extensible ib_cq_init_attr struct. Earlier patches already changed the vendors to work with ib_cq_init_attr. This patch changes the consumers too. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-06-02 | Merge branch 'for-4.2-misc' into k.o/for-4.2 | Doug Ledford | 1 | -2/+2
2015-06-02 | IB/ipoib: Fix RCU annotations in ipoib_neigh_hash_init() | Bart Van Assche | 1 | -2/+2
Avoid that sparse complains about ipoib_neigh_hash_init(). This patch does not change any functionality. See also patch "IPoIB: Fix memory leak in the neigh table deletion flow" (commit ID 66172c09938b). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18 | IB/Verbs: Reform IB-ulp ipoib | Michael Wang | 1 | -7/+8
Use raw management helpers to reform IB-ulp ipoib. Signed-off-by: Michael Wang <yun.wang@profitbricks.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-05 | IPoIB/CM: Fix indentation level | Bart Van Assche | 1 | -2/+2
See also patch "IPoIB/cm: Add connected mode support for devices without SRQs" (commit ID 68e995a29572). Detected by smatch. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-22 | Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband | Linus Torvalds | 6 | -389/+492
Pull InfiniBand/RDMA updates from Roland Dreier:
 - IPoIB fixes from Doug Ledford and Erez Shitrit
 - iSER updates from Sagi Grimberg
 - mlx4 GUID handling changes from Yishai Hadas
 - other misc fixes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (51 commits)
 mlx5: wrong page mask if CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for 32Bit architectures
 IB/iser: Rewrite bounce buffer code path
 IB/iser: Bump version to 1.6
 IB/iser: Remove code duplication for a single DMA entry
 IB/iser: Pass struct iser_mem_reg to iser_fast_reg_mr and iser_reg_sig_mr
 IB/iser: Modify struct iser_mem_reg members
 IB/iser: Make fastreg pool cache friendly
 IB/iser: Move PI context alloc/free to routines
 IB/iser: Move fastreg descriptor pool get/put to helper functions
 IB/iser: Merge build page-vec into register page-vec
 IB/iser: Get rid of struct iser_rdma_regd
 IB/iser: Remove redundant assignments in iser_reg_page_vec
 IB/iser: Move memory reg/dereg routines to iser_memory.c
 IB/iser: Don't pass ib_device to fall_to_bounce_buff routine
 IB/iser: Remove a redundant struct iser_data_buf
 IB/iser: Remove redundant cmd_data_len calculation
 IB/iser: Fix wrong calculation of protection buffer length
 IB/iser: Handle fastreg/local_inv completion errors
 IB/iser: Fix unload during ep_poll wrong dereference
 ib_srpt: convert printk's to pr_* functions
 ...
2015-04-17 | IB/ipoib: Fix ndo_get_iflink | Erez Shitrit | 2 | -2/+6
Currently, the iflink of the parent interface was always accessed, even when the interface didn't have a parent, and hence we crashed there. Handle the interface types properly: for a child interface, return the ifindex of the parent; for a parent interface, return its own ifindex. For child devices, make sure to set the parent pointer prior to invoking register_netdevice(); this allows the new ndo to be called by the stack immediately after the child device is registered. Fixes: 5aa7add8f14b ('infiniband/ipoib: implement ndo_get_iflink') Reported-by: Honggang Li <honli@redhat.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Honggang Li <honli@redhat.com> Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
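The rule stated in this message (a child reports its parent's ifindex, a parent reports its own) is small enough to sketch directly. The structure and function names below are hypothetical and only model the decision; they are not the driver's private data layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of an interface for this illustration. */
struct demo_netdev {
	int ifindex;
	const struct demo_netdev *parent; /* NULL for a parent interface */
};

/* Never dereference the parent pointer unless the device has one. */
static int demo_get_iflink(const struct demo_netdev *dev)
{
	assert(dev != NULL);
	if (dev->parent)
		return dev->parent->ifindex;
	return dev->ifindex;
}

/* Builds a parent/child pair and reports the child's iflink. */
static int demo_child_iflink(int parent_ifindex, int child_ifindex)
{
	struct demo_netdev parent = { .ifindex = parent_ifindex, .parent = NULL };
	struct demo_netdev child = { .ifindex = child_ifindex, .parent = &parent };

	return demo_get_iflink(&child);
}

/* Reports the iflink of an interface that has no parent. */
static int demo_parent_iflink(int parent_ifindex)
{
	struct demo_netdev parent = { .ifindex = parent_ifindex, .parent = NULL };

	return demo_get_iflink(&parent);
}
```

The ordering requirement in the message maps onto the `parent` field here: it has to be set before anything can call `demo_get_iflink()` on the child.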
2015-04-15 | IB/ipoib: Remove IPOIB_MCAST_RUN bit | Erez Shitrit | 2 | -5/+2
After Doug Ledford's changes there is no need for that bit; its semantics become a subset of the IPOIB_FLAG_OPER_UP bit. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: Save only IPOIB_MAX_PATH_REC_QUEUE skb's | Erez Shitrit | 1 | -3/+10
Whenever there is no path->ah to the destination, keep only a defined number of skb's. Otherwise there are cases in which the driver can keep an unbounded list of skb's. For example, when one device wants to send unicast arp to the destination, and for some reason the SM doesn't respond, the driver currently keeps all the skb's. If that unicast arp traffic stops, all these skb's are kept by the path object until the interface goes down. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
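Capping a pending-packet queue, as this message describes, amounts to checking the current length before enqueueing and dropping otherwise. The following is a userspace sketch under stated assumptions: `DEMO_MAX_QUEUE` is an illustrative limit rather than the driver's constant, packets are modeled as plain integers, and the choice to drop the newly arriving packet (rather than the oldest) is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_QUEUE 3 /* illustrative limit, not the driver's constant */

/* Hypothetical per-path queue of packets waiting for path resolution. */
struct demo_path_queue {
	int packets[DEMO_MAX_QUEUE];
	size_t len;
	unsigned long dropped;
};

/* Returns 1 if the packet was queued, 0 if it was dropped. */
static int demo_queue_packet(struct demo_path_queue *q, int packet)
{
	assert(q != NULL);
	if (q->len >= DEMO_MAX_QUEUE) {
		q->dropped++; /* queue is full: count the drop and bail out */
		return 0;
	}
	q->packets[q->len++] = packet;
	return 1;
}

/* Sends a burst while the path is unresolved; returns the drop count. */
static unsigned long demo_burst(int count)
{
	struct demo_path_queue q = { .len = 0, .dropped = 0 };

	for (int i = 0; i < count; i++)
		demo_queue_packet(&q, i);
	return q.dropped;
}
```

However long the path stays unresolved, the memory held by the queue is bounded by the limit, and the excess shows up only as a drop counter.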
2015-04-15 | IB/ipoib: Handle QP in SQE state | Erez Shitrit | 2 | -1/+63
As the result of a completion error, the QP can be moved to the SQE state by the hardware. Since that is not the Error state, there are no flushes, and hence the driver doesn't know about it. The fix creates a task that, after a completion with an error which is not a flush, checks the QP state and, if it is in the SQE state, moves it back to RTS. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: Update broadcast record values after each successful join request | Erez Shitrit | 1 | -1/+17
Update the cached broadcast record in the priv object after every new join of this broadcast domain group. These values are needed for the port configuration (MTU size) and to all the new multicast (non-broadcast) join requests initial parameters. For example, SM starts with 2K MTU for all the fabric, and after that it restarts (or handover to new SM) with new port configuration of 4K MTU. Without using the new values, the driver will keep its old configuration of 2K and will not apply the new configuration of 4K. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: Use one linear skb in RX flow | Erez Shitrit | 3 | -72/+13
The current code in the RX flow uses two sg entries for each incoming packet: the first for the IB headers and the second for the rest of the data. That causes two dma map/unmap operations, two allocations, and a few more actions in the data path. Use only one linear skb for each incoming packet, holding both the IB headers and the payload. That reduces the packet processing in the data path (only one skb, no frags, the first frag was not used anyway, fewer memory allocations) and the dma handling (only one dma map/unmap per incoming packet instead of two). After commit 73d3fe6d1c6d ("gro: fix aggregation for skb using frag_list") from Eric Dumazet, we will get full aggregation for large packets. When running bandwidth tests before and after the change (over the card's numa node), using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results are ~12Gbs before and ~16Gbs after on my setup (Mellanox's ConnectX3). Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: drop mcast_mutex usage | Doug Ledford | 1 | -38/+32
We needed the mcast_mutex when we had to prevent the join completion callback from having the value it stored in mcast->mc overwritten by a delayed return from ib_sa_join_multicast. By storing the return of ib_sa_join_multicast in an intermediate variable, we prevent a delayed return from ib_sa_join_multicast overwriting the valid contents of mcast->mc, and we no longer need a mutex to force the join callback to run after the return of ib_sa_join_multicast. This allows us to do away with the mutex entirely and protect our critical sections with a just a spinlock instead. This is highly desirable as there were some places where we couldn't use a mutex because the code was not allowed to sleep, and so we were currently using a mix of mutex and spinlock to protect what we needed to protect. Now we only have a spin lock and the locking complexity is greatly reduced. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: deserialize multicast joins | Doug Ledford | 1 | -168/+82
Allow the ipoib layer to attempt to join all outstanding multicast groups at once. The ib_sa layer will serialize multiple attempts to join the same group, but will process attempts to join different groups in parallel. Take advantage of that. In order to make this happen, change the mcast_join_thread to loop through all needed joins, sending a join request for each one that we still need to join. There are a few special cases we handle though:
1) Don't attempt to join anything but the broadcast group until the join of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling. If we completed successfully, we are done. The join task now needs to be kicked either by mcast_send or mcast_restart_task or mcast_start_thread, but should not need to be started anytime else except when scheduling a backoff attempt to rejoin.
3) No longer use separate join/completion routines for regular and sendonly joins, pass them all through the same routine and just do the right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and quit trying. We leave the mcast group in the list so that if we get a new packet, all that we have to do is queue up the packet and restart the join task and it will automatically try to join twice and then either send or flush the queue again.
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15 | IB/ipoib: fix MCAST_FLAG_BUSY usage | Doug Ledford | 2 | -128/+238
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. And when we did complete it, sometimes we continued to touch the mcast entry after the completion, opening us up to possible use after free issues. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit.
1) Remove the MCAST_JOIN_STARTED flag entirely
2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task
8) Make MCAST_RUN mean that we have a working mcast subsystem, not that we have a running task. We know when we need to reschedule our join task thread and don't need a flag to tell us.
9) Add a helper for rescheduling the join task thread
A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up.
One race looks something like this:
Thread 1 Thread 2 ib_sa_join_multicast (as part of running restart mcast task) alloc member call callback ifconfig ib0 down wait_for_completion callback call completes wait_for_completion in mcast_dev_flush completes mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast return from callback return from ib_sa_join_multicast set mcast->mc = return from ib_sa_multicast
We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c
Another like this:
Thread 1 Thread 2 Thread 3 ib_sa_multicast_join ifconfig ib0 down priv->broadcast = NULL join_complete wait_for_completion mcast->mc is not yet set, so don't clear return from ib_sa_join_multicast and set mcast->mc complete return -EAGAIN (making mcast->mc invalid) call ib_sa_multicast_leave on invalid mcast->mc, hang forever
By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns.
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: No longer use flush as a parameterDoug Ledford4-30/+39
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Use dedicated workqueues per interfaceDoug Ledford6-38/+66
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue means that we would also have to prevent all races between different devices (for instance, ipoib_mcast_restart_task on interface ib0 can race with ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on the rtnl_lock).

There are several possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock.

Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor.

Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues.

Signed-off-by: Doug Ledford <dledford@redhat.com>
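The effect of giving each interface its own queue can be illustrated with a toy model in plain C. This is a sketch under stated assumptions and does not use the kernel workqueue API: workqueue_sketch, queue_work_sketch(), flush_workqueue_sketch(), ipoib_dev_sketch and dev_down_sketch() are hypothetical names, and work runs synchronously at flush time purely to show which items a flush touches.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

typedef void (*work_fn)(void *);

struct work_item {
    work_fn fn;
    void *arg;
};

/* Toy queue: pending work items owned by exactly one interface. */
struct workqueue_sketch {
    struct work_item items[MAX_WORK];
    size_t count;
};

/* Hypothetical per-interface state with its own dedicated queue. */
struct ipoib_dev_sketch {
    const char *name;
    struct workqueue_sketch wq;
    int tasks_run;
};

/* Queue one item; returns 0 on success, -1 if the queue is full. */
static int queue_work_sketch(struct workqueue_sketch *wq, work_fn fn, void *arg)
{
    if (wq->count == MAX_WORK)
        return -1;
    wq->items[wq->count].fn = fn;
    wq->items[wq->count].arg = arg;
    wq->count++;
    return 0;
}

/* Run and drop everything pending on this queue only. */
static size_t flush_workqueue_sketch(struct workqueue_sketch *wq)
{
    size_t ran = 0;

    for (size_t i = 0; i < wq->count; i++) {
        wq->items[i].fn(wq->items[i].arg);
        ran++;
    }
    wq->count = 0;
    return ran;
}

/* Example work function: counts how often it ran for a device. */
static void count_task(void *arg)
{
    struct ipoib_dev_sketch *dev = arg;

    dev->tasks_run++;
}

/* Taking one interface down flushes only that interface's queue. */
static void dev_down_sketch(struct ipoib_dev_sketch *dev)
{
    flush_workqueue_sketch(&dev->wq);
}
```

With one queue per device, taking ib0.8002 down in this model never waits on work queued for ib0, which is the property the patch relies on to avoid cross-device deadlocks.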
2015-04-15IB/ipoib: Make the carrier_on_task race awareDoug Ledford1-8/+17
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_OPER_UP has been cleared, in which case we don't need to do anything anyway and we just return. While investigating which flag should be used, FLAG_ADMIN_UP or FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more appropriate flag to use. However, there was a mix of these two flags in use in the existing code. So while we check for that flag here as part of this race fix, also clean up the two places that had used the less appropriate flag for their tests. Signed-off-by: Doug Ledford <dledford@redhat.com>
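The trylock loop described above can be sketched in userspace C. This is an analogue under stated assumptions, not the driver code: big_lock stands in for the rtnl lock, oper_up for FLAG_OPER_UP, and carrier_on_task_sketch() and carrier_on_calls are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins: big_lock for the rtnl lock, oper_up for FLAG_OPER_UP. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool oper_up = true;
static int carrier_on_calls; /* only modified while big_lock is held */

/*
 * Returns true if the work ran under the lock, false if the task gave up
 * because the interface was no longer operationally up.
 */
static bool carrier_on_task_sketch(void)
{
    /* Never block on the lock: a flusher may hold it while waiting for us. */
    while (pthread_mutex_trylock(&big_lock) != 0) {
        if (!atomic_load(&oper_up))
            return false;   /* interface is going down, nothing to do */
        sched_yield();      /* back off briefly, then retry */
    }

    carrier_on_calls++;     /* the real work would happen here */
    pthread_mutex_unlock(&big_lock);
    return true;
}
```

The point of the pattern is that the task either gets the lock or notices the flag was cleared and returns, so a flush that holds the lock can always make progress.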
2015-04-15IB/ipoib: Consolidate rtnl_lock tasks in workqueueDoug Ledford1-6/+2
The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and needs to keep it held. It also needs to call flush_workqueue() to flush out any outstanding work. In the past, we've had to try and make sure that we didn't flush out any outstanding join completions because they also wanted to grab rtnl_lock() and that would deadlock. It turns out that the only thing in the join completion handler that needs this lock can be safely moved to our carrier_on_task, thereby reducing the potential for the join completion code and the flush code to deadlock against each other. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: change init sequence orderingDoug Ledford1-7/+17
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: factor out ah flushingDoug Ledford1-18/+28
Create ipoib_flush_ah and ipoib_stop_ah routines to use at appropriate times to flush out all remaining ah entries before we shut the device down. Because neighbors and mcast entries can each have a reference on any given ah, we must make sure to free all of those first before our ah will actually have a 0 refcount and be able to be reaped. This factoring is needed in preparation for having per-device work queues. The original per-device workqueue code resulted in the following error message: <ibdev>: ib_dealloc_pd failed That error was tracked down to this issue. With the changes to which workqueues were flushed when, there were no flushes of the per device workqueue after the last ah's were freed, resulting in an attempt to dealloc the pd with outstanding resources still allocated. This code puts the explicit flushes in the needed places to avoid that problem. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-02infiniband/ipoib: implement ndo_get_iflinkNicolas Dichtel2-1/+8
Don't use dev->iflink anymore. CC: Roland Dreier <roland@kernel.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30Revert "IPoIB: Consolidate rtnl_lock tasks in workqueue"Roland Dreier1-2/+6
This reverts commit afe1de664ef3cb756e70938d99417dcbc6b1379a. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>