aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/mellanox/mlx5/core/health.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-08-08mlx5: convert to generic pci_alloc_irq_vectorsSagi Grimberg1-1/+1
Now that we have a generic code to allocate an array of irq vectors and even correctly spread their affinity, correctly handle cpu hotplug events and more, were much better off using it. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-07-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+14
Some overlapping changes in the mlx5 driver. A merge conflict resolution posted by Stephen Rothwell was used as a guide. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27net/mlx5: Cancel delayed recovery work when unloading the driverMohamad Haj Yahia1-1/+14
Draining the health workqueue will ignore future health works including the one that report hardware failure and thus we can't enter error state Instead cancel the recovery flow and make sure only recovery flow won't be scheduled. Fixes: 5e44fca50470 ('net/mlx5: Only cancel recovery work when cleaning up device') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-06-16net/mlx5: Add fast unload support in shutdown flowMajd Dibbiny1-2/+2
Adding a support to flush all HW resources with one FW command and skip all the heavy unload flows of the driver on kernel shutdown. There's no need to free all the SW context since a new fresh kernel will be loaded afterwards. Regarding the FW resources, they should be closed, otherwise we will have leakage in the FW. To accelerate this flow, we execute one command in the beginning that tells the FW that the driver isn't going to close any of the FW resources and asks the FW to clean up everything. Once the commands complete, it's safe to close the PCI resources and finish the routine. Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-6/+5
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-11net/mlx5: Continue health polling until it is explicitly stoppedMohamad Haj Yahia1-6/+5
The issue is that when we get an assert we will stop polling the health and thus we cant enter error state when we have a real health issue. Fixes: fd76ee4da55a ('net/mlx5_core: Fix internal error detection conditions') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-23net/mlx5: Avoid using pending command interface slotsMohamad Haj Yahia1-1/+1
Currently when firmware command gets stuck or it takes long time to complete, the driver command will get timeout and the command slot is freed and can be used for new commands, and if the firmware receive new command on the old busy slot its behavior is unexpected and this could be harmful. To fix this when the driver command gets timeout we return failure, but we don't free the command slot and we wait for the firmware to explicitly respond to that command. Once all the entries are busy we will stop processing new firmware commands. Fixes: 9cba4ebcf374 ('net/mlx5: Fix potential deadlock in command mode change') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-05-14net/mlx5: Introduce trigger_health_work functionIlan Tayari1-11/+21
Introduce new function for entering bad-health state. This function will be called from FPGA-related logic in a later patch from asynchronous event (IRQ) context, for that we change the spin lock to an IRQ-safe one. Signed-off-by: Ilan Tayari <ilant@mellanox.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-19net/mlx5: Fix version printout in case of health issueEli Cohen1-17/+3
Firmware representation of the firmware version on the health buffer has changed for newer device. The representation in the initialization segment does not and will not change. In addition, we print the health buffer firmware version as a raw hex number. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2016-10-29net/mlx5: PCI error recovery health care simulationMohamad Haj Yahia1-4/+41
In case that the kernel PCI error handlers are not called, we will trigger our own recovery flow. The health work will give priority to the kernel pci error handlers to recover the PCI by waiting for a small period, if the pci error handlers are not triggered the manual recovery flow will be executed. We don't save pci state in case of manual recovery because it will ruin the pci configuration space and we will lose dma sync. Fixes: 89d44f0a6c73 ('net/mlx5_core: Add pci error handlers to mlx5_core driver') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29net/mlx5: Fix race between PCI error handlers and health workMohamad Haj Yahia1-3/+27
Currently there is a race between the health care work and the kernel pci error handlers because both of them detect the error, the first one to be called will do the error handling. There is a chance that health care will disable the pci after resuming pci slot. Also create a separate WQ because now we will have two types of health works, one for the error detection and one for the recovery. Fixes: 89d44f0a6c73 ('net/mlx5_core: Add pci error handlers to mlx5_core driver') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29net/mlx5: Clear health sick bit when starting health pollMohamad Haj Yahia1-0/+1
The health sick status should be cleared when we start the health poll. This is crucial for driver reload (unload + load) in order to behave right in case of health issue. Fixes: fd76ee4da55a ('net/mlx5_core: Fix internal error detection conditions') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26net/mlx5_core/health: Remove deprecated create_singlethread_workqueueBhaktipriya Shridhar1-5/+2
The workqueue health->wq was used as per device private health thread. This was done to perform delayed work. The workqueue has a single workitem(&health->work) and hence doesn't require ordering. It is involved in handling the health of the device and is not being used on a memory reclaim path. Hence, the singlethreaded workqueue has been replaced with the use of system_wq. Work item has been flushed in mlx5_health_cleanup() to ensure that there are no pending tasks while disconnecting the driver. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net/mlx5: Avoid calling sleeping function by the health poll threadMohamad Haj Yahia1-3/+8
In internal error state the health poll thread will eventually call synchronize_irq() (to safely trigger command completions) which might sleep, so we are calling sleeping function from atomic context which is invalid. Here we move trigger_cmd_completions(dev) to enter error state which is the earliest stage in error state handling. This way we won't need to wait for next health poll to trigger command completions and will solve the scheduling while atomic issue. mlx5_enter_error_state can be called from two contexts, protect it with dev->intf_state_lock Fixes: 89d44f0a6c73 ('net/mlx5_core: Add pci error handlers to mlx5_core driver') Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28treewide: Fix typos in printkMasanari Iida1-2/+2
This patch fix spelling typos in printk from various part of the codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-10-14net/mlx5_core: Add pci error handlers to mlx5_core driverMajd Dibbiny1-0/+72
This patch implement the pci_error_handlers for mlx5_core which allow the driver to recover from PCI error. Once an error is detected in the PCI, the mlx5_pci_err_detected is called and it: 1) Marks the device to be in 'Internal Error' state. 2) Dispatches an event to the mlx5_ib to flush all the outstanding cqes with error. 3) Returns all the on going commands with error. 4) Unloads the driver. Afterwards, the FW is reset and mlx5_pci_slot_reset is called and it enables the device and restore it's pci state. If the later succeeds, mlx5_pci_resume is called, and it loads the SW stack. Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14net/mlx5_core: Fix internal error detection conditionsEli Cohen1-7/+44
The detection of a fatal condition has been updated to take into account the state reported by the device or by detecting an all ones read of the firmware version which indicates that the device is not accessible. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/mlx5_core: Use private health thread for each deviceEli Cohen1-34/+29
Use a single threaded work queue for each device in the system instead of using one thread for any device. This is required so we can concurrently process system error handling for all the devices that need that. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09net/mlx5_core: Use accessor functions to read from device memoryEli Cohen1-13/+22
Use ioread function to read health buffer data. In addition, print the firmware version as a string for readability and also use dev_err to have the device string to be printed. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28net/mlx5_core: Update health syndromesEli Cohen1-0/+9
Update new health monitored syndromes and their descriptions. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28net/mlx5_core: Fix wrong name in structEli Cohen1-1/+1
The name refers to syndrome so uset ext_synd instread of ext_sync. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02net/mlx5_core: Fix Mellanox copyright noteSaeed Mahameed1-1/+1
Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05mlx5: remove health handler pluginEli Cohen1-28/+1
Remove this code, per Dave Miller's request, since it is not being used anywhere in the kernel. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-08mlx5_core: Fixes for sparse warningsRoland Dreier1-9/+19
- use be32_to_cpu() instead of cpu_to_be32() where appropriate. - use proper accessors for pointers marked __iomem. Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-08mlx5: Add driver for Mellanox Connect-IB adaptersEli Cohen1-0/+217
The driver is comprised of two kernel modules: mlx5_ib and mlx5_core. This partitioning resembles what we have for mlx4, except that mlx5_ib is the pci device driver and not mlx5_core. mlx5_core is essentially a library that provides general functionality that is intended to be used by other Mellanox devices that will be introduced in the future. mlx5_ib has a similar role as any hardware device under drivers/infiniband/hw. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> [ Merge in coccinelle fixes from Fengguang Wu <fengguang.wu@intel.com>. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>