| author | 2021-10-21 14:02:40 +0300 | |
|---|---|---|
| committer | 2021-12-26 08:59:03 +0200 | |
| commit | 4cd454a205069965463515e2068190f56b0e4206 (patch) | |
| tree | a666387eb2b655c27385c5b96fe457e685c1e4cd | |
| parent | habanalabs: modify wait for boot fit in dynamic FW load (diff) | |
habanalabs/gaudi: recover from CPU WD event

There are rare cases where the device CPU's watchdog has expired and as
a result, the watchdog reset has happened and the CPU will now move to
running its preboot f/w.

When that happens, the driver will only know that a heartbeat failure
occurred. As a result, the driver will send a message to the CPU's main
f/w asking it to reset the device, but because the CPU is now running
preboot, it won't respond and the re-initialization process will later
fail when trying to load the f/w.

The solution is to send the request to the preboot as well, only if the
reset was caused because of HB failure.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
| -rw-r--r-- | drivers/misc/habanalabs/gaudi/gaudi.c | 20 |
1 files changed, 19 insertions, 1 deletions
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 825737dfe381..d2b7ecb45497 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
 /*
- * Copyright 2016-2020 HabanaLabs, Ltd.
+ * Copyright 2016-2021 HabanaLabs, Ltd.
  * All Rights Reserved.
  */
 
@@ -4296,6 +4296,24 @@ static void gaudi_hw_fini(struct hl_device *hdev, bool hard_reset, bool fw_reset
 
 		WREG32(irq_handler_offset,
 			gaudi_irq_map_table[GAUDI_EVENT_HALT_MACHINE].cpu_id);
+
+		/* This is a hail-mary attempt to revive the card in the small chance that the
+		 * f/w has experienced a watchdog event, which caused it to return back to preboot.
+		 * In that case, triggering reset through GIC won't help. We need to trigger the
+		 * reset as if Linux wasn't loaded.
+		 *
+		 * We do it only if the reset cause was HB, because that would be the indication
+		 * of such an event.
+		 *
+		 * In case watchdog hasn't expired but we still got HB, then this won't do any
+		 * damage.
+		 */
+		if (hdev->curr_reset_cause == HL_RESET_CAUSE_HEARTBEAT) {
+			if (hdev->asic_prop.hard_reset_done_by_fw)
+				hl_fw_ask_hard_reset_without_linux(hdev);
+			else
+				hl_fw_ask_halt_machine_without_linux(hdev);
+		}
 	} else {
 		if (hdev->asic_prop.hard_reset_done_by_fw)
 			hl_fw_ask_hard_reset_without_linux(hdev);
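
To make the resulting control flow easier to review, here is a minimal standalone sketch of the decision the patched `gaudi_hw_fini()` appears to make about when a request also goes to preboot. It is not driver code. The types `reset_ctx`, `reset_cause`, `preboot_request` and the function `pick_preboot_request()` are hypothetical stand-ins, and the `linux_fw_loaded` flag is an assumption: the outer `if` condition lies outside the hunk and is only inferred here from the `*_without_linux` helper names.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the habanalabs definitions. */
enum reset_cause {
	RESET_CAUSE_UNKNOWN,
	RESET_CAUSE_HEARTBEAT,
};

enum preboot_request {
	PREBOOT_REQ_NONE,         /* only the GIC path to the main f/w is used */
	PREBOOT_REQ_HARD_RESET,   /* models hl_fw_ask_hard_reset_without_linux() */
	PREBOOT_REQ_HALT_MACHINE, /* models hl_fw_ask_halt_machine_without_linux() */
};

struct reset_ctx {
	bool linux_fw_loaded;       /* assumption: driver believes the main f/w runs */
	bool hard_reset_done_by_fw; /* mirrors asic_prop.hard_reset_done_by_fw */
	enum reset_cause cause;     /* mirrors hdev->curr_reset_cause */
};

/*
 * Decide which request, if any, is sent to preboot.
 *
 * When the main f/w is believed to be running, the reset is normally
 * triggered through the GIC only. After the patch, a heartbeat failure
 * additionally sends the preboot request, because the CPU may have fallen
 * back to preboot after a watchdog reset and would ignore the GIC path.
 */
static enum preboot_request pick_preboot_request(const struct reset_ctx *ctx)
{
	if (ctx->linux_fw_loaded && ctx->cause != RESET_CAUSE_HEARTBEAT)
		return PREBOOT_REQ_NONE;

	return ctx->hard_reset_done_by_fw ? PREBOOT_REQ_HARD_RESET
					  : PREBOOT_REQ_HALT_MACHINE;
}
```

In this model a heartbeat-triggered reset with `hard_reset_done_by_fw` set yields `PREBOOT_REQ_HARD_RESET` even when the main f/w is thought to be loaded, while any other reset cause in that state yields `PREBOOT_REQ_NONE`, which matches the behavior before the patch.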
