Mellanox catastrophic error recovery after a device reset does not work
and in fact leaves the node unusable on the IB network, since the HCA's
ports go down. At times a hard reset is needed to get the system
rebooted, which is a real problem in a production environment. Once the
network outage is detected, the unreachable node is evicted and rebooted
on engineered systems using the reboot command, so a hung reboot command
is problematic. The idea is to let the kernel panic, so that the system
can recover on its own with the necessary logs captured. There was a
debate on whether to use panic or machine restart, but it was agreed to
use panic instead of a silent reboot since that is the preferred option.
There is a Mellanox case open to investigate this issue. This is a rare
scenario, and once the issue is fixed the catastrophic error case is
expected to be avoided. The panic is limited to this error case only.
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 25873690
This change is taken from QU2; it is not upstream.
(cherry picked from commit 271d694b34bd22e5632eaad41ea1d9a47f1bde3a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
mlx4_err(dev, "device was reset successfully\n");
mutex_unlock(&persist->device_state_mutex);
+ /* Mellanox device reset and recovery has never worked and
+ * in fact ends up hanging the system, which then needs a hard
+ * reboot. Instead of waiting for a recovery that is never
+ * going to happen, just panic the system so that it can capture
+ * all the necessary logs/vmcore and let the node shut down
+ * gracefully.
+ */
+ panic("MLX4 device reset due to unrecoverable catastrophic failure\n");
+
/* At that step HW was already reset, now notify clients */
mlx4_dispatch_event(dev, MLX4_DEV_EVENT_CATASTROPHIC_ERROR, 0);
mlx4_cmd_wake_completions(dev);