From 99398e12e8dd62754ef571f8ff92729805e2253b Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar
Date: Wed, 7 Dec 2016 15:06:59 -0800
Subject: [PATCH] net/mlx4_core: panic the system on unrecoverable errors

Mellanox catastrophic error recovery after device reset doesn't work
and in fact leaves the node unusable on the IB network, since the HCA's
ports go down. At times a hard reset is needed to get the system
rebooted, which is a real problem in a production environment.

Once the network outage is detected, the unreachable node gets evicted
and rebooted on engineered systems using reboot, so a hung reboot
command is problematic.

So the idea is to let the kernel panic, which can recover the system on
its own with the necessary logs captured. There was a debate on whether
to use panic or machine restart, but it was agreed to use panic instead
of a silent reboot since that's the preferred option.

There is a Mellanox case open to investigate this issue. As such this
is a rare scenario, and even if the issue is fixed, the fix is expected
to avoid reaching the catas error case. This panic is limited to the
error case only.

Reviewed-by: Yuval Shaia
Reviewed-by: Mukesh Kacker
Signed-off-by: Santosh Shilimkar

Orabug: 25873690

This is a change taken from QU2, it is not upstream.
(cherry picked from commit 271d694b34bd22e5632eaad41ea1d9a47f1bde3a)
Signed-off-by: Jack Vogel
---
 drivers/net/ethernet/mellanox/mlx4/catas.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 715de8affcc9..91d8c925b778 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -188,6 +188,14 @@ void mlx4_enter_error_state(struct mlx4_dev_persistent *persist)
 	mlx4_err(dev, "device was reset successfully\n");
 	mutex_unlock(&persist->device_state_mutex);
 
+	/* Mellanox device reset and recovery has never worked and
+	 * in fact ends up hanging the system which needs a hard reboot
+	 * of the system. Instead of waiting for recovery which is never
+	 * going to happen, just panic the system so that it can capture
+	 * all the necessary logs/vmcore and let the node shut down gracefully.
+	 */
+	panic("MLX4 device reset due to unrecoverable catastrophic failure\n");
+
 	/* At that step HW was already reset, now notify clients */
 	mlx4_dispatch_event(dev, MLX4_DEV_EVENT_CATASTROPHIC_ERROR, 0);
 	mlx4_cmd_wake_completions(dev);
-- 
2.50.1