www.infradead.org Git - users/jedix/linux-maple.git/commitdiff

Vladimir Oltean says:

====================
Fix rtnl_mutex deadlock with DPAA2 and SFP modules

This patch set deliberately targets net-next and lacks Fixes: tags due
to caution on my part.

While testing some SFP modules on the Solidrun Honeycomb LX2K platform,
I noticed that rebooting causes a deadlock:

============================================
WARNING: possible recursive locking detected
6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Not tainted
--------------------------------------------
systemd-shutdow/1 is trying to acquire lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

but task is already holding lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

other info that might help us debug this:
Possible unsafe locking scenario:

       CPU0
       ----
  lock(rtnl_mutex);
  lock(rtnl_mutex);

*** DEADLOCK ***

May be due to missing lock nesting notation

6 locks held by systemd-shutdow/1:
#0: ffffa62db6863c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xd4/0x260
#1: ffff2f2b0176f100 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xf4/0x260
#2: ffff2f2b017be900 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x104/0x260
#3: ffff2f2b017680f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#4: ffff2f2b0e1608f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#5: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

stack backtrace:
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656
Hardware name: SolidRun LX2160A Honeycomb (DT)
Call trace:
lock_acquire+0x68/0x84
__mutex_lock+0x98/0x460
mutex_lock_nested+0x2c/0x40
rtnl_lock+0x1c/0x30
sfp_bus_del_upstream+0x1c/0xac
phylink_destroy+0x1c/0x50
dpaa2_mac_disconnect+0x28/0x70
dpaa2_eth_remove+0x1dc/0x1f0
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_links_unbind_consumers+0xe0/0x110
device_release_driver_internal+0x138/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_device_remove+0x28/0x40
__fsl_mc_device_remove+0x10/0x20
device_for_each_child+0x5c/0xac
dprc_remove+0x94/0xb4
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_bus_remove+0x8c/0x10c
fsl_mc_bus_shutdown+0x10/0x20
platform_shutdown+0x24/0x3c
device_shutdown+0x15c/0x260
kernel_restart+0x40/0xa4
__do_sys_reboot+0x1e4/0x260
__arm64_sys_reboot+0x24/0x30

But fixing this appears to be not so simple. The patch set represents my
attempt to address it.

In short, the problem is that dpaa2_mac_connect() and dpaa2_mac_disconnect()
call 2 phylink functions in a row, one takes rtnl_lock() itself -
phylink_create(), and one which requires rtnl_lock() to be held by the
caller - phylink_fwnode_phy_connect(). The existing approach in the
drivers is too simple. We take rtnl_lock() when calling dpaa2_mac_connect(),
which is what results in the deadlock.

Fixing just that creates another problem. The drivers make use of
rtnl_lock() for serializing with other code paths too. I think I've
found all those code paths, and established other mechanisms for
serializing with them.
====================

Link: https://lore.kernel.org/r/20221129141221.872653-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Trivial merge

author	Paolo Abeni <pabeni@redhat.com>
	Thu, 1 Dec 2022 12:40:25 +0000 (13:40 +0100)
committer	Paolo Abeni <pabeni@redhat.com>
	Thu, 1 Dec 2022 12:40:25 +0000 (13:40 +0100)