After a GPU reset with VRAM loss, a general protection fault occurs
during user queue restoration when amdgpu_vm_bo_reset_state_machine()
dereferences vm_bo->vm after releasing the spinlock.

The root cause is that vm_bo is the cursor left over from the
list_for_each_entry() loop. Once the loop has finished and the
spinlock has been released, vm_bo no longer refers to a valid entry,
so reading vm_bo->vm dereferences an invalid pointer and faults.

Fix this by passing the vm pointer that is already in scope to
amdgpu_vm_assert_locked() instead of vm_bo->vm.
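
The cursor problem can be reproduced outside the kernel. The following is a minimal userspace sketch, not the driver code: struct vm, struct vm_bo_base and cursor_aliases_head_after_full_walk() are hypothetical, simplified stand-ins, and the list helpers are reduced copies of the <linux/list.h> ones (they rely on the GNU __typeof__ extension). It assumes the loop runs to completion, in which case the cursor ends up as container_of() applied to the list head rather than to a real entry.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced userspace stand-ins for the <linux/list.h> helpers. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                            \
	for (pos = container_of((head)->next, __typeof__(*pos), member);  \
	     &pos->member != (head);                                      \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical, simplified stand-ins for the amdgpu structures. */
struct vm { struct list_head invalidated; };

struct vm_bo_base {
	struct vm *vm;
	int moved;
	struct list_head vm_status;
};

/* Returns 1 if the cursor aliases the list head after a full walk. */
static int cursor_aliases_head_after_full_walk(void)
{
	struct vm vm;
	struct vm_bo_base a = { .vm = &vm }, b = { .vm = &vm };
	struct vm_bo_base *vm_bo;

	vm.invalidated.next = vm.invalidated.prev = &vm.invalidated;
	list_add_tail(&a.vm_status, &vm.invalidated);
	list_add_tail(&b.vm_status, &vm.invalidated);

	list_for_each_entry(vm_bo, &vm.invalidated, vm_status)
		vm_bo->moved = 1;

	/* Both real entries were visited inside the loop. */
	assert(a.moved && b.moved);

	/*
	 * Only the address is compared here; vm_bo is not dereferenced.
	 * It is neither &a nor &b, so vm_bo->vm would read whatever
	 * memory happens to sit in front of the list head.
	 */
	return &vm_bo->vm_status == &vm.invalidated;
}
```

Calling cursor_aliases_head_after_full_walk() from a test harness shows why a pointer that stays valid for the whole function, such as vm, is the safer argument once the loop is over.
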
Crash log shows:
[ 326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI
[ 326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G E 6.16.0+ #25 PREEMPT(voluntary)
[ 326.981826] Tainted: [E]=UNSIGNED_MODULE
[ 326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024
[ 326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[ 326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu]
[ 326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05
[ 326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206
[ 326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000
[ 326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a
[ 326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001
[ 326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000
[ 326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000
[ 326.982110] FS: 0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000
[ 326.982112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0
[ 326.982116] PKRU: 55555554
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
 		vm_bo->moved = true;
 	spin_unlock(&vm->invalidated_lock);
-	amdgpu_vm_assert_locked(vm_bo->vm);
+	amdgpu_vm_assert_locked(vm);
 	list_for_each_entry_safe(vm_bo, tmp, &vm->idle, vm_status) {
 		struct amdgpu_bo *bo = vm_bo->bo;