David Belanger [Mon, 29 May 2023 17:23:08 +0000 (13:23 -0400)]
drm/amdkfd: Added gfx_v12_kfd2kgd interface for GFX12.
Initial implementation, based on GFX11.
v2: Removed functions not needed by cp scheduler.
v3: Fixed typos.
v4: squash in warning fix (Alex)
Signed-off-by: David Belanger <david.belanger@amd.com> Acked-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
shaoyunl [Wed, 22 Nov 2023 19:34:11 +0000 (14:34 -0500)]
drm/amdgpu: Enable unmapped doorbell handling basic mode on mes 12
Enable basic mode handling for doorbell ring on unmapped CP queue.
In this mode, MES can start schedule the queue mapping based on HW
interrupt instead of timer.
Tom St Denis [Thu, 24 Aug 2023 13:23:04 +0000 (09:23 -0400)]
drm/amd/amdgpu: update GFX12 wave data registers
Update the registers for gfx12.
Signed-off-by: Tom St Denis <tom.stdenis@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jack Xiao [Mon, 7 Aug 2023 07:55:25 +0000 (15:55 +0800)]
drm/amdgpu/gfx12: recalculate available compute rings to use
Recalculate the number of compute rings to use based on
the gfx hardware configuration. As needed reserve half of
compute rings for mes, kgd can't use up all compute rings.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Add gfx v12_0 ip block support. (Likun)
v2: Switch to gfx.kiq array.
Move the vmhub from ring callback to ring. (Hawking)
v3: Update various callback function impl. (Hawking)
v4: Warning fixes (Alex)
v5: squash in imu fix, csb, rlc autoload implementations (Alex)
v6: Rebase (Alex)
Yunxiang Li [Fri, 26 Apr 2024 03:15:28 +0000 (23:15 -0400)]
drm/amdgpu: Move ras resume into SRIOV function
This is part of the reset, move it into the reset function.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdxcp is a platform driver for creating partition devices. libdrm
library identifies a platform device based on 'OF_FULLNAME' or
'MODALIAS'. If two or more devices have the same platform name, drm
library only picks the first device. Platform driver core uses name of
the device to populate 'MODALIAS'. When 'amdgpu_xcp' is used as the base
name, only first partition device gets identified. Assign unique name so
that drm library identifies partition devices separately.
amdxcp doesn't support probe of partitions, it doesn't bother about
modaliases.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Peyton Lee [Tue, 30 Apr 2024 14:09:09 +0000 (22:09 +0800)]
drm/amdgpu/vpe: fix vpe dpm clk ratio setup failed
Some version of BIOS does not enable all clock levels,
resulting in high level clock frequency of 0.
The number of valid CLKs must be confirmed in advance.
Signed-off-by: Peyton Lee <peytolee@amd.com> Reviewed-by: Lang Yu <lang.yu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Remove redundant NULL check in dcn10_set_input_transfer_func
This commit removes an unnecessary NULL check in the
`dcn10_set_input_transfer_func` function in the `dcn10_hwseq.c` file.
The variable `tf` is assigned the address of
`plane_state->in_transfer_func` unconditionally, so it can never be
`NULL`. Therefore, the check `if (tf == NULL)` is unnecessary and has
been removed.
Fixes the below smatch warning:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn10/dcn10_hwseq.c:1839 dcn10_set_input_transfer_func() warn: address of 'plane_state->in_transfer_func' is non-NULL
Fixes: 285a7054bf81 ("drm/amd/display: Remove plane and stream pointers from dc scratch") Cc: Wenjing Liu <wenjing.liu@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Hersen Wu <hersenxs.wu@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Remove redundant NULL check in dce110_set_input_transfer_func
This commit removes a redundant NULL check in the
`dce110_set_input_transfer_func` function in the `dce110_hwseq.c` file.
The variable `tf` is assigned the address of
`plane_state->in_transfer_func` unconditionally, so it can never be
`NULL`. Therefore, the check `if (tf == NULL)` is unnecessary and has
been removed.
Fixes the below smatch warning:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dce110/dce110_hwseq.c:301 dce110_set_input_transfer_func() warn: address of 'plane_state->in_transfer_func' is non-NULL
Fixes: 285a7054bf81 ("drm/amd/display: Remove plane and stream pointers from dc scratch") Cc: Wenjing Liu <wenjing.liu@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Hersen Wu <hersenxs.wu@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.
Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yunxiang Li [Mon, 22 Apr 2024 18:44:38 +0000 (14:44 -0400)]
drm/amdgpu: Add reset_context flag for host FLR
There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yunxiang Li [Mon, 22 Apr 2024 18:59:02 +0000 (14:59 -0400)]
drm/amdgpu: Fix two reset triggered in a row
Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule after we call amdgpu_device_stop_pending_resets.
Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.
Remove unnecessary and incorrect checks for amdgpu_in_reset that was
kinda serving this purpose.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
These lines are indented too far. Clean the whitespace.
Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Colin Ian King [Wed, 24 Apr 2024 16:28:09 +0000 (17:28 +0100)]
drm/amd/display: Fix spelling various spelling mistakes
There are various spelling mistakes in dml2_printf messages, fix them.
Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Avoid -Wenum-float-conversion in add_margin_and_round_to_dfs_grainularity()
When building with clang 19 or newer (which strengthened some of the
enum conversion warnings for C), there is a warning (or error with
CONFIG_WERROR=y) around doing arithmetic with an enumerated type and a
floating point expression.
drivers/gpu/drm/amd/amdgpu/../display/dc/dml2/dml21/src/dml2_dpmm/dml2_dpmm_dcn4.c:181:58: error: arithmetic between enumeration type 'enum dentist_divider_range' and floating-point type 'double' [-Werror,-Wenum-float-conversion]
181 | divider = (unsigned int)(DFS_DIVIDER_RANGE_SCALE_FACTOR * (vco_freq_khz / clock_khz));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
This conversion is expected due to the nature of the enumerated value
and definition, so silence the warning by casting the enumeration to an
integer explicitly to make it clear to the compiler.
Fixes: 70839da63605 ("drm/amd/display: Add new DCN401 sources") Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Tue, 30 Apr 2024 14:53:23 +0000 (09:53 -0500)]
drm/amd/display: Disable panel replay by default for now
Panel replay was enabled by default in commit 5950efe25ee0
("drm/amd/display: Enable Panel Replay for static screen use case"), but
it isn't working properly at least on some BOE and AUO panels. Instead
of being static the screen is solid black when active. As it's a new
feature that was just introduced that regressed VRR disable it for now
so that problem can be properly root caused.
Cc: Tom Chung <chiahsuan.chung@amd.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3344 Fixes: 5950efe25ee0 ("drm/amd/display: Enable Panel Replay for static screen use case") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Mon, 29 Apr 2024 03:17:54 +0000 (11:17 +0800)]
drm/amd/pm: fix uninitialized variable warning for smu_v13
Clear warning that using uninitialized variable when the dpm is
not enabled and reuse the code for SMU13 to get the boot frequency.
Signed-off-by: Tim Huang <Tim.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Add sdma v7_0 ip block support. (Likun)
v2: Move vmhub from ring_funcs to ring. (Hawking)
v3: Switch to AMDGPU_GFXHUB(0). (Hawking)
v4: Move microcode init into early_init. (Likun)
v5: Fix warnings (Alex)
v6: Squash in various fixes (Alex)
v7: Rebase (Alex)
v8: Rebase (Alex)
Judging from the product name this might be a clone of a
BOE panel, but with larger dimensions.
Panel frequently shows non-functional backlight control. Adding
some debug prints to update_connector_ext_caps() shows that
something the OLED bit of ext_caps is set, and then the driver
assumes that backlight is controlled via AUX.
Forcing backlight control to PWM via amdgpu.backlight=0 restores
backlight operation.
Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Philip Yang [Mon, 8 Apr 2024 13:26:56 +0000 (09:26 -0400)]
drm/amdkfd: Bump kfd version for contiguous VRAM allocation
Bump the kfd ioctl minor version to delcare the contiguous VRAM
allocation flag support.
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: Add gmc v12_0 ip block support.
v2: Switch to gfx.kiq array.
v3: Switch to vmhubs_mask.
v4: Switch to AMDGPU_MMHUB0(0) and AMDGPU_GFXHUB(0)
v5: Rebase (Alex)
v6: Squash in fixes for AGP handling, gfxhub init order,
vmhub index (Alex)
v7: Rebase (Alex)
v8: squash in ecc fix (Alex)
Lancelot SIX [Wed, 10 Apr 2024 13:14:13 +0000 (14:14 +0100)]
drm/amdkfd: Flush the process wq before creating a kfd_process
There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3). In this scenario:
- The process executes exec.
- This will eventually release the process's mm, which will cause the
kfd_process object associated with the process to be freed
(kfd_process_free_notifier decrements the reference count to the
kfd_process to 0). This causes kfd_process_ref_release to enqueue
kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
- When handling this request, KFD tries to re-create a kfd_process.
This eventually calls kfd_create_process and kobject_init_and_add.
At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.
This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object. This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.
Signed-off-by: Lancelot SIX <lancelot.six@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Philip Yang [Fri, 5 Apr 2024 20:02:50 +0000 (16:02 -0400)]
drm/amdkfd: Evict BO itself for contiguous allocation
If the BO pages pinned for RDMA is not contiguous on VRAM, evict it to
system memory first to free the VRAM space, then allocate contiguous
VRAM space, and then move it from system memory back to VRAM.
v6: user context should use interruptible call (Felix)
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Smatch complains because some lines are indented more than they should
be. I went a bit crazy re-indenting this. ;)
The comments were not useful except as a marker of things which are left
to implement so I deleted most of them except for the TODO.
I introduced a "data" pointer so that I could replace
"scl_data->dscl_prog_data." with just "data->" and shorten the lines a
bit. It's more readable without the line breaks.
I also tried to align it so you can see what is changing on each line.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Fri, 26 Apr 2024 04:52:45 +0000 (12:52 +0800)]
drm/amd/pm: fix uninitialized variable warning for smu8_hwmgr
Clear warnings that using uninitialized value level when fails
to get the value from SMU.
Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Check the return of function smum_send_msg_to_smc
as it may fail to initialize the variable.
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Thu, 25 Apr 2024 05:15:27 +0000 (13:15 +0800)]
drm/amdgpu: fix overflowed array index read warning
Clear overflowed array index read warning by cast operation.
Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Thu, 25 Apr 2024 03:09:00 +0000 (11:09 +0800)]
drm/amdgpu: fix potential resource leak warning
Clear resource leak warning that when the prepare fails,
the allocated amdgpu job object will never be released.
Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yang Wang [Tue, 23 Apr 2024 02:14:47 +0000 (10:14 +0800)]
drm/amdgpu: avoid dump mca bank log muti times during ras ISR
because the ue valid mca count will only be cleared after gpu reset,
so only dump mca log on the first time to get mca bank after receive RAS interrupt.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yang Wang [Thu, 18 Apr 2024 07:46:00 +0000 (15:46 +0800)]
drm/amdgpu: add MCA smu cache support
v1:
because SMU CE valid mca bank will be cleared after reading,
this patch adds mca cache at the driver level to ensure that the mca bank is not lost.
v2:
refine amdgpu_mca_init/fini/reset() function name.
v3:
add mca_cache.lock support
only add CE bank to mca bank cache.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>