Ilia Levi [Thu, 13 Feb 2025 09:35:59 +0000 (11:35 +0200)]
drm/xe: Add xe_mmio_init() initialization function
Add a convenience function for minimal initialization of struct xe_mmio.
This function also validates that the entirety of the provided mmio region
is usable with struct xe_reg.
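The validation described above can be pictured with a small userspace sketch. The 28-bit address width, `struct demo_mmio` and `demo_mmio_init()` are assumptions made for illustration only; they are not the driver's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of bits available to encode a register offset. */
#define DEMO_REG_ADDR_BITS 28u

struct demo_mmio {
	void *regs;        /* base of the mapped region */
	size_t regs_size;  /* size of the region in bytes */
};

/*
 * Minimal init: record base and size, and reject regions that are empty
 * or larger than what DEMO_REG_ADDR_BITS bits of offset can address.
 * Returns 0 on success, -1 on an unusable region.
 */
int demo_mmio_init(struct demo_mmio *mmio, void *base, size_t size)
{
	const uint64_t max_size = (uint64_t)1 << DEMO_REG_ADDR_BITS;

	if (!mmio || !base || size == 0 || (uint64_t)size > max_size)
		return -1;

	mmio->regs = base;
	mmio->regs_size = size;
	return 0;
}
```

Doing the size check once at init time means later register accesses do not each need to re-validate that their offset is representable.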
Lucas De Marchi [Thu, 13 Feb 2025 19:29:07 +0000 (11:29 -0800)]
drm/xe/oa: Handle errors in xe_oa_register()
Let xe_oa_unregister() be handled by devm infra since it's only putting
the kobject. Also, since kobject_create_and_add may fail, handle the
error accordingly.
Lucas De Marchi [Thu, 13 Feb 2025 19:29:06 +0000 (11:29 -0800)]
drm/xe: Move drm_dev_unplug() out of display function
This is not really display-related and needed for any sequence on driver
removal that has to interact with drm_dev_enter()/drm_dev_exit().
Just remove xe_device_remove_display() and inline it in the single
caller to make clear this is not done only for display.
Lucas De Marchi [Thu, 13 Feb 2025 19:29:05 +0000 (11:29 -0800)]
drm/xe/oa: Move fini to xe_oa
Like done with other functions, cleanup the error handling in
xe_device_probe() by moving the OA fini to be handled by xe_oa
itself, which relies on devm to call the cleanup function.
Lucas De Marchi [Thu, 13 Feb 2025 19:29:04 +0000 (11:29 -0800)]
drm/xe: Cleanup extra calls to xe_hw_fence_irq_finish()
Now that xe_gt_remove is handled entirely by xe_gt, it's clear there are
some extra calls to xe_hw_fence_irq_finish() that aren't necessary.
Neither all_fw_domain_init() nor gt_fw_domain_init() needs to do that
since it's handled by the caller on any error.
Lucas De Marchi [Thu, 13 Feb 2025 19:29:03 +0000 (11:29 -0800)]
drm/xe: Cleanup unwind of gt initialization
The only thing in xe_gt_remove() that really needs to happen on the
device remove callback is the xe_uc_remove(). That's because of the
following call chain:
Move xe_gsc_proxy_remove() to be handled as a xe_device_remove_action,
so it's recorded when it should run during device removal. The rest can
be handled normally by devm infra.
Besides removing the deep call chain above, xe_device_probe() doesn't
have to unwind the gt loop and it's also more in line with the
xe_device_probe() style.
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250213192909.996148-7-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Lucas De Marchi [Thu, 13 Feb 2025 19:29:02 +0000 (11:29 -0800)]
drm/xe: Remove leftover pxp comment
Not being able to initialize pxp is fatal if the platform is expected to
have it. Update comment after commit 9c9dc9ba4a00 ("drm/xe/pxp: Fail the
load if PXP fails to initialize").
Lucas De Marchi [Thu, 13 Feb 2025 19:28:59 +0000 (11:28 -0800)]
drm/xe: Fix error handling in xe_irq_install()
When devm_add_action_or_reset() fails, it already calls the function
passed as parameter and that function is already free'ing the irqs.
Drop the goto and just return.
The caller, xe_device_probe(), should also do the same thing instead of
wrongly doing `goto err` and calling the unrelated xe_display_fini()
function.
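The "or reset" behaviour this fix relies on can be modelled in userspace as follows. All `demo_*` names are hypothetical stand-ins, and the single-slot registry is a simplification; only the rule "if registration fails, the action runs immediately" is taken from the description above.

```c
#include <stdbool.h>

typedef void (*demo_action_fn)(void *data);

/* Hypothetical stand-in for a managed-resource registry with one slot. */
struct demo_devres {
	demo_action_fn action;
	void *data;
	bool fail_registration;  /* simulate a registration failure */
};

/*
 * Model of "add action or reset": if the action cannot be registered,
 * it is invoked right away so the caller has nothing left to undo.
 */
int demo_add_action_or_reset(struct demo_devres *dr, demo_action_fn fn, void *data)
{
	if (dr->fail_registration) {
		fn(data);
		return -1;
	}
	dr->action = fn;
	dr->data = data;
	return 0;
}

struct demo_irq {
	int installed;
	int free_calls;
};

void demo_irq_free(void *data)
{
	struct demo_irq *irq = data;

	irq->installed = 0;
	irq->free_calls++;
}

/* Install path: on failure just return the error, no goto and no unwind. */
int demo_irq_install(struct demo_devres *dr, struct demo_irq *irq)
{
	irq->installed = 1;
	return demo_add_action_or_reset(dr, demo_irq_free, irq);
}
```

In this model, a caller that also freed the irqs after a failed install would run the cleanup twice, which is the kind of mistake the commit removes.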
Lucas De Marchi [Thu, 13 Feb 2025 19:28:58 +0000 (11:28 -0800)]
drm/xe: Fix xe_display_fini() calls
xe_display_fini() undoes things from xe_display_init() (technically from
intel_display_driver_probe()). Those `goto err` in xe_device_probe()
were wrong and being accumulated over time.
Commit 65e366ace5ee ("drm/xe/display: Use a single early init call for
display") made it easier to fix now that we don't have xe_display_* init
calls spread on xe_device_probe(). Change xe_display_init() to use
devm_add_action_or_reset() that will finalize display in the right
order.
While at it, also add a newline and comment about calling
xe_driver_flr_fini.
Lucas De Marchi [Thu, 13 Feb 2025 19:28:57 +0000 (11:28 -0800)]
drm/xe: Add callback support for driver remove
xe device probe uses devm cleanup in most places. However there are a
few cases where this is not possible: when the driver interacts with
component add/del. In that case, the resource group would be cleaned up
while the entire device resources are in the process of cleanup. One
example is the xe_gsc_proxy and display using that to interact with mei
and audio.
Add a callback-based remove so the exception doesn't make the probe
use multiple error handling styles.
v2: Change internal API to mimic the devm API. This will make it easier
to migrate in future when devm can be used.
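A callback-based remove that mimics the devm calling convention might look like the userspace sketch below. The fixed-size array, the reverse-order execution and every `demo_*` name are assumptions for illustration; the commit text only states that the internal API mirrors the devm one.

```c
#include <stddef.h>

#define DEMO_MAX_REMOVE_ACTIONS 8

typedef void (*demo_remove_fn)(void *arg);

struct demo_device {
	struct {
		demo_remove_fn fn;
		void *arg;
	} actions[DEMO_MAX_REMOVE_ACTIONS];
	size_t nactions;
};

/* Register a callback for remove time; if that fails, run it right away. */
int demo_add_remove_action(struct demo_device *dev, demo_remove_fn fn, void *arg)
{
	if (dev->nactions == DEMO_MAX_REMOVE_ACTIONS) {
		fn(arg);
		return -1;
	}
	dev->actions[dev->nactions].fn = fn;
	dev->actions[dev->nactions].arg = arg;
	dev->nactions++;
	return 0;
}

/* Run the callbacks in reverse registration order, like an unwind. */
void demo_run_remove_actions(struct demo_device *dev)
{
	while (dev->nactions > 0) {
		dev->nactions--;
		dev->actions[dev->nactions].fn(dev->actions[dev->nactions].arg);
	}
}

/* Example callback state: records the order in which steps were removed. */
struct demo_trace {
	int order[DEMO_MAX_REMOVE_ACTIONS];
	size_t n;
};

struct demo_step {
	struct demo_trace *trace;
	int id;
};

void demo_step_remove(void *arg)
{
	struct demo_step *step = arg;

	step->trace->order[step->trace->n++] = step->id;
}
```

Because registration and failure handling follow the same shape as the devm helpers, a probe function can use one error handling style for both kinds of cleanup.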
Xin Wang [Thu, 13 Feb 2025 22:36:15 +0000 (06:36 +0800)]
drm/xe/debugfs: fixed the return value of wedged_mode_set
It is generally expected that the write() function should return a
positive value indicating the number of bytes written or a negative
error code if an error occurs. Returning 0 is unusual and can lead
to unexpected behavior.
When the user program writes the same value to wedged_mode twice in
a row, a lockup will occur, because the value expected to be
returned by the write() function inside the program should be equal
to the actual written value instead of 0.
To reproduce the issue:
echo 1 > /sys/kernel/debug/dri/0/wedged_mode
echo 1 > /sys/kernel/debug/dri/0/wedged_mode <- lockup here
Signed-off-by: Xin Wang <x.wang@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Fei Yang <fei.yang@intel.com>
Cc: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250213223615.2327367-1-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
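Why a write handler returning 0 can hang its caller is easy to see in a userspace model. The sketch below assumes a caller that keeps retrying until every byte has been accepted; the `demo_*` names and the attempt limit are hypothetical and exist only so the example terminates.

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

typedef ssize_t (*demo_write_fn)(const char *buf, size_t size);

/* Buggy handler: reports 0 bytes consumed. */
ssize_t demo_write_returns_zero(const char *buf, size_t size)
{
	(void)buf;
	(void)size;
	return 0;
}

/* Fixed handler: reports the full size as consumed on success. */
ssize_t demo_write_returns_size(const char *buf, size_t size)
{
	(void)buf;
	return (ssize_t)size;
}

/*
 * Model of a caller that retries until all bytes are accepted.
 * Returns the number of attempts used, or -1 if it hit max_attempts
 * (standing in for "loops forever") or the handler failed.
 */
int demo_write_all(demo_write_fn fn, const char *buf, size_t size, int max_attempts)
{
	size_t done = 0;
	int attempts = 0;

	while (done < size) {
		ssize_t n;

		if (attempts == max_attempts)
			return -1;
		attempts++;
		n = fn(buf + done, size - done);
		if (n < 0)
			return -1;
		done += (size_t)n;
	}
	return attempts;
}
```

With the fixed handler the loop finishes after one call; with the buggy one `done` never advances, so only the artificial attempt limit ends the loop.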
Rodrigo Vivi [Wed, 12 Feb 2025 19:24:47 +0000 (14:24 -0500)]
drm/xe/display: Remove hpd cancel work sync from runtime pm path
This function will synchronously cancel and wait for many display
work queue items, which might try to take the runtime pm reference
causing a bad deadlock. So, remove it from the runtime_pm suspend path.
Lucas De Marchi [Fri, 31 Jan 2025 17:17:16 +0000 (09:17 -0800)]
drm/xe/debugfs: Add node to dump guc log to dmesg
Currently xe_guc_log_print_dmesg() is unused, as developers are expected
to add those calls when needed. However it makes it hard to
guarantee it's working as nothing is testing it. Add a node in debugfs
so it can be tested. This is purely for testing purposes since with the
device probed and working, the guc log can be obtained by the regular
debugfs file.
Nirmoy Das [Mon, 10 Feb 2025 14:36:54 +0000 (15:36 +0100)]
drm/xe: Carve out wopcm portion from the stolen memory
The top of stolen memory is WOPCM, which shouldn't be accessed. Remove
this portion from the stolen memory region for discrete platforms.
This was already done for integrated, but was missing for discrete
platforms.
This also moves get_wopcm_size() so detect_bar2_dgfx() and
detect_bar2_integrated() can use the same function.
v2: Improve commit message and add suitable stable version tag (Lucas)
Fixes: d8b52a02cb40 ("drm/xe: Implement stolen memory.")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: stable@vger.kernel.org # v6.11+
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250210143654.2076747-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
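The carve-out amounts to simple arithmetic on the region size. A minimal sketch, assuming the WOPCM occupies the top `wopcm_size` bytes of the stolen region; `demo_usable_stolen_size()` is a hypothetical helper, not a driver function:

```c
#include <stdint.h>

/*
 * WOPCM sits at the top of stolen memory, so the usable part is the
 * range [0, stolen_size - wopcm_size). Returns 0 when the WOPCM would
 * cover the whole region, i.e. nothing is usable.
 */
uint64_t demo_usable_stolen_size(uint64_t stolen_size, uint64_t wopcm_size)
{
	if (wopcm_size >= stolen_size)
		return 0;
	return stolen_size - wopcm_size;
}
```

For example, with illustrative sizes of 64 MiB of stolen memory and a 2 MiB WOPCM, the helper reports 62 MiB as usable.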
Piotr Piórkowski [Mon, 10 Feb 2025 08:15:11 +0000 (09:15 +0100)]
drm/xe: Move VRAM manager to struct xe_vram_region
VRAM manager is related directly to struct xe_vram_region so it
should be inside this structure.
Let's move the VRAM manager to struct xe_vram_region.
v2:
- remove xe_vram_region pointer from xe_ttm_vram_mgr
- stop using dynamic allocation for xe_ttm_vram_mgr in xe_vram_region
- rename struct xe_ttm_vram_mgr vram_mgr to ttm
v3:
- fix "'ttm' not described in 'xe_vram_region'"
Piotr Piórkowski [Mon, 10 Feb 2025 08:15:10 +0000 (09:15 +0100)]
drm/xe: Rename struct xe_mem_region to struct xe_vram_region
The xe_mem_region structure has so far been used only in the context
of VRAM regions. Also, the description of its fields clearly indicates
that it was designed for VRAM regions. This struct is strictly related
only to VRAM.
So let's be clear on this point and rename it to xe_vram_region.
Piotr Piórkowski [Fri, 7 Feb 2025 11:31:11 +0000 (12:31 +0100)]
drm/xe/pf: Use an explicit check to see if the device has LMTT
So far, the main condition for using LMTT has been to check that
the device is a discrete gfx.
Let's add a dedicated function to check if the device supports LMTT
as not all future discrete GPU platforms will require LMTT.
v2:
- use xe_has_device_lmtt only when necessary - leave IS_DGFX for other
things related to LMEM provisioning
v3:
- remove IS_SRIOV_PF condition from xe_device_has_lmtt (Michal
Wajdeczko)
- keep IS_SRIOV_PF asserts in LMTT-related code (Michal Wajdeczko)
v4:
- update commit description
Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250207113111.853821-2-piotr.piorkowski@intel.com
Michal Wajdeczko [Thu, 6 Feb 2025 21:45:45 +0000 (22:45 +0100)]
drm/xe: Enable SR-IOV for PTL
We should now have sufficient changes in the driver to run it on
PTL platforms in the SR-IOV Physical Function (PF) mode, that would
allow us to enable SR-IOV Virtual Functions (VFs), and successfully
probe our driver in the VF mode on enabled VF devices.
To unblock SR-IOV PF mode you need to load xe with modparam:
Note that in default auto-provisioning all VFs are allocated with
some amount of shared resources (like unlimited GPU execution and
preemption times, fair GGTT space, fair GuC context IDs range, ...)
However with CONFIG_DEBUG_FS enabled it is possible to tweak most
of the SR-IOV configuration parameters using attributes like:
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Tested-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250206214545.940-1-michal.wajdeczko@intel.com
Francois Dugast [Thu, 6 Feb 2025 13:45:50 +0000 (14:45 +0100)]
drm/xe: Add stats for vma page faults
Add new entries in stats for vma page faults. If CONFIG_DEBUG_FS is
enabled, the count and number of bytes can be viewed per GT in the
stat debugfs file. This helps when testing, to confirm page faults
have been triggered as expected. It also helps when looking at the
performance impact of page faults. Data is simply collected when
entering the page fault handler so there is no indication whether
it completed successfully, with or without retries, etc.
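The accounting described above could be sketched in userspace as below. The stat identifiers, the plain (non-atomic) counters and the choice of the VMA size as the byte count are assumptions of this example, not details confirmed by the commit text.

```c
#include <stdint.h>

/* Hypothetical stat identifiers for this sketch. */
enum demo_stat_id {
	DEMO_STAT_VMA_PAGEFAULT_COUNT,
	DEMO_STAT_VMA_PAGEFAULT_BYTES,
	DEMO_STAT_MAX,
};

/* One set of counters per GT; real code would likely need atomics. */
struct demo_gt_stats {
	uint64_t counters[DEMO_STAT_MAX];
};

void demo_stats_incr(struct demo_gt_stats *stats, enum demo_stat_id id, uint64_t n)
{
	stats->counters[id] += n;
}

/*
 * Called on entry to the fault handler: the counters are bumped before
 * the outcome is known, so they say nothing about success or retries.
 */
void demo_handle_vma_pagefault(struct demo_gt_stats *stats, uint64_t vma_size)
{
	demo_stats_incr(stats, DEMO_STAT_VMA_PAGEFAULT_COUNT, 1);
	demo_stats_incr(stats, DEMO_STAT_VMA_PAGEFAULT_BYTES, vma_size);
	/* ... the actual fault handling would follow here ... */
}
```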
Michal Wajdeczko [Wed, 5 Feb 2025 12:01:50 +0000 (13:01 +0100)]
drm/xe: Don't treat SR-IOV platforms as reclaim unsafe
Since commit a4d1c5d0b99b ("drm/xe/pf: Move VFs reprovisioning
to worker") and commit 78d5d1e20d1d ("drm/xe/relay: Don't use
GFP_KERNEL for new transactions") we should have no more lockdep
dependencies on the reclaim path when running in the SRIOV mode
so we believe that we can now mark SRIOV driver as reclaim safe.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Tested-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250205120150.896-1-michal.wajdeczko@intel.com
Rodrigo Vivi [Thu, 9 Jan 2025 19:52:19 +0000 (14:52 -0500)]
drm/xe: Fix PVC RPe and RPa information
A simple lazy buggy copy and paste of the PVC comment has drawn
attention to the incorrect masks of the PVC register for RPa
and RPe. So, let's fix them all.
Raag Jadav [Fri, 31 Jan 2025 05:45:02 +0000 (11:15 +0530)]
drm/xe/hwmon: expose package and vram temperature
Add hwmon support for temp2_input and temp3_input attributes, which will
expose package and vram temperature in millidegree Celsius. With this in
place we can monitor temperature using lm-sensors tool.
drm/xe/pxp: Fail the load if PXP fails to initialize
The PXP implementation mimics the i915 approach of allowing the load
to continue even if PXP init has failed. On Xe however we're taking a
harder stance on boot errors and only allowing the load to complete if
everything is working, so update the code to fail if anything goes wrong
during PXP init.
While at it, update the return code in case of PXP not supported to be 0
instead of EOPNOTSUPP, to follow the standard of functions called by
xe_device_probe where every non-zero value means failure.
Since we have a preallocated pool of relay transactions, which
should cover all our normal relay use cases, we may use the
GFP_NOWAIT flag when allocating new outgoing transactions.
Sai Teja Pottumuttu [Thu, 30 Jan 2025 08:58:04 +0000 (14:28 +0530)]
drm/xe: Refactor max_remote_tiles
max_remote_tiles is more related to the platform than the GT IP. Thus
move it to platform descriptor from graphics descriptor. Note that the
FIXME is no longer required, thus it can be dropped.
v2: Rebase
v3: Change the position of comment (MattR)
The HW suspend flow kills all PXP HWDRM sessions, so we need to mark all
the queues and BOs as invalid and do a full termination when PXP is next
used.
v2: rebase
v3: rebase on new status flow, defer termination to next PXP use as it
makes things much easier and allows us to use the same function for all
types of suspend.
v4: fix the documentation of the suspend function (John)
drm/xe/pxp/uapi: Add API to mark a BO as using PXP
The driver needs to know if a BO is encrypted with PXP to enable the
display decryption at flip time.
Furthermore, we want to keep track of the status of the encryption and
reject any operation that involves a BO that is encrypted using an old
key. There are two points in time where such checks can kick in:
1 - at VM bind time, all operations except for unmapping will be
rejected if the key used to encrypt the BO is no longer valid. This
check is opt-in via a new VM_BIND flag, to avoid a scenario where a
malicious app purposely shares an invalid BO with a non-PXP aware
app (such as a compositor). If the VM_BIND was failed, the
compositor would be unable to display anything at all. Allowing the
bind to go through means that output still works, it just displays
garbage data within the bounds of the illegal BO.
2 - at job submission time, if the queue is marked as using PXP, all
objects bound to the VM will be checked and the submission will be
rejected if any of them was encrypted with a key that is no longer
valid.
Note that there is no risk of leaking the encrypted data if a user does
not opt in to those checks; the only consequence is that the user will
not realize that the encryption key is changed and that the data is no
longer valid.
v2: Better comments and descriptions (John), rebase
v3: Properly return the result of key_assign up the stack, do not use
xe_bo in display headers (Jani)
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250129174140.948829-11-daniele.ceraolospurio@intel.com
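The two check points described in this commit can be modelled with a small userspace sketch. Tracking key validity through a `key_instance` counter, and all `demo_*` names and signatures, are assumptions of this example; the commit text only says that a BO encrypted with an old key must be rejected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the instance number changes whenever the key is invalidated. */
struct demo_pxp {
	uint32_t key_instance;
};

struct demo_bo {
	bool uses_pxp;
	uint32_t key_instance;  /* key instance the BO was encrypted with */
};

/* 0 if the BO is unencrypted or uses the current key, -1 otherwise. */
int demo_bo_key_check(const struct demo_pxp *pxp, const struct demo_bo *bo)
{
	if (!bo->uses_pxp)
		return 0;
	return bo->key_instance == pxp->key_instance ? 0 : -1;
}

/* Bind time: the check is opt-in, and unmapping is always allowed. */
int demo_vm_bind(const struct demo_pxp *pxp, const struct demo_bo *bo,
		 bool is_unmap, bool check_pxp)
{
	if (is_unmap || !check_pxp)
		return 0;
	return demo_bo_key_check(pxp, bo);
}

/* Submission time: for a PXP queue, every bound object is checked. */
int demo_exec(const struct demo_pxp *pxp, bool queue_uses_pxp,
	      const struct demo_bo *bound, size_t count)
{
	if (!queue_uses_pxp)
		return 0;
	for (size_t i = 0; i < count; i++) {
		if (demo_bo_key_check(pxp, &bound[i]))
			return -1;
	}
	return 0;
}
```

In this model a compositor that does not pass the opt-in flag can still bind a stale BO, which matches the stated goal of keeping output working even when an invalid BO is shared with a non-PXP-aware application.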
PXP prerequisites (SW proxy and HuC auth via GSC) are completed
asynchronously from driver load, which means that userspace can start
submitting before we're ready to start a PXP session. Therefore, we need
a query that userspace can use to check not only if PXP is supported but
also to wait until the prerequisites are done.
v2: Improve doc, do not report TYPE_NONE as supported (José)
v3: Better comments, remove unneeded copy_from_user (John)
drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues
Userspace is required to mark a queue as using PXP to guarantee that the
PXP instructions will work. In addition to managing the PXP sessions,
when a PXP queue is created the driver will set the relevant bits in
its context control register.
On submission of a valid PXP queue, the driver will validate all
encrypted objects mapped to the VM to ensure they were encrypted with
the current key.
v2: Remove pxp_types include outside of PXP code (Jani), better comments
and code cleanup (John)
v3: split the internal PXP management to a separate patch for ease of
review. re-order ioctl checks to always return -EINVAL if parameters are
invalid, rebase on msix changes.
drm/xe/pxp: Add PXP queue tracking and session start
We expect every queue that uses PXP to be marked as doing so, to allow
the driver to correctly manage the encryption status. The API for doing
this from userspace is coming in the next patch, while this patch
implements the management side of things. When a PXP queue is created,
the driver will do the following:
- Start the default PXP session if it is not already running;
- assign an rpm ref to the queue to keep for its lifetime (this is
required because PXP HWDRM sessions are killed by the HW suspend flow).
Since PXP start and termination can race each other, this patch also
introduces locking and a state machine to keep track of the pending
operations. Note that since we'll need to take the lock from the
suspend/resume paths as well, we can't do submissions while holding it,
which means we need a slightly more complicated state machine to keep
track of intermediate steps.
v4: new patch in the series, split from the following interface patch to
keep review manageable. Lock and status rework to not do submissions
under lock.
drm/xe/pxp: Add GSC session initialization support
A session is initialized (i.e. started) by sending a message to the GSC.
The initialization will be triggered when a user opts in to using PXP;
the interface for that is coming in a follow-up patch in the series.
v2: clean up error messages, use new ARB define (John)
When something happens to the session, the HW generates a termination
interrupt. In reply to this, the driver is required to submit an inline
session termination via the VCS, trigger the global termination and
notify the GSC FW that the session is now invalid.
v2: rename ARB define to make it cleaner to move it to uapi (John)
v3: fix parameter name in documentation
After a session is terminated, we need to inform the GSC so that it can
clean up its side of the allocation. This is done by sending an
invalidation command with the session ID.
The invalidation will be triggered in response to a termination
interrupt, whose handling is coming in the next patch in the series.
The key termination is done with a specific submission to the VCS
engine. This flow will be triggered in response to a termination
interrupt, whose handling is coming in a follow-up patch in the series.
v2: clean up defines and command emission code. (John)
PXP requires submissions to the HW for the following operations:
1) Key invalidation, done via the VCS engine
2) Communication with the GSC FW for session management, done via the
GSCCS.
Key invalidation submissions are serialized (only 1 termination can be
serviced at a given time) and done via GGTT, so we can allocate a simple
BO and a kernel queue for it.
Submissions for session management are tied to a PXP client (identified
by a unique host_session_id); from the GSC POV this is a user-accessible
construct, so all related submissions must be done via PPGTT. The driver
does not currently support PPGTT submission from within the kernel, so
to add this support, the following changes have been included:
- a new type of kernel-owned VM (marked as GSC), required to ensure we
don't use fault mode on the engine and to mark the different lock
usage with lockdep.
- a new function to map a BO into a VM from within the kernel.
v2: improve comments and function name, remove unneeded include (John)
v3: fix variable/function names in documentation
As the first step towards adding PXP support, hook in the PXP init
function, allocate the PXP structure and initialize the KCR register to
allow PXP HWDRM sessions.
v2: remove unneeded includes, free PXP memory on error (John)
Lucas De Marchi [Fri, 31 Jan 2025 22:39:08 +0000 (14:39 -0800)]
drm/xe: Remove xe_dummy_exit()
Since commit 014125c64d09 ("drm/xe: Support 'nomodeset' kernel
command-line option") the dummy exit is not needed anymore since the
caller checks for a NULL pointer. Drop it.
Riana Tauro [Fri, 31 Jan 2025 08:05:27 +0000 (13:35 +0530)]
drm/xe: Skip survivability mode for VF
Follow the probe flow in case of VF and do not enter survivability mode
in case of pcode init failure.
Fixes: 5e940312a2ac ("drm/xe: Add functions and sysfs for boot survivability")
Suggested-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250131080527.2256475-1-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Maarten Lankhorst [Tue, 21 Jan 2025 14:28:50 +0000 (15:28 +0100)]
drm/xe/display: Use a single early init call for display
Now that interrupts are disabled for xe_display_init_noaccel,
both xe_display_init_noirq and xe_display_init_noaccel run in the same
context.
This means that we can get rid of the 3 different init calls. Without
interrupts, nothing is touching display up to this point.
Unify those 3 early display calls into a single xe_display_init_early(),
this makes the init sequence cleaner, and display less tangled during
init.
We're changing the driver to have no interrupts during early init for
Xe, so we poll the PIPE_FRMTMSTMP counter instead.
Interrupts cannot be enabled during FB readout because memirqs require
an allocation. This would overwrite the FB we want to read out.
While it might be possible to do the same in i915 and run
it without interrupts, the platforms i915 supports had a less clear
distinction between display and graphics. For this reason I chose
to touch only Xe for now.
Jakub Kolakowski [Tue, 28 Jan 2025 11:03:00 +0000 (11:03 +0000)]
drm/xe/pf: Add runtime registers for graphics gen >= 30
Add missing runtime registers for graphics versions of 3000 or higher.
This is required for Xe3 where additionally we have
MIRROR_L3BANK_ENABLE register.
Signed-off-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Suggested-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Adam Miszczak <adam.miszczak@linux.intel.com>
Cc: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Cc: Lukasz Laguna <lukasz.laguna@intel.com>
Cc: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Piotr Piorkowski <piotr.piorkowski@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Tested-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128110300.2840596-2-jakub1.kolakowski@intel.com
Michal Wajdeczko [Wed, 29 Jan 2025 19:59:47 +0000 (20:59 +0100)]
drm/xe/pf: Reset GuC VF config when unprovisioning critical resource
GuC firmware counts received VF configuration KLVs and may start
validation of the complete VF config even if some resources were
unprovisioned in the meantime, leading to unexpected errors like:
$ echo 1 | sudo tee /sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/contexts_quota
$ echo 0 | sudo tee /sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/contexts_quota
$ echo 1 | sudo tee /sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/doorbells_quota
$ echo 0 | sudo tee /sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/doorbells_quota
$ echo 1 | sudo tee /sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/ggtt_quota
tee: '/sys/kernel/debug/dri/0000:00:02.0/gt0/vf1/ggtt_quota': Input/output error
To mitigate this problem trigger explicit VF config reset after
unprovisioning any of the critical resources (GGTT, context or
doorbell IDs) that GuC is monitoring.
Michal Wajdeczko [Wed, 29 Jan 2025 19:59:46 +0000 (20:59 +0100)]
drm/xe/pf: Don't send BEGIN_ID if VF has no context/doorbells
It turned out that GuC validates VF configuration immediately
after receiving "some" set of configuration KLVs and complains
if any resource that GuC considers critical is left
unprovisioned, even though the PF should still be allowed to make late VF
config adjustments, since the VF was not yet started.
This issue was discovered after we decided to asynchronously
re-send configuration KLVs after GT reset/resume, as then fair
VF auto-provisioning could already allocate some of the resources,
which was a prerequisite for sending those config KLVs:
Francois Dugast [Wed, 29 Jan 2025 17:52:41 +0000 (18:52 +0100)]
drm/xe/gt_pagefault: Print engine class string
The engine class index which is printed here is an internal representation
for debugging. It is _not_ an index based on DRM_XE_ENGINE_CLASS_* values
provided in the uAPI. Add the string representation of the engine class to
the output in order to limit possible confusion by users when analyzing the
logs.
Lucas De Marchi [Tue, 28 Jan 2025 15:42:42 +0000 (07:42 -0800)]
drm/xe/guc: Fix size_t print format
Use %zx format to print size_t to remove the following warning when
building for i386:
>> drivers/gpu/drm/xe/xe_guc_ct.c:1727:43: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
1727 | drm_printf(p, "[CTB].length: 0x%lx\n", snapshot->ctb_size);
| ~~~ ^~~~~~~~~~~~~~~~~~
| %zx
Cc: José Roberto de Souza <jose.souza@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501281627.H6nj184e-lkp@intel.com/
Fixes: cb1f868ca137 ("drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump")
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128154242.3371687-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
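The same portability rule applies to standard C, where `%zx` is the conversion that matches `size_t` regardless of whether it is 32 or 64 bits wide. A minimal userspace sketch; `demo_format_length()` is a hypothetical helper, with `snprintf()` standing in for drm_printf():

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size_t with %zx. Using %lx instead triggers a -Wformat
 * warning on targets where size_t is not unsigned long.
 */
int demo_format_length(char *out, size_t out_size, size_t ctb_size)
{
	return snprintf(out, out_size, "[CTB].length: 0x%zx\n", ctb_size);
}
```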
DCC in LNL should be disabled. It was a mistake to decide
to go against GuC platform defaults in this case and this
could lead to regressions in some TDP limited scenarios
instead of helping.
Melissa Wen [Tue, 28 Jan 2025 00:41:10 +0000 (21:41 -0300)]
drm/amd/display: restore invalid MSA timing check for freesync
This restores the original behavior that gets min/max freq from EDID and
only set DP/eDP connector as freesync capable if "sink device is capable
of rendering incoming video stream without MSA timing parameters", i.e.,
`allow_invalid_MSA_timing_params` is true. The condition was mistakenly
removed by 0159f88a99c9 ("drm/amd/display: remove redundant freesync
parser for DP").
CC: Mario Limonciello <mario.limonciello@amd.com>
CC: Alex Hung <alex.hung@amd.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3915
Fixes: 0159f88a99c9 ("drm/amd/display: remove redundant freesync parser for DP")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Prike Liang [Tue, 14 Jan 2025 03:20:17 +0000 (11:20 +0800)]
drm/amdkfd: only flush the validate MES contex
The following page fault was observed during the KFD process release.
In this particular error case, the HIP test (./MemcpyPerformance -h)
does not require the queue. As a result, the process_context_addr was
not assigned when the KFD process was released, ultimately leading to
this page fault during the execution of the function
kfd_process_dequeue_from_all_devices().
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Jay Cornwall [Thu, 16 Jan 2025 20:36:39 +0000 (14:36 -0600)]
drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1
The purpose of halt_if_hws_hang is to preserve GPU state for driver
debugging when queue preemption fails. Issuing per-queue reset may
kill wavefronts which caused the preemption failure.
Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.12.x
Riana Tauro [Tue, 28 Jan 2025 09:56:31 +0000 (15:26 +0530)]
drm/xe: Enable Boot Survivability mode
Enable boot survivability mode if pcode initialization fails and
if boot status indicates a failure. In this mode, drm card is not
exposed and driver probe returns success after loading the bare minimum
to allow firmware to be flashed via mei.
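The entry condition stated above reduces to a conjunction of two probe-time facts. A minimal sketch; the struct, its field names and `demo_should_enter_survivability()` are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical summary of what the probe path has learned so far. */
struct demo_probe_state {
	bool pcode_init_failed;    /* pcode initialization did not complete */
	bool boot_status_failure;  /* boot status reports a failure */
};

/* Survivability mode needs both conditions, not just one of them. */
bool demo_should_enter_survivability(const struct demo_probe_state *st)
{
	return st->pcode_init_failed && st->boot_status_failure;
}
```

When the predicate holds, probe would load only the bare minimum and return success instead of failing, so that firmware can still be flashed.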
Riana Tauro [Tue, 28 Jan 2025 09:56:30 +0000 (15:26 +0530)]
drm/xe: Add functions and sysfs for boot survivability
Boot Survivability is a software based workflow for recovering a system
in a failed boot state. Here system recoverability is concerned with
recovering the firmware responsible for boot.
This is implemented by loading the driver with bare minimum (no drm card)
to allow the firmware to be flashed through mei-gsc and collect telemetry.
The driver's probe flow is modified such that it enters survivability mode
when pcode initialization is incomplete and boot status denotes a failure.
In this mode, drm card is not exposed and presence of survivability_mode
entry in PCI sysfs is used to indicate survivability mode and
provide additional information required for debug
This patch adds initialization functions and exposes admin
readable sysfs entries
The new sysfs will have the below layout
/sys/bus/.../bdf
├── survivability_mode
v2: reorder headers
fix doc
remove survivability info and use mode to display information
use separate function for logging survivability information
for critical error (Rodrigo)
v3: use for loop
use dev logs instead of drm
use helper function for aux history(Rodrigo)
remove unnecessary error check of greater than max_scratch
as we are reading only 3 bits
v4: fix checkpatch warnings
fix space (Rodrigo)
rename register
José Roberto de Souza [Thu, 23 Jan 2025 20:22:04 +0000 (12:22 -0800)]
drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump
All other (hwsp, hwctx and vmas) binaries follow this format:
[name].length: 0x1000
[name].data: xxxxxxx
[name].error: errno
The error one is just in case, for some reason, it was not able to
capture the binary.
So the GuC binaries should follow the same pattern.
v2:
- renamed GUC binary to LOG
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-3-jose.souza@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
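A printer that follows the `[name].length` / `[name].data` / `[name].error` layout could be sketched as below. The hex encoding of the data, the choice to emit either the error line or the length/data pair, and `demo_print_binary()` itself are assumptions of this example rather than the driver's implementation.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write "[name].error: errno" if capture failed, otherwise
 * "[name].length: 0x..." followed by "[name].data: ...".
 * Returns the number of characters written, or -1 if out is too small.
 */
int demo_print_binary(char *out, size_t out_size, const char *name,
		      const unsigned char *data, size_t len, int err)
{
	size_t pos;
	int n;

	if (err) {
		n = snprintf(out, out_size, "[%s].error: %d\n", name, err);
		return (n < 0 || (size_t)n >= out_size) ? -1 : n;
	}

	n = snprintf(out, out_size, "[%s].length: 0x%zx\n[%s].data: ",
		     name, len, name);
	if (n < 0 || (size_t)n >= out_size)
		return -1;
	pos = (size_t)n;

	/* Hex is used here only to keep the sketch short. */
	for (size_t i = 0; i < len; i++) {
		n = snprintf(out + pos, out_size - pos, "%02x", data[i]);
		if (n < 0 || (size_t)n >= out_size - pos)
			return -1;
		pos += (size_t)n;
	}

	n = snprintf(out + pos, out_size - pos, "\n");
	if (n < 0 || (size_t)n >= out_size - pos)
		return -1;
	return (int)(pos + (size_t)n);
}
```

Keeping one key per line lets a line-oriented parser handle every binary section with the same code path.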
Lucas De Marchi [Thu, 23 Jan 2025 20:22:03 +0000 (12:22 -0800)]
drm/xe: Fix and re-enable xe_print_blob_ascii85()
Commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool") partially reverted some changes to work around breakage
in mesa tools. However, in doing so it also broke fetching the
GuC log via debugfs since xe_print_blob_ascii85() simply bails out.
The fix is to avoid the extra newlines: the devcoredump interface is
line-oriented and adding random newlines in the middle breaks it. If a
tool is able to parse it by looking at the data and checking for chars
that are out of the ascii85 space, it can still do so. A format change
that breaks the line-oriented output on devcoredump however needs better
coordination with existing tools.
v2: Add suffix description comment
v3: Reword explanation of xe_print_blob_ascii85() calling drm_puts()
in a loop
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Julia Filipchuk <julia.filipchuk@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: stable@vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Fixes: ec1455ce7e35 ("drm/xe/devcoredump: Add ASCII85 dump helper function")
Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-2-jose.souza@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
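To illustrate why avoiding extra newlines keeps the output line-oriented, here is a userspace sketch of ascii85 word encoding that emits a whole blob as a single line. It is modelled on the usual ascii85 scheme (five characters in the range '!'..'u' per 32-bit word, 'z' for an all-zero word). The names encode_word() and encode_blob_line() are hypothetical and this is not the body of xe_print_blob_ascii85().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one 32-bit word. An all-zero word is shortened to 'z'; any other
 * word becomes five base-85 digits offset from '!'. Returns the number of
 * characters written to out.
 */
static size_t encode_word(uint32_t in, char *out)
{
	int i;

	if (in == 0) {
		out[0] = 'z';
		return 1;
	}
	for (i = 4; i >= 0; i--) {
		out[i] = (char)('!' + in % 85);
		in /= 85;
	}
	return 5;
}

/*
 * Encode count words back to back with no newline in between, so the blob
 * stays on one line. out must hold at least 5 * count + 1 bytes.
 * Returns the length of the resulting string.
 */
static size_t encode_blob_line(const uint32_t *words, size_t count, char *out)
{
	size_t i, off = 0;

	assert(words && out);

	for (i = 0; i < count; i++)
		off += encode_word(words[i], out + off);
	out[off] = '\0';
	return off;
}
```

Since '\n' is outside the '!'..'u' range and is never emitted, a line-based reader sees exactly one line per blob once the caller appends a single trailing newline.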
Lucas De Marchi [Thu, 23 Jan 2025 05:11:11 +0000 (21:11 -0800)]
drm/xe/devcoredump: Move exec queue snapshot to Contexts section
Having the exec queue snapshot inside a "GuC CT" section was always
wrong. Commit c28fd6c358db ("drm/xe/devcoredump: Improve section
headings and add tile info") tried to fix that bug, but in doing so also
broke the mesa tool that parses the devcoredump, hence it was reverted
in commit 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa
debug tool").
With the mesa tool also fixed, this can propagate as a fix on both
kernel and userspace side to avoid unnecessary headache for a debug
feature.
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Julia Filipchuk <julia.filipchuk@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: stable@vger.kernel.org
Fixes: 70fb86a85dc9 ("drm/xe: Revert some changes that break a mesa debug tool")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250123051112.1938193-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
John Harrison [Sat, 18 Jan 2025 00:54:03 +0000 (16:54 -0800)]
drm/xe: Upgrade complaint about missing slice info
The steering code needs to know slice/subslice counts and this
information should be retrieved from the hwconfig table. However,
earlier platforms don't have it, hence the KMD has a fallback path.
Newer platforms really should have the entries and if they are missing
that is a bug that needs to be fixed in the table.
So update the complaint to be an error on newer platforms and remove
it completely for older ones that we know are bad (but are not POR for
the Xe driver anyway). Also, re-word the message a little to make it
clearer what the issue is.
Michal Wajdeczko [Sat, 25 Jan 2025 21:55:05 +0000 (22:55 +0100)]
drm/xe/pf: Move VFs reprovisioning to worker
Since the GuC is reset during GT reset, we need to re-send the
entire SR-IOV provisioning configuration to the GuC. But since
this whole configuration is protected by the PF master mutex and
we can't avoid making allocations under this mutex (like during
LMEM provisioning), we can't do this reprovisioning from the gt-reset
path if we want to be reclaim-safe. Move VFs reprovisioning to an
async worker that we will start from the gt-reset path.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250125215505.720-1-michal.wajdeczko@intel.com
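The deferral described in this commit can be pictured with a minimal userspace sketch: the reset path only flags that work is pending, and a separate worker context later does the part that needs the mutex and may allocate. All names here (pf_sketch, reset_path_queue_restart, worker_run) are hypothetical, and the sketch models the idea with a C11 atomic flag; it does not use the kernel workqueue API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PF state touched by reprovisioning. */
struct pf_sketch {
	atomic_bool restart_pending;     /* set by reset path, consumed by worker */
	unsigned int reprovision_count;  /* how many times the config was re-sent */
};

/*
 * Reset path: must stay reclaim-safe, so it takes no mutex and allocates
 * nothing. It only records that reprovisioning is needed.
 */
static void reset_path_queue_restart(struct pf_sketch *pf)
{
	atomic_store(&pf->restart_pending, true);
}

/*
 * Worker context: free to sleep. Returns true if it actually re-sent the
 * configuration, false if nothing was pending.
 */
static bool worker_run(struct pf_sketch *pf)
{
	if (!atomic_exchange(&pf->restart_pending, false))
		return false;

	/* A real worker would take the mutex and push the config here. */
	pf->reprovision_count++;
	return true;
}
```

The atomic exchange makes each reset produce at most one reprovisioning pass, even if the worker is invoked more than once.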
Michal Wajdeczko [Fri, 24 Jan 2025 18:52:47 +0000 (19:52 +0100)]
drm/xe/pf: Use GuC Buffer Cache during policy provisioning
Start using the GuC buffer cache for the SRIOV policy configuration
actions. This is a required step before we can declare the SRIOV
PF reclaim-safe.
Vinay Belgaumkar [Fri, 24 Jan 2025 05:04:11 +0000 (21:04 -0800)]
drm/xe/pmu: Add GT C6 events
Provide a PMU interface for GT C6 residency counters. The interface is
similar to the one available for i915, but gt is passed in the config
when creating the event.
Sample usage and output:
$ perf list | grep gt-c6
xe_0000_00_02.0/gt-c6-residency/ [Kernel PMU event]
==> /sys/bus/event_source/devices/xe_0000_00_02.0/events/gt-c6-residency.unit <==
ms
$ perf stat -e xe_0000_00_02.0/gt-c6-residency,gt=0/ -I1000
# time counts unit events
1.001196056 1,001 ms xe_0000_00_02.0/gt-c6-residency,gt=0/
2.005216219 1,003 ms xe_0000_00_02.0/gt-c6-residency,gt=0/
Lucas De Marchi [Fri, 24 Jan 2025 05:04:10 +0000 (21:04 -0800)]
drm/xe/pmu: Add attribute skeleton
Add the generic support for defining new attributes. This only adds
the macros and common infra for the event counters, but no counters
yet. Those are going to be added in follow-up changes.
Lucas De Marchi [Fri, 24 Jan 2025 05:04:09 +0000 (21:04 -0800)]
drm/xe/pmu: Get/put runtime pm on event init
When the event is created, make sure runtime pm is taken and later put:
in order to read an event counter the GPU needs to remain accessible, and
doing a get/put during perf's read is not possible since it's holding a
raw_spinlock.
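The lifetime rule above can be sketched in userspace as a reference that is taken when the event is created and dropped when it is destroyed, so the read path never has to take it. The types and functions below (device_sketch, event_sketch, event_init, event_read, event_destroy) are hypothetical and only model the idea; they are not the perf or runtime pm APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct device_sketch {
	int pm_refcount;    /* > 0 means the device is kept accessible */
	uint64_t counter;   /* stands in for a hardware event counter */
};

struct event_sketch {
	struct device_sketch *dev;
};

/* Event creation may sleep, so the "get" is done here. */
static void event_init(struct event_sketch *ev, struct device_sketch *dev)
{
	ev->dev = dev;
	dev->pm_refcount++;
}

/* Read path cannot sleep: it relies on the reference taken at init. */
static uint64_t event_read(const struct event_sketch *ev)
{
	assert(ev->dev->pm_refcount > 0);
	return ev->dev->counter;
}

/* Event teardown does the matching "put". */
static void event_destroy(struct event_sketch *ev)
{
	ev->dev->pm_refcount--;
	ev->dev = NULL;
}
```

The cost of this design is that the device stays awake for as long as any event exists, which is the trade-off the commit accepts in exchange for a read path that never sleeps.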
Lucas De Marchi [Fri, 24 Jan 2025 05:04:07 +0000 (21:04 -0800)]
drm/xe/pmu: Assert max gt
XE_PMU_MAX_GT needs to be used due to a circular dependency, but we
should make sure it doesn't go out of sync with the driver's actual
maximum GT count. Add a compile check for that.
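A compile check of this kind can be sketched with C11 static_assert. The macro names and the value 2 below are placeholders chosen for illustration; they are assumptions and not the constants used by the driver.

```c
#include <assert.h>

/* Hypothetical stand-ins: in a real tree these live in different headers. */
#define SKETCH_DRIVER_MAX_GT 2	/* limit owned by the core driver header */
#define SKETCH_PMU_MAX_GT    2	/* duplicate kept by the PMU header */

/* Breaks the build, not the run, if only one of the two is changed. */
static_assert(SKETCH_PMU_MAX_GT == SKETCH_DRIVER_MAX_GT,
	      "PMU GT limit out of sync with driver GT limit");

/* Sized with the PMU copy, as a header that cannot include the other would. */
static unsigned long long sketch_event_count[SKETCH_PMU_MAX_GT];
```

Duplicating the constant sidesteps the circular include, and the assertion turns any future mismatch into a build failure instead of a silent out-of-bounds access.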
Simona Vetter [Fri, 24 Jan 2025 16:06:06 +0000 (17:06 +0100)]
Merge tag 'drm-misc-next-fixes-2025-01-24' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next-fixes for v6.14-rc1:
- Fix a serious regression from commit e4b5ccd392b9 ("drm/v3d: Ensure
job pointer is set to NULL after job completion")
- dmem cgroup Kconfig fix (acked by Tejun)
- virtio: uaf in dma_buf free path
- xlnx: kerneldoc
Aric Cyr [Tue, 10 Dec 2024 23:38:15 +0000 (18:38 -0500)]
drm/amd/display: Optimize cursor position updates
[why]
Updating the cursor enablement register can be a slow operation, and the
cost accumulates when high polling rate cursors cause frequent asynchronous
updates to the cursor position.
[how]
Since the cursor enable bit is cached there is no need to update the
enablement register if there is no change to it. This removes the
read-modify-write from the cursor position programming path in HUBP and
DPP, leaving only the register writes.
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sung Lee <sung.lee@amd.com>
Signed-off-by: Aric Cyr <Aric.Cyr@amd.com>
Signed-off-by: Wayne Lin <wayne.lin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
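The optimization in this commit boils down to skipping a register write when a cached bit has not changed. A minimal sketch, assuming a hypothetical cursor_hw_sketch structure in which write counters stand in for real register accesses:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the cursor state; counters replace real MMIO. */
struct cursor_hw_sketch {
	bool cached_enable;            /* last value written to the enable bit */
	unsigned int enable_writes;    /* writes to the (slow) enable register */
	unsigned int position_writes;  /* writes to the position register */
	uint32_t x, y;
};

static void cursor_set_position(struct cursor_hw_sketch *hw,
				uint32_t x, uint32_t y, bool enable)
{
	/* Touch the enable register only when the cached bit changes. */
	if (hw->cached_enable != enable) {
		hw->cached_enable = enable;
		hw->enable_writes++;
	}

	/* Position is a plain write; no read-modify-write is needed. */
	hw->x = x;
	hw->y = y;
	hw->position_writes++;
}
```

With a cursor that moves many times while staying enabled, only the first call pays for the enable register.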
Aric Cyr [Thu, 9 Jan 2025 20:03:48 +0000 (15:03 -0500)]
drm/amd/display: Add hubp cache reset when powergating
[Why]
When HUBP is power gated, the SW state can get out of sync with the
hardware state causing cursor to not be programmed correctly.
[How]
Similar to DPP, add a HUBP reset function which is called wherever
HUBP is initialized or powergated. This function will clear the cursor
position and attribute cache allowing for proper programming when the
HUBP is brought back up.
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sung Lee <sung.lee@amd.com>
Signed-off-by: Aric Cyr <Aric.Cyr@amd.com>
Signed-off-by: Wayne Lin <wayne.lin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Shaoyun Liu [Tue, 14 Jan 2025 16:57:41 +0000 (11:57 -0500)]
drm/amd/amdgpu: Enable scratch data dump for mes 12
Internally, MES checks the CP_MES_MSCRATCH_LO/HI registers to set the
scratch data location during ucode start, so the driver needs to start
the MES pipes one by one, with a different setting for each pipe.
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Thu, 16 Jan 2025 21:47:11 +0000 (15:47 -0600)]
drm/amd: Clarify kdoc for amdgpu.gttsize
Effectively amdgpu.gttsize gets set to ~1/2 of RAM, but that's controlled
by what the TTM page limit is set to. Clarify the kdoc.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Srinivasan Shanmugam [Mon, 20 Jan 2025 12:27:04 +0000 (17:57 +0530)]
drm/amd/amdgpu: Prevent null pointer dereference in GPU bandwidth calculation
If the parent is NULL, adev->pdev is used to retrieve the PCIe speed and
width, ensuring that the function can still determine these
capabilities from the device itself.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6193 amdgpu_device_gpu_bandwidth()
error: we previously assumed 'parent' could be null (see line 6180)
Fixes: 757e8b951ce2 ("drm/amdgpu: cache gpu pcie link width")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
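The fallback described in this commit follows a common shape: prefer the upstream parent when one exists, otherwise query the device itself. A minimal sketch with hypothetical types (pci_dev_sketch and bandwidth_source are illustration-only names, not amdgpu or PCI core symbols):

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCI device's link capabilities. */
struct pci_dev_sketch {
	int link_speed;
	int link_width;
};

/*
 * Pick the device whose link capabilities should be read: the parent if
 * there is one, else the device itself, so a NULL parent is never
 * dereferenced.
 */
static const struct pci_dev_sketch *
bandwidth_source(const struct pci_dev_sketch *parent,
		 const struct pci_dev_sketch *self)
{
	return parent ? parent : self;
}
```

Selecting the source once, before any field access, keeps every later dereference on a pointer that is known to be valid.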
Srinivasan Shanmugam [Wed, 15 Jan 2025 16:59:06 +0000 (22:29 +0530)]
drm/amd/display: Fix error pointers in amdgpu_dm_crtc_mem_type_changed
The function amdgpu_dm_crtc_mem_type_changed was dereferencing pointers
returned by drm_atomic_get_plane_state without checking for errors. This
could lead to undefined behavior if the function returns an error pointer.
This commit adds checks using IS_ERR to ensure that new_plane_state and
old_plane_state are valid before dereferencing them.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:11486 amdgpu_dm_crtc_mem_type_changed()
error: 'new_plane_state' dereferencing possible ERR_PTR()
Fixes: 4caacd1671b7 ("drm/amd/display: Do not elevate mem_type change to full update")
Cc: Leo Li <sunpeng.li@amd.com>
Cc: Tom Chung <chiahsuan.chung@amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Cc: Roman Li <roman.li@amd.com>
Cc: Alex Hung <alex.hung@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Hamza Mahfooz <hamza.mahfooz@amd.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
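The error-pointer pattern behind this fix can be shown with a userspace analog. The sketch_* helpers below imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idea of encoding a small negative errno in a pointer value; they are re-implementations for illustration, and get_state() and mem_type_changed() are hypothetical. Propagating the errno to the caller is a choice made for this sketch and is not necessarily what the actual patch does on error.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value at the top of the address space. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if p falls in the range reserved for encoded errors. */
static inline bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct plane_state_sketch {
	int mem_type;
};

/* Hypothetical lookup: returns an encoded -EINVAL instead of NULL on failure. */
static struct plane_state_sketch *get_state(struct plane_state_sketch *s, bool fail)
{
	return fail ? sketch_err_ptr(-EINVAL) : s;
}

/* Returns 1 if changed, 0 if not, or a negative errno if a lookup failed. */
static int mem_type_changed(struct plane_state_sketch *new_s,
			    struct plane_state_sketch *old_s)
{
	/* Check both results before touching any field. */
	if (sketch_is_err(new_s))
		return (int)sketch_ptr_err(new_s);
	if (sketch_is_err(old_s))
		return (int)sketch_ptr_err(old_s);

	return new_s->mem_type != old_s->mem_type;
}
```

A NULL check alone would not catch these values, since an encoded error is a non-NULL pointer; that is why the explicit error test is needed before the dereference.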
Lin.Cao [Tue, 14 Jan 2025 09:42:01 +0000 (17:42 +0800)]
drm/amdgpu: fix ring timeout issue in gfx10 sr-iov environment
Commit 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in
amdgpu_job_prepare") sets job->vm to NULL if there is no fence. This
causes the emit of the switch buffer to be skipped when job->vm is NULL.
Checking job rather than vm solves this problem.
Fixes: 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare")
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>