www.infradead.org Git - users/hch/misc.git/commitdiff
drm/sched: Document race condition in drm_sched_fini()
author Philipp Stanner <phasta@kernel.org>
Wed, 13 Aug 2025 08:56:55 +0000 (10:56 +0200)
committer Philipp Stanner <phasta@kernel.org>
Thu, 28 Aug 2025 08:27:18 +0000 (10:27 +0200)
In drm_sched_fini() all entities are marked as stopped - without taking
the appropriate lock, because that would deadlock. That means that
drm_sched_fini() and drm_sched_entity_push_job() can race against each
other.
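
The interleaving can be illustrated with a small userspace model. This is
only a sketch, not kernel code: every name below (race_entity, race_push_check
and so on) is hypothetical, and it assumes that the push path checks the
stopped flag under the entity lock and then acts on that check, while the
teardown path writes the flag without that lock. The interleaving is replayed
deterministically in a single thread instead of relying on real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entity; this is not the kernel's struct. */
struct race_entity {
	bool lock_held;   /* models the entity lock, used by the push side only */
	bool stopped;     /* models the "stopped" flag */
	bool on_runqueue; /* models the entity being queued for scheduling */
};

/* Push side, step 1: take the entity lock and check the flag. */
static bool race_push_check(struct race_entity *e)
{
	assert(e != NULL);
	e->lock_held = true;
	if (e->stopped) {
		e->lock_held = false;
		return false;	/* push rejected */
	}
	return true;		/* caller continues with step 2 */
}

/* Push side, step 2: act on the earlier check, then drop the lock. */
static void race_push_commit(struct race_entity *e)
{
	e->on_runqueue = true;
	e->lock_held = false;
}

/* Teardown side: writes the flag without looking at the entity lock. */
static void race_fini_mark_stopped(struct race_entity *e)
{
	e->stopped = true;
}

/*
 * Replays one bad interleaving. Returns true if the entity ends up both
 * stopped and queued, which is the state the flag is meant to prevent.
 */
static bool race_replay(void)
{
	struct race_entity e = { false, false, false };

	if (!race_push_check(&e))
		return false;
	race_fini_mark_stopped(&e);	/* lands between check and commit */
	race_push_commit(&e);

	return e.stopped && e.on_runqueue;
}
```

In this model the entity lock does not help, because only one of the two
sides respects it. A push that starts after the flag is set is still rejected
as intended.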

This should most likely be fixed by establishing the rule that all
entities associated with a scheduler must be torn down first. Then,
however, the locking should be removed from drm_sched_fini() altogether
with an appropriate comment.
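
As a rough illustration of the proposed lifetime rule, the following
userspace sketch counts the entities attached to a scheduler and lets the
teardown refuse to proceed while any are left. It is not the kernel
implementation, and this commit does not say how the rule would be enforced;
toy_sched, toy_entity and the helper functions are hypothetical names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduler model: only tracks attached entities. */
struct toy_sched {
	int live_entities;	/* entities that have not been torn down yet */
	int finished;		/* set to 1 once teardown has succeeded */
};

/* Hypothetical entity model: remembers which scheduler it belongs to. */
struct toy_entity {
	struct toy_sched *sched;
};

/* Attach an entity to a scheduler. */
static void toy_entity_init(struct toy_entity *e, struct toy_sched *s)
{
	assert(e != NULL && s != NULL);
	e->sched = s;
	s->live_entities++;
}

/* Tear an entity down; safe to call twice on the same entity. */
static void toy_entity_destroy(struct toy_entity *e)
{
	if (e->sched) {
		e->sched->live_entities--;
		e->sched = NULL;
	}
}

/*
 * Tear the scheduler down. Returns 0 on success and -1 if entities are
 * still attached, i.e. if the caller violated the ordering rule. With the
 * rule in place, no entity can be pushing jobs at this point, so there is
 * no flag left to write and no lock needed for it.
 */
static int toy_sched_fini(struct toy_sched *s)
{
	if (s->live_entities != 0)
		return -1;
	s->finished = 1;
	return 0;
}
```

A caller following the rule destroys every entity first and only then calls
toy_sched_fini(). Calling it earlier returns -1 in this model, where a real
driver would presumably warn instead.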

Reported-by: James Flowers <bold.zone2373@fastmail.com>
Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250813085654.102504-2-phasta@kernel.org
drivers/gpu/drm/scheduler/sched_main.c

index 5a550fd76bf011638bacf04c3315c149890d9442..46119aacb809bef086c8803987b9ae3a9b479065 100644
@@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
                         * Prevents reinsertion and marks job_queue as idle,
                         * it will be removed from the rq in drm_sched_entity_fini()
                         * eventually
+                        *
+                        * FIXME:
+                        * This lacks the proper spin_lock(&s_entity->lock) and
+                        * is, therefore, a race condition. Most notably, it
+                        * can race with drm_sched_entity_push_job(). The lock
+                        * cannot be taken here, however, because this would
+                        * lead to lock inversion -> deadlock.
+                        *
+                        * The best solution probably is to enforce the life
+                        * time rule of all entities having to be torn down
+                        * before their scheduler. Then, however, locking could
+                        * be dropped altogether from this function.
+                        *
+                        * For now, this remains a potential race in all
+                        * drivers that keep entities alive for longer than
+                        * the scheduler.
                         */
                        s_entity->stopped = true;
                spin_unlock(&rq->lock);