author    Nicolai Hähnle <nicolai.haehnle@amd.com>    2017-09-28 11:57:32 +0200
committer Alex Deucher <alexander.deucher@amd.com>    2017-10-06 17:44:32 -0400
commit    79867462634836ee5c39a2cdf624719feeb189bd (patch)
tree      6a3cd9acb3a6ffd33751db2a01488847bfc1da5b /drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
parent    drm/amd/sched: NULL out the s_fence field after run_job (diff)
drm/amd/sched: fix deadlock caused by unsignaled fences of deleted jobs
Highly concurrent Piglit runs can trigger a race condition where a pending SDMA job on a buffer object is never executed because the corresponding process is killed (perhaps due to a crash). Since the job's fences were never signaled, the buffer object was effectively leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with kernel stacks like:

[<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
[<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
[<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
[<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
[<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
[<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
[<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
[<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
[<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
[<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
[<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
[<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
[<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
[<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
[<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
[<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
[<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

Note: The correctness of this change depends on the earlier commit "drm/amd/sched: move adding finish callback to amd_sched_job_begin".

v2: set an error on the finished fence

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
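To make the effect of the -ESRCH error concrete, here is a minimal, hypothetical sketch of the waiter side (the helper and its name are made up for illustration; dma_fence_wait_timeout() and dma_fence_get_status() are the existing in-tree dma-fence interfaces). Before this fix, a fence belonging to a killed entity's job was never signaled, so a wait like the one in the TTM eviction path above blocked forever; after the fix, the wait returns and the error set on the finished fence is visible:

#include <linux/dma-fence.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Hypothetical helper, not part of the patch: wait on a job's "finished"
 * fence and report how it ended. With this fix, fences of jobs dropped
 * during entity teardown are signaled with an -ESRCH status. */
static void example_wait_for_finished(struct dma_fence *finished)
{
	long r;

	r = dma_fence_wait_timeout(finished, true, MAX_SCHEDULE_TIMEOUT);
	if (r > 0) {
		int status = dma_fence_get_status(finished);

		if (status < 0)
			pr_debug("job fence signaled with error %d\n", status);
	}
}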
Diffstat (limited to 'drivers/gpu/drm/amd/scheduler/gpu_scheduler.c')
-rw-r--r--  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c  8
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 4693be20e30a..08e1332d814a 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -227,8 +227,14 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
 		 */
 		kthread_park(sched->thread);
 		kthread_unpark(sched->thread);
-		while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
+		while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
+			struct amd_sched_fence *s_fence = job->s_fence;
+			amd_sched_fence_scheduled(s_fence);
+			dma_fence_set_error(&s_fence->finished, -ESRCH);
+			amd_sched_fence_finished(s_fence);
+			dma_fence_put(&s_fence->finished);
 			sched->ops->free_job(job);
+		}
 	}
 	kfifo_free(&entity->job_queue);
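As an interpretive aid, the new teardown loop reads as follows (the code mirrors the hunk above; the comments are not part of the upstream patch):

	while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
		struct amd_sched_fence *s_fence = job->s_fence;

		/* Walk the fence through its normal state transitions even
		 * though the job will never run: first mark it scheduled... */
		amd_sched_fence_scheduled(s_fence);
		/* ...record why it never actually executed (v2 of the patch)... */
		dma_fence_set_error(&s_fence->finished, -ESRCH);
		/* ...and signal "finished" so every waiter, such as the TTM
		 * eviction path in the stack trace above, is woken up. */
		amd_sched_fence_finished(s_fence);
		/* Drop the reference on the finished fence that the normal
		 * completion path would have released, then free the job. */
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}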