Age | Commit message (Collapse) | Author | Files | Lines |
|
We recorded the dependencies for WAIT_FOR_SUBMIT in order that we could
correctly perform priority inheritance from the parallel branches to the
common trunk. However, for the purpose of timeslicing and reset
handling, the dependency is weak -- as we the pair of requests are
allowed to run in parallel and not in strict succession.
The real significance though is that this allows us to rearrange
groups of WAIT_FOR_SUBMIT linked requests along the single engine, and
so can resolve user level inter-batch scheduling dependencies from user
semaphores.
Fixes: c81471f5e95c ("drm/i915: Copy across scheduler behaviour flags across submit fences")
Testcase: igt/gem_exec_fence/submit
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200507155109.8892-1-chris@chris-wilson.co.uk
(cherry picked from commit 6b6cd2ebd8d071e55998e32b648bb8081f7f02bb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
The presumption is that by using a circular counter that is twice as
large as the maximum ELSP submission, we would never reuse the same CCID
for two inflight contexts.
However, if we continually preempt an active context such that it always
remains inflight, it can be resubmitted with an arbitrary number of
paired contexts. As each of its paired contexts will use a new CCID,
eventually it will wrap and submit two ELSP with the same CCID.
Rather than use a simple circular counter, switch over to a small bitmap
of inflight ids so we can avoid reusing one that is still potentially
active.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1796
Fixes: 2935ed5339c4 ("drm/i915: Remove logical HW ID")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.5+
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200428184751.11257-2-chris@chris-wilson.co.uk
(cherry picked from commit 5c4a53e3b1cbc38d0906e382f1037290658759bb)
(cherry picked from commit 134711240307d5586ae8e828d2699db70a8b74f2)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
The bspec is confusing on the nature of the upper 32bits of the LRC
descriptor. Once upon a time, it said that it uses the upper 32b to
decide if it should perform a lite-restore, and so we must ensure that
each unique context submitted to HW is given a unique CCID [for the
duration of it being on the HW]. Currently, this is achieved by using
a small circular tag, and assigning every context submitted to HW a
new id. However, this tag is being cleared on repinning an inflight
context such that we end up re-using the 0 tag for multiple contexts.
To avoid accidentally clearing the CCID in the upper 32bits of the LRC
descriptor, split the descriptor into two dwords so we can update the
GGTT address separately from the CCID.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1796
Fixes: 2935ed5339c4 ("drm/i915: Remove logical HW ID")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.5+
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200428184751.11257-1-chris@chris-wilson.co.uk
(cherry picked from commit 2632f174a2e1a5fd40a70404fa8ccfd0b1f79ebd)
(cherry picked from commit a4b70fcc587860f4b972f68217d8ebebe295ec15)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
If we find ourselves waiting on a MI_SEMAPHORE_WAIT, either within the
user batch or in our own preamble, the engine raises a
GT_WAIT_ON_SEMAPHORE interrupt. We can unmask that interrupt and so
respond to a semaphore wait by yielding the timeslice, if we have
another context to yield to!
The only real complication is that the interrupt is only generated for
the start of the semaphore wait, and is asynchronous to our
process_csb() -- that is, we may not have registered the timeslice before
we see the interrupt. To ensure we don't miss a potential semaphore
blocking forward progress (e.g. selftests/live_timeslice_preempt) we mark
the interrupt and apply it to the next timeslice regardless of whether it
was active at the time.
v2: We use semaphores in preempt-to-busy, within the timeslicing
implementation itself! Ergo, when we do insert a preemption due to an
expired timeslice, the new context may start with the missed semaphore
flagged by the retired context and be yielded, ad infinitum. To avoid
this, read the context id at the time of the semaphore interrupt and
only yield if that context is still active.
Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200407130811.17321-1-chris@chris-wilson.co.uk
(cherry picked from commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6)
(cherry picked from commit cd60e4ac4738a6921592c4f7baf87f9a3499f0e2)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
In order to allow userspace to rely on timeslicing to reorder their
batches, we must support preemption of those user batches. Declare
timeslicing as an explicit property that is a combination of having the
kernel support and HW support.
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200501122249.12417-1-chris@chris-wilson.co.uk
(cherry picked from commit a211da9c771bf97395a3ced83a3aa383372b13a7)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
We move the virtual breadcrumb from one physical engine to the next, if
the next virtual request is scheduled on a new physical engine. Since
the virtual context can only be in one signal queue, we need it to track
the current physical engine for the new breadcrumbs. However, to move
the list we need both breadcrumb locks -- and since we cannot take both
at the same time (unless we are careful and always ensure consistent
ordering) stage the movement of the signaler via the current virtual
request.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1510
Fixes: 6d06779e8672 ("drm/i915: Load balancing across a virtual engine")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200325130059.30600-1-chris@chris-wilson.co.uk
(cherry picked from commit 6c81e21a4742385c00713137c6fdcade0412e93c)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
This reverts commit 36a6b5d964d995b536b1925ec42052ee40ba92c4.
The commit takes care Wa_1604544889 which was fixed on a0 stepping based on
a0 replan. So no SW workaround is required on any stepping now.
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Caz Yokoyama <caz.yokoyama@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Fixes: 36a6b5d964d9 ("drm/i915/tgl: Add extra hdc flush workaround")
Link: https://patchwork.freedesktop.org/patch/msgid/1c751032ce79c80c5485cae315f1a9904ce07cac.1583359940.git.caz.yokoyama@intel.com
|
|
Record the initial active element we use when building the next ELSP
submission, so that we can compare against it latter to see if there's
no change.
Fixes: 44d0a9c05bc0 ("drm/i915/execlists: Skip redundant resubmission")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200311092624.10012-2-chris@chris-wilson.co.uk
|
|
The virtual engine passes tokens back and forth to its backing physical
engines.
[ 57.372993] BUG: KCSAN: data-race in execlists_dequeue [i915] / virtual_submission_tasklet [i915]
[ 57.373012]
[ 57.373023] write to 0xffff8881f47324c0 of 4 bytes by interrupt on cpu 2:
[ 57.373241] execlists_dequeue+0x6fa/0x2150 [i915]
[ 57.373458] __execlists_submission_tasklet+0x48/0x60 [i915]
[ 57.373677] execlists_submission_tasklet+0xd3/0x170 [i915]
[ 57.373694] tasklet_action_common.isra.0+0x42/0xa0
[ 57.373709] __do_softirq+0xd7/0x2cd
[ 57.373723] irq_exit+0xbe/0xe0
[ 57.373735] do_IRQ+0x51/0x100
[ 57.373748] ret_from_intr+0x0/0x1c
[ 57.373963] engine_retire+0x89/0xe0 [i915]
[ 57.373977] process_one_work+0x3b1/0x690
[ 57.373990] worker_thread+0x80/0x670
[ 57.374004] kthread+0x19a/0x1e0
[ 57.374017] ret_from_fork+0x1f/0x30
[ 57.374027]
[ 57.374038] read to 0xffff8881f47324c0 of 4 bytes by interrupt on cpu 3:
[ 57.374256] virtual_submission_tasklet+0x27/0x5a0 [i915]
[ 57.374273] tasklet_action_common.isra.0+0x42/0xa0
[ 57.374288] __do_softirq+0xd7/0x2cd
[ 57.374302] run_ksoftirqd+0x15/0x20
[ 57.374315] smpboot_thread_fn+0x1ab/0x300
[ 57.374329] kthread+0x19a/0x1e0
[ 57.374342] ret_from_fork+0x1f/0x30
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200310141320.24149-3-chris@chris-wilson.co.uk
|
|
[ 206.875637] BUG: KCSAN: data-race in __i915_schedule+0x7fc/0x930 [i915]
[ 206.875654]
[ 206.875666] race at unknown origin, with read to 0xffff8881f7644480 of 8 bytes by task 703 on cpu 3:
[ 206.875901] __i915_schedule+0x7fc/0x930 [i915]
[ 206.876130] __bump_priority+0x63/0x80 [i915]
[ 206.876361] __i915_sched_node_add_dependency+0x258/0x300 [i915]
[ 206.876593] i915_sched_node_add_dependency+0x50/0xa0 [i915]
[ 206.876824] i915_request_await_dma_fence+0x1da/0x530 [i915]
[ 206.877057] i915_request_await_object+0x2fe/0x470 [i915]
[ 206.877287] i915_gem_do_execbuffer+0x45dc/0x4c20 [i915]
[ 206.877517] i915_gem_execbuffer2_ioctl+0x2c3/0x580 [i915]
[ 206.877535] drm_ioctl_kernel+0xe4/0x120
[ 206.877549] drm_ioctl+0x297/0x4c7
[ 206.877563] ksys_ioctl+0x89/0xb0
[ 206.877577] __x64_sys_ioctl+0x42/0x60
[ 206.877591] do_syscall_64+0x6e/0x2c0
[ 206.877606] entry_SYSCALL_64_after_hwframe+0x44/0xa9
v2: Be safe and include mb
References: https://gitlab.freedesktop.org/drm/intel/issues/1318
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309170540.10332-1-chris@chris-wilson.co.uk
|
|
[ 120.176548] BUG: KCSAN: data-race in __i915_schedule [i915] / effective_prio [i915]
[ 120.176566]
[ 120.176577] write to 0xffff8881e35e6540 of 4 bytes by task 730 on cpu 3:
[ 120.176792] __i915_schedule+0x63e/0x920 [i915]
[ 120.177007] __bump_priority+0x63/0x80 [i915]
[ 120.177220] __i915_sched_node_add_dependency+0x258/0x300 [i915]
[ 120.177438] i915_sched_node_add_dependency+0x50/0xa0 [i915]
[ 120.177654] i915_request_await_dma_fence+0x1da/0x530 [i915]
[ 120.177867] i915_request_await_object+0x2fe/0x470 [i915]
[ 120.178081] i915_gem_do_execbuffer+0x45dc/0x4c20 [i915]
[ 120.178292] i915_gem_execbuffer2_ioctl+0x2c3/0x580 [i915]
[ 120.178309] drm_ioctl_kernel+0xe4/0x120
[ 120.178322] drm_ioctl+0x297/0x4c7
[ 120.178335] ksys_ioctl+0x89/0xb0
[ 120.178348] __x64_sys_ioctl+0x42/0x60
[ 120.178361] do_syscall_64+0x6e/0x2c0
[ 120.178375] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 120.178387]
[ 120.178397] read to 0xffff8881e35e6540 of 4 bytes by interrupt on cpu 2:
[ 120.178606] effective_prio+0x25/0xc0 [i915]
[ 120.178812] process_csb+0xe8b/0x10a0 [i915]
[ 120.179021] execlists_submission_tasklet+0x30/0x170 [i915]
[ 120.179038] tasklet_action_common.isra.0+0x42/0xa0
[ 120.179053] __do_softirq+0xd7/0x2cd
[ 120.179066] irq_exit+0xbe/0xe0
[ 120.179078] do_IRQ+0x51/0x100
[ 120.179090] ret_from_intr+0x0/0x1c
[ 120.179104] cpuidle_enter_state+0x1b8/0x5d0
[ 120.179117] cpuidle_enter+0x50/0x90
[ 120.179131] do_idle+0x1a1/0x1f0
[ 120.179145] cpu_startup_entry+0x14/0x16
[ 120.179158] start_secondary+0x120/0x180
[ 120.179172] secondary_startup_64+0xa4/0xb0
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309110934.868-5-chris@chris-wilson.co.uk
|
|
[ 145.927961] BUG: KCSAN: data-race in can_merge_rq [i915] / signal_irq_work [i915]
[ 145.927980]
[ 145.927992] write (marked) to 0xffff8881e513fab0 of 8 bytes by interrupt on cpu 2:
[ 145.928250] signal_irq_work+0x134/0x640 [i915]
[ 145.928268] irq_work_run_list+0xd7/0x120
[ 145.928283] irq_work_run+0x1d/0x50
[ 145.928300] smp_irq_work_interrupt+0x21/0x30
[ 145.928328] irq_work_interrupt+0xf/0x20
[ 145.928356] _raw_spin_unlock_irqrestore+0x34/0x40
[ 145.928596] execlists_submission_tasklet+0xde/0x170 [i915]
[ 145.928616] tasklet_action_common.isra.0+0x42/0xa0
[ 145.928632] __do_softirq+0xd7/0x2cd
[ 145.928646] irq_exit+0xbe/0xe0
[ 145.928665] do_IRQ+0x51/0x100
[ 145.928684] ret_from_intr+0x0/0x1c
[ 145.928699] schedule+0x0/0xb0
[ 145.928719] worker_thread+0x194/0x670
[ 145.928743] kthread+0x19a/0x1e0
[ 145.928765] ret_from_fork+0x1f/0x30
[ 145.928784]
[ 145.928796] read to 0xffff8881e513fab0 of 8 bytes by task 738 on cpu 1:
[ 145.929046] can_merge_rq+0xb1/0x100 [i915]
[ 145.929282] __execlists_submission_tasklet+0x866/0x25a0 [i915]
[ 145.929518] execlists_submit_request+0x2a4/0x2b0 [i915]
[ 145.929758] submit_notify+0x8f/0xc0 [i915]
[ 145.929989] __i915_sw_fence_complete+0x5d/0x3e0 [i915]
[ 145.930221] i915_sw_fence_complete+0x58/0x80 [i915]
[ 145.930453] i915_sw_fence_commit+0x16/0x20 [i915]
[ 145.930698] __i915_request_queue+0x60/0x70 [i915]
[ 145.930935] i915_gem_do_execbuffer+0x3997/0x4c20 [i915]
[ 145.931175] i915_gem_execbuffer2_ioctl+0x2c3/0x580 [i915]
[ 145.931194] drm_ioctl_kernel+0xe4/0x120
[ 145.931208] drm_ioctl+0x297/0x4c7
[ 145.931222] ksys_ioctl+0x89/0xb0
[ 145.931238] __x64_sys_ioctl+0x42/0x60
[ 145.931260] do_syscall_64+0x6e/0x2c0
[ 145.931275] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309110934.868-4-chris@chris-wilson.co.uk
|
|
[ 25.025543] BUG: KCSAN: data-race in __i915_request_create [i915] / process_csb [i915]
[ 25.025561]
[ 25.025573] write (marked) to 0xffff8881e85c1620 of 8 bytes by task 696 on cpu 1:
[ 25.025789] __i915_request_create+0x54b/0x5d0 [i915]
[ 25.026001] i915_request_create+0xcc/0x150 [i915]
[ 25.026218] i915_gem_do_execbuffer+0x2f70/0x4c20 [i915]
[ 25.026428] i915_gem_execbuffer2_ioctl+0x2c3/0x580 [i915]
[ 25.026445] drm_ioctl_kernel+0xe4/0x120
[ 25.026459] drm_ioctl+0x297/0x4c7
[ 25.026472] ksys_ioctl+0x89/0xb0
[ 25.026484] __x64_sys_ioctl+0x42/0x60
[ 25.026497] do_syscall_64+0x6e/0x2c0
[ 25.026510] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 25.026522]
[ 25.026532] read to 0xffff8881e85c1620 of 8 bytes by interrupt on cpu 2:
[ 25.026742] process_csb+0x8d6/0x1070 [i915]
[ 25.026949] execlists_submission_tasklet+0x30/0x170 [i915]
[ 25.026969] tasklet_action_common.isra.0+0x42/0xa0
[ 25.026984] __do_softirq+0xd7/0x2cd
[ 25.026997] irq_exit+0xbe/0xe0
[ 25.027009] do_IRQ+0x51/0x100
[ 25.027021] ret_from_intr+0x0/0x1c
[ 25.027033] poll_idle+0x3e/0x13b
[ 25.027047] cpuidle_enter_state+0x189/0x5d0
[ 25.027060] cpuidle_enter+0x50/0x90
[ 25.027074] do_idle+0x1a1/0x1f0
[ 25.027086] cpu_startup_entry+0x14/0x16
[ 25.027100] start_secondary+0x120/0x180
[ 25.027116] secondary_startup_64+0xa4/0xb0
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309110934.868-2-chris@chris-wilson.co.uk
|
|
[ 7534.150687] BUG: KCSAN: data-race in __execlists_submission_tasklet [i915] / process_csb [i915]
[ 7534.150706]
[ 7534.150717] write to 0xffff8881f1bc24b4 of 4 bytes by task 24404 on cpu 3:
[ 7534.150925] __execlists_submission_tasklet+0x1158/0x2780 [i915]
[ 7534.151133] execlists_submit_request+0x2e8/0x2f0 [i915]
[ 7534.151348] submit_notify+0x8f/0xc0 [i915]
[ 7534.151549] __i915_sw_fence_complete+0x5d/0x3e0 [i915]
[ 7534.151753] i915_sw_fence_complete+0x58/0x80 [i915]
[ 7534.151963] i915_sw_fence_commit+0x16/0x20 [i915]
[ 7534.152179] __i915_request_queue+0x60/0x70 [i915]
[ 7534.152388] i915_gem_do_execbuffer+0x3997/0x4c20 [i915]
[ 7534.152598] i915_gem_execbuffer2_ioctl+0x2c3/0x580 [i915]
[ 7534.152615] drm_ioctl_kernel+0xe4/0x120
[ 7534.152629] drm_ioctl+0x297/0x4c7
[ 7534.152642] ksys_ioctl+0x89/0xb0
[ 7534.152654] __x64_sys_ioctl+0x42/0x60
[ 7534.152667] do_syscall_64+0x6e/0x2c0
[ 7534.152681] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 7534.152693]
[ 7534.152703] read to 0xffff8881f1bc24b4 of 4 bytes by interrupt on cpu 2:
[ 7534.152914] process_csb+0xe7c/0x10a0 [i915]
[ 7534.153120] execlists_submission_tasklet+0x30/0x170 [i915]
[ 7534.153138] tasklet_action_common.isra.0+0x42/0xa0
[ 7534.153153] __do_softirq+0xd7/0x2cd
[ 7534.153166] run_ksoftirqd+0x15/0x20
[ 7534.153180] smpboot_thread_fn+0x1ab/0x300
[ 7534.153194] kthread+0x19a/0x1e0
[ 7534.153207] ret_from_fork+0x1f/0x30
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200309144249.10309-1-chris@chris-wilson.co.uk
|
|
If we stop filling the ELSP due to an incompatible virtual engine
request, check if we should enable the timeslice on behalf of the queue.
This fixes the case where we are inspecting the last->next element when
we know that the last element is the last request in the execution queue,
and so decided we did not need to enable timeslicing despite the intent
to do so!
Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200306113012.3184606-1-chris@chris-wilson.co.uk
|
|
Check the flow of requests into the hardware to verify that are
submitted in order along their timeline.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200306071614.2846708-1-chris@chris-wilson.co.uk
|
|
Show the timeslicing priority hint in engine dumps to aide debugging.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200305135843.2760512-1-chris@chris-wilson.co.uk
|
|
As we release the head requests back into the queue, propagate any
change in error status that may have occurred while the requests were
temporarily suspended.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1277
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200304121849.2448028-2-chris@chris-wilson.co.uk
|
|
Trying to use i915_request_skip() prior to i915_request_add() causes us
to try and fill the ring upto request->postfix, which has not yet been
set, and so may cause us to memset() past the end of the ring.
Instead of skipping the request immediately, just flag the error on the
request (only accepting the first fatal error we see) and then clear the
request upon submission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200304121849.2448028-1-chris@chris-wilson.co.uk
|
|
We only use sentinel requests for "preempt-to-idle" passes, so assert
that they are the only request in a new submission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200302085812.4172450-12-chris@chris-wilson.co.uk
|
|
An odd and highly unlikely path caught us out. On delayed submission
(due to an asynchronous reset handler), we poked the priority_hint and
kicked the tasklet. However, we had already marked the device as wedged
and swapped out the tasklet for a no-op. The result was that we never
cleared the priority hint and became upset when we later checked.
<0> [574.303565] i915_sel-6278 2.... 481822445us : __i915_subtests: Running intel_execlists_live_selftests/live_error_interrupt
<0> [574.303565] i915_sel-6278 2.... 481822472us : __engine_unpark: 0000:00:02.0 rcs0:
<0> [574.303565] i915_sel-6278 2.... 481822491us : __gt_unpark: 0000:00:02.0
<0> [574.303565] i915_sel-6278 2.... 481823220us : execlists_context_reset: 0000:00:02.0 rcs0: context:f4ee reset
<0> [574.303565] i915_sel-6278 2.... 481824830us : __intel_context_active: 0000:00:02.0 rcs0: context:f51b active
<0> [574.303565] i915_sel-6278 2.... 481825258us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51b pin ring:{start:00006000, head:0000, tail:0000}
<0> [574.303565] i915_sel-6278 2.... 481825311us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51b:2, current 0
<0> [574.303565] i915_sel-6278 2d..1 481825347us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51b:2, current 0
<0> [574.303565] i915_sel-6278 2d..1 481825363us : trace_ports: 0000:00:02.0 rcs0: submit { f51b:2, 0:0 }
<0> [574.303565] i915_sel-6278 2.... 481826809us : __intel_context_active: 0000:00:02.0 rcs0: context:f51c active
<0> [574.303565] <idle>-0 7d.h2 481827326us : cs_irq_handler: 0000:00:02.0 rcs0: CS error: 1
<0> [574.303565] <idle>-0 7..s1 481827377us : process_csb: 0000:00:02.0 rcs0: cs-irq head=3, tail=4
<0> [574.303565] <idle>-0 7..s1 481827379us : process_csb: 0000:00:02.0 rcs0: csb[4]: status=0x10000001:0x00000000
<0> [574.305593] <idle>-0 7..s1 481827385us : trace_ports: 0000:00:02.0 rcs0: promote { f51b:2*, 0:0 }
<0> [574.305611] <idle>-0 7..s1 481828179us : execlists_reset: 0000:00:02.0 rcs0: reset for CS error
<0> [574.305611] i915_sel-6278 2.... 481828284us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51c pin ring:{start:00007000, head:0000, tail:0000}
<0> [574.305611] i915_sel-6278 2.... 481828345us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51c:2, current 0
<0> [574.305611] <idle>-0 7dNs2 481847823us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence f51b:2, current 1
<0> [574.305611] <idle>-0 7dNs2 481847857us : execlists_hold: 0000:00:02.0 rcs0: fence f51b:2, current 1 on hold
<0> [574.305611] <idle>-0 7.Ns1 481847863us : intel_engine_reset: 0000:00:02.0 rcs0: flags=4
<0> [574.305611] <idle>-0 7.Ns1 481847945us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-1
<0> [574.305611] <idle>-0 7.Ns1 481847946us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
<0> [574.305611] <idle>-0 7.Ns1 538584284us : intel_engine_stop_cs: 0000:00:02.0 rcs0: timed out on STOP_RING -> IDLE
<0> [574.305611] <idle>-0 7.Ns1 538584347us : __intel_gt_reset: 0000:00:02.0 engine_mask=1
<0> [574.305611] <idle>-0 7.Ns1 538584406us : execlists_reset_rewind: 0000:00:02.0 rcs0:
<0> [574.305611] <idle>-0 7dNs2 538585050us : __i915_request_reset: 0000:00:02.0 rcs0: fence f51b:2, current 1 guilty? yes
<0> [574.305611] <idle>-0 7dNs2 538585063us : __execlists_reset: 0000:00:02.0 rcs0: replay {head:0000, tail:0068}
<0> [574.306565] <idle>-0 7.Ns1 538588457us : intel_engine_cancel_stop_cs: 0000:00:02.0 rcs0:
<0> [574.306565] <idle>-0 7dNs2 538588462us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51c:2, current 0
<0> [574.306565] <idle>-0 7dNs2 538588471us : trace_ports: 0000:00:02.0 rcs0: submit { f51c:2, 0:0 }
<0> [574.306565] <idle>-0 7.Ns1 538588474us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->1
<0> [574.306565] kworker/-202 2.... 538588755us : i915_request_retire: 0000:00:02.0 rcs0: fence f51c:2, current 2
<0> [574.306565] ksoftirq-46 7..s. 538588773us : process_csb: 0000:00:02.0 rcs0: cs-irq head=11, tail=1
<0> [574.306565] ksoftirq-46 7..s. 538588774us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x10000001:0x00000000
<0> [574.306565] ksoftirq-46 7..s. 538588776us : trace_ports: 0000:00:02.0 rcs0: promote { f51c:2!, 0:0 }
<0> [574.306565] ksoftirq-46 7..s. 538588778us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x10000018:0x00000020
<0> [574.306565] ksoftirq-46 7..s. 538588779us : trace_ports: 0000:00:02.0 rcs0: completed { f51c:2!, 0:0 }
<0> [574.306565] kworker/-202 2.... 538588826us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51c unpin
<0> [574.306565] i915_sel-6278 6.... 538589663us : __intel_gt_set_wedged.part.32: 0000:00:02.0 start
<0> [574.306565] i915_sel-6278 6.... 538589667us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-0
<0> [574.306565] i915_sel-6278 6.... 538589710us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
<0> [574.306565] i915_sel-6278 6.... 538589732us : execlists_reset_prepare: 0000:00:02.0 bcs0: depth<-0
<0> [574.307591] i915_sel-6278 6.... 538589733us : intel_engine_stop_cs: 0000:00:02.0 bcs0:
<0> [574.307591] i915_sel-6278 6.... 538589757us : execlists_reset_prepare: 0000:00:02.0 vcs0: depth<-0
<0> [574.307591] i915_sel-6278 6.... 538589758us : intel_engine_stop_cs: 0000:00:02.0 vcs0:
<0> [574.307591] i915_sel-6278 6.... 538589771us : execlists_reset_prepare: 0000:00:02.0 vcs1: depth<-0
<0> [574.307591] i915_sel-6278 6.... 538589772us : intel_engine_stop_cs: 0000:00:02.0 vcs1:
<0> [574.307591] i915_sel-6278 6.... 538589778us : execlists_reset_prepare: 0000:00:02.0 vecs0: depth<-0
<0> [574.307591] i915_sel-6278 6.... 538589780us : intel_engine_stop_cs: 0000:00:02.0 vecs0:
<0> [574.307591] i915_sel-6278 6.... 538589786us : __intel_gt_reset: 0000:00:02.0 engine_mask=ff
<0> [574.307591] i915_sel-6278 6.... 538591175us : execlists_reset_cancel: 0000:00:02.0 rcs0:
<0> [574.307591] i915_sel-6278 6.... 538591970us : execlists_reset_cancel: 0000:00:02.0 bcs0:
<0> [574.307591] i915_sel-6278 6.... 538591982us : execlists_reset_cancel: 0000:00:02.0 vcs0:
<0> [574.307591] i915_sel-6278 6.... 538591996us : execlists_reset_cancel: 0000:00:02.0 vcs1:
<0> [574.307591] i915_sel-6278 6.... 538592759us : execlists_reset_cancel: 0000:00:02.0 vecs0:
<0> [574.307591] i915_sel-6278 6.... 538592977us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->0
<0> [574.307591] i915_sel-6278 6.N.. 538592996us : execlists_reset_finish: 0000:00:02.0 bcs0: depth->0
<0> [574.307591] i915_sel-6278 6.N.. 538593023us : execlists_reset_finish: 0000:00:02.0 vcs0: depth->0
<0> [574.307591] i915_sel-6278 6.N.. 538593037us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->0
<0> [574.307591] i915_sel-6278 6.N.. 538593051us : execlists_reset_finish: 0000:00:02.0 vecs0: depth->0
<0> [574.307591] i915_sel-6278 6.... 538593407us : __intel_gt_set_wedged.part.32: 0000:00:02.0 end
<0> [574.307591] kworker/-210 7d..1 551958381us : execlists_unhold: 0000:00:02.0 rcs0: fence f51b:2, current 2 hold release
<0> [574.307591] i915_sel-6278 0.... 559490788us : i915_request_retire: 0000:00:02.0 rcs0: fence f51b:2, current 2
<0> [574.307591] i915_sel-6278 0.... 559490793us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51b unpin
<0> [574.307591] i915_sel-6278 0.... 559490798us : __engine_park: 0000:00:02.0 rcs0: parked
<0> [574.307591] i915_sel-6278 0.... 559490982us : __intel_context_retire: 0000:00:02.0 rcs0: context:f51c retire runtime: { total:30004ns, avg:30004ns }
<0> [574.307591] i915_sel-6278 0.... 559491372us : __engine_park: __engine_park:261 GEM_BUG_ON(engine->execlists.queue_priority_hint != (-((int)(~0U >> 1)) - 1))
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200227085723.1961649-9-chris@chris-wilson.co.uk
|
|
As we drop the engine-pm on retiring, that may happen while there are
still CS events in the buffer. As such we cannot assert the engine is
still active on reset, until we know that the current request is still
in flight.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1338
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200227204727.2009346-1-chris@chris-wilson.co.uk
|
|
No good reason why we must always use a static ringsize, so let
userspace select one during construction.
Link: https://github.com/intel/compute-runtime/pull/261
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Steve Carbonari <steven.carbonari@intel.com>
Reviewed-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200225192206.1107336-2-chris@chris-wilson.co.uk
|
|
While we know that the waiters cannot disappear as we walk our list
(only that they might be added), the same cannot be said for our
signalers as they may be completed by the HW and retired as we process
this request. Ergo we need to use rcu to protect the list iteration and
remember to mark up the list_del_rcu.
v2: Mark the deps as safe-for-rcu
Fixes: 793c22617367 ("drm/i915/gt: Protect execlists_hold/unhold from new waiters")
Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200220075025.1539375-1-chris@chris-wilson.co.uk
|
|
Without selftests enabled, I915_SELFTEST_ONLY becomes a dummy,
generating a bare '0'. This causes the compiler to complain about a
useless line, and while we could use I915_SELFTEST_DECLARE instead, it
is a bit messier. Move the selftest-only code to a helper and make that
conditional on having selftests enabled.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200217095835.599827-1-chris@chris-wilson.co.uk
|
|
GPU saves accumulated context runtime (in CS timestamp units) in PPHWSP
which will be useful for us in cases when we are not able to track context
busyness ourselves (like with GuC). Keep a copy of this in struct
intel_context from where it can be easily read even if the context is not
pinned.
v2:
(Chris)
* Do not store pphwsp address in intel_context.
* Log CS wrap-around.
* Simplify calculation by relying on integer wraparound.
v3:
* Include total/avg in traces and error state for debugging
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200216133620.394962-1-chris@chris-wilson.co.uk
|
|
With debugging turned off, we have to tell the compiler not to warn
about the unused debug locals.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200213081217.3107410-1-chris@chris-wilson.co.uk
|
|
Show the ring/request/context state if we see what we believe is an
early CS completion.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200211230944.1203098-1-chris@chris-wilson.co.uk
|
|
Currently on execlists, we use a local hwsp for the kernel_context,
rather than the engine's HWSP, as this is the default for execlists.
However, seqno wrap requires allocating a new HWSP cacheline, and may
require pinning a new HWSP page in the GGTT. This operation requiring
pinning in the GGTT is not allowed within the kernel_context timeline,
as doing so may require re-entering the kernel_context in order to evict
from the GGTT. As we want to avoid requiring a new HWSP for the
kernel_context, we can use the permanently pinned engine's HWSP instead.
However to do so we must prevent the use of semaphores reading the
kernel_context's HWSP, as the use of semaphores do not support rollover
onto the same cacheline. Fortunately, the kernel_context is mostly
isolated, so unlikely to give benefit to semaphores.
Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200210205722.794180-5-chris@chris-wilson.co.uk
|
|
We manipulate ring->head while active in i915_request_retire underneath
the timeline manipulation. We cannot rely on a stable ring->head outside
of the timeline->mutex, in particular while setting up the context for
resume and reset.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1126
Fixes: 0881954965e3 ("drm/i915: Introduce intel_context.pin_mutex for pin management")
Fixes: e5dadff4b093 ("drm/i915: Protect request retirement with timeline->mutex")
References: f3c0efc9fe7a ("drm/i915/execlists: Leave resetting ring to intel_ring")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Andi Shyti <andi.shyti@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200211120131.958949-1-chris@chris-wilson.co.uk
|
|
live_preempt_hang's use of hang injection has been superseded by
live_preempt_reset's use of an non-preemptible spinner. The latter does
not require intrusive hacks into the code.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200209230838.361154-2-chris@chris-wilson.co.uk
|
|
Recording the frequent inspection of CSB head/tail when there is
expected to be no update adds noise to the debug trace. (Not entirely
useless, but since we know the sequence of function calls, we can
surmise the function was called -- so redundant.)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200209131922.180287-2-chris@chris-wilson.co.uk
|
|
In eliminating the recursion from walking the tree of signalers/waiters
for processing the hold/unhold operations, a crucial error crept in
where we looked at the parent request and not the list element when
processing the list.
Brown paper bag, much?
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1166
Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200209131922.180287-1-chris@chris-wilson.co.uk
|
|
We have switched from tail manipulation to forced context restore
to implement WaIdleLiteRestore. Remove the old defines and comments.
Note: we still do emit the WA tail, and use it as our first attempt to
avoid forcing a full-restore instead of a lite-restore, we just have a
much stronger backup mechanism for repeated preemptions.
References: f26a9e959a7b ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200203163312.15475-1-mika.kuoppala@linux.intel.com
|
|
Moving the base forward since this one was so old.
New base contains fixes that we needed.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
If we rewind the RING_TAIL on a context, due to a preemption event, we
must force the context restore for the RING_TAIL update to be properly
handled. Rather than note which preemption events may cause us to rewind
the tail, compare the new request's tail with the previously submitted
RING_TAIL, as it turns out that timeslicing was causing unexpected
rewinds.
<idle>-0 0d.s2 1280851190us : __execlists_submission_tasklet: 0000:00:02.0 rcs0: expired last=130:4698, prio=3, hint=3
<idle>-0 0d.s2 1280851192us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 66:119966, current 119964
<idle>-0 0d.s2 1280851195us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4698, current 4695
<idle>-0 0d.s2 1280851198us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
^---- Note we unwind 2 requests from the same context
<idle>-0 0d.s2 1280851208us : __i915_request_submit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
<idle>-0 0d.s2 1280851213us : __i915_request_submit: 0000:00:02.0 rcs0: fence 134:1508, current 1506
^---- But to apply the new timeslice, we have to replay the first request
before the new client can start -- the unexpected RING_TAIL rewind
<idle>-0 0d.s2 1280851219us : trace_ports: 0000:00:02.0 rcs0: submit { 130:4696*, 134:1508 }
synmark2-5425 2..s. 1280851239us : process_csb: 0000:00:02.0 rcs0: cs-irq head=5, tail=0
synmark2-5425 2..s. 1280851240us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x00008002:0x00000000
^---- Preemption event for the ELSP update; note the lite-restore
synmark2-5425 2..s. 1280851243us : trace_ports: 0000:00:02.0 rcs0: preempted { 130:4698, 66:119966 }
synmark2-5425 2..s. 1280851246us : trace_ports: 0000:00:02.0 rcs0: promote { 130:4696*, 134:1508 }
synmark2-5425 2.... 1280851462us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4700, current 4695
synmark2-5425 2.... 1280852111us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4702, current 4695
synmark2-5425 2.Ns1 1280852296us : process_csb: 0000:00:02.0 rcs0: cs-irq head=0, tail=2
synmark2-5425 2.Ns1 1280852297us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x00000814:0x00000000
synmark2-5425 2.Ns1 1280852299us : trace_ports: 0000:00:02.0 rcs0: completed { 130:4696!, 134:1508 }
synmark2-5425 2.Ns1 1280852301us : process_csb: 0000:00:02.0 rcs0: csb[2]: status=0x00000818:0x00000040
synmark2-5425 2.Ns1 1280852302us : trace_ports: 0000:00:02.0 rcs0: completed { 134:1508, 0:0 }
synmark2-5425 2.Ns1 1280852313us : process_csb: process_csb:2336 GEM_BUG_ON(!i915_request_completed(*execlists->active) && !reset_in_progress(execlists))
Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
Referenecs: 82c69bf58650 ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://patchwork.freedesktop.org/patch/msgid/20200207211452.2860634-1-chris@chris-wilson.co.uk
|
|
As we may add new waiters to a request as it is being run, we need to
mark the list iteration as being safe for concurrent addition.
v2: Mika spotted that we used the same trick for signalers_list, so warn
the compiler about the lockless walk there as well.
Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200207110213.2734386-1-chris@chris-wilson.co.uk
|
|
Mika spotted
<4>[17436.705441] general protection fault: 0000 [#1] PREEMPT SMP PTI
<4>[17436.705447] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.5.0+ #1
<4>[17436.705449] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3805 05/16/2018
<4>[17436.705512] RIP: 0010:__execlists_submission_tasklet+0xc4d/0x16e0 [i915]
<4>[17436.705516] Code: c5 4c 8d 60 e0 75 17 e9 8c 07 00 00 49 8b 44 24 20 49 39 c5 4c 8d 60 e0 0f 84 7a 07 00 00 49 8b 5c 24 08 49 8b 87 80 00 00 00 <48> 39 83 d8 fe ff ff 75 d9 48 8b 83 88 fe ff ff a8 01 0f 84 b6 05
<4>[17436.705518] RSP: 0018:ffffc9000012ce80 EFLAGS: 00010083
<4>[17436.705521] RAX: ffff88822ae42000 RBX: 5a5a5a5a5a5a5a5a RCX: dead000000000122
<4>[17436.705523] RDX: ffff88822ae42588 RSI: ffff8881e32a7908 RDI: ffff8881c429fd48
<4>[17436.705525] RBP: ffffc9000012cf00 R08: ffff88822ae42588 R09: 00000000fffffffe
<4>[17436.705527] R10: ffff8881c429fb80 R11: 00000000a677cf08 R12: ffff8881c42a0aa8
<4>[17436.705529] R13: ffff8881c429fd38 R14: ffff88822ae42588 R15: ffff8881c429fb80
<4>[17436.705532] FS: 0000000000000000(0000) GS:ffff88822ed00000(0000) knlGS:0000000000000000
<4>[17436.705534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[17436.705536] CR2: 00007f858c76d000 CR3: 0000000005610003 CR4: 00000000003606e0
<4>[17436.705538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[17436.705540] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[17436.705542] Call Trace:
<4>[17436.705545] <IRQ>
<4>[17436.705603] execlists_submission_tasklet+0xc0/0x130 [i915]
which is us consuming a partially initialised new waiter in
defer_requests(). We can prevent this by initialising the i915_dependency
prior to making it visible, and since we are using a concurrent
list_add/iterator mark them up to the compiler.
Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200206204915.2636606-2-chris@chris-wilson.co.uk
|
|
The workarounds are a common "feature" across gens and submission
mechanisms and we already call the other WA related functions from
common engine ones (<setup/cleanup>_common), so it makes sense to
do the same with WA application. Medium-term, This will help us
reduce the duplication once the GuC resume function is added, but short
term it will also allow us to use the workaround lists for pre-gen8
engine workarounds.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200131075716.2212299-2-chris@chris-wilson.co.uk
|
|
On Braswell and Broxton (also known as Valleyview and Apollolake), we
need to serialise updates of the GGTT using the big stop_machine()
hammer. This has the side effect of appearing to lockdep as a possible
reclaim (since it uses the cpuhp mutex and that is tainted by per-cpu
allocations). However, we want to use vm->mutex (including ggtt->mutex)
from within the shrinker and so must avoid such possible taints. For this
purpose, we introduced the asynchronous vma binding and we can apply it
to the PIN_GLOBAL so long as take care to add the necessary waits for
the worker afterwards.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/211
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200130181710.2030251-3-chris@chris-wilson.co.uk
|
|
When we reset the engine, we first remove the guilty request from the
active list. If it so happens that there is a pending preemption event
to process before we handle the reset, when we inspect that event we
find ourselves a little confused as we have bent the rules slightly to
perform the reset.
Just ignore any discrepancies inside reset, we know we'll start again
from scratch afterwards.
<0>[ 536.940213] <idle>-0 6..s1 537441383us : execlists_reset: 0000:00:02.0 vcs0: reset for CS error
<0>[ 536.940213] i915_sel-7302 2d..1 537441386us : trace_ports: 0000:00:02.0 vcs0: submit { 10c59:2*, 10c5a:2 }
<0>[ 536.940213] <idle>-0 6d.s2 537471320us : __i915_request_unsubmit: 0000:00:02.0 vcs0: fence 10c59:2, current 1
<0>[ 536.940213] <idle>-0 6d.s2 537471321us : execlists_hold: 0000:00:02.0 vcs0: fence 10c59:2, current 1 on hold
<0>[ 536.940213] <idle>-0 6.Ns1 537471328us : intel_engine_reset: 0000:00:02.0 vcs0: flags=10
<0>[ 536.940213] <idle>-0 6.Ns1 537471421us : execlists_reset_prepare: 0000:00:02.0 vcs0: depth<-1
<0>[ 536.940213] <idle>-0 6.Ns1 537471422us : intel_engine_stop_cs: 0000:00:02.0 vcs0:
<0>[ 536.940213] <idle>-0 6.Ns1 537472424us : intel_engine_stop_cs: 0000:00:02.0 vcs0: timed out on STOP_RING -> IDLE
<0>[ 536.940213] <idle>-0 6.Ns1 537472429us : __intel_gt_reset: 0000:00:02.0 engine_mask=4
<0>[ 536.940213] <idle>-0 6.Ns1 537472442us : execlists_reset_rewind: 0000:00:02.0 vcs0:
<0>[ 536.940213] <idle>-0 6dNs2 537472443us : process_csb: 0000:00:02.0 vcs0: cs-irq head=4, tail=5
<0>[ 536.940213] <idle>-0 6dNs2 537472444us : process_csb: 0000:00:02.0 vcs0: csb[5]: status=0x00008002:0x20000060
<0>[ 536.940213] <idle>-0 6dNs2 537472464us : trace_ports: 0000:00:02.0 vcs0: preempted { 10c59:2*, 0:0 }
<0>[ 536.940213] <idle>-0 6dNs2 537472465us : trace_ports: 0000:00:02.0 vcs0: promote { 10c59:2*, 10c5a:2 }
<0>[ 536.940213] <idle>-0 6dNs2 537472706us : assert_pending_valid: assert_pending_valid:1417 GEM_BUG_ON(!i915_request_is_active(rq))
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200129165935.1266132-1-chris@chris-wilson.co.uk
|
|
Now that we have offline error capture and can reset an engine from
inside an atomic context while also preserving the GPU state for
post-mortem analysis, it is time to handle error interrupts thrown by
the command parser.
This provides a much, much faster mechanism for us to detect known
problems than using heartbeats/hangchecks, and also provides a mechanism
for when those are disabled. However, it is limited to problems the HW
can detect in the CS and so not a complete solution for detecting lockups.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200128204318.4182039-2-chris@chris-wilson.co.uk
|
|
We write to execlists->pending[0] in process_csb() to acknowledge the
completion of the ESLP update, outside of the main spinlock. When we
check the current status of the previous submission in
__execlists_submission_tasklet() we should therefore use READ_ONCE() to
reflect and document the unsynchronized read.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200128171614.3845825-1-chris@chris-wilson.co.uk
|
|
Engine context pinned in perf OA was set to same context id as
the idle context. Set the context id to an unused value.
Clear the sw context id field in lrc descriptor before ORing with
ce->tag (Chris)
Closes: https://gitlab.freedesktop.org/drm/intel/issues/756
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200124013701.40609-1-umesh.nerlige.ramappa@intel.com
|
|
If we encounter a hang on a virtual engine, as we process the hang the
request may already have been moved back to the virtual engine (we are
processing the hang on the physical engine). We need to reclaim the
request from the virtual engine so that the locking is consistent and
local to the real engine on which we will hold the request for error
state capturing.
v2: Pull the reclamation into execlists_hold() and assert that cannot be
called from outside of the reset (i.e. with the tasklet disabled).
v3: Added selftest
v4: Drop the reference owned by the virtual engine
Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
Testcase: igt/gem_exec_balancer/hang
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200122140243.495621-2-chris@chris-wilson.co.uk
|
|
Thanks to preempt-to-busy, we leave the request on the HW as we submit
the preemption request. This means that the request may complete at any
moment as we process HW events, and in particular the request may be
retired as we are planning to capture it for a preemption timeout.
Be more careful while obtaining the request to capture after a
preemption timeout, and check to see if it completed before we were able
to put it on the on-hold list. If we do see it did complete just before
we capture the request, proclaim the preemption-timeout a false positive
and pardon the reset as we should hit an arbitration point momentarily
and so be able to process the preemption.
Note that even after we move the request to be on hold it may be retired
(as the reset to stop the HW comes after), so we do require to hold our
own reference as we work on the request for capture (and all of the
peeking at state within the request needs to be carefully protected).
Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/997
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200122140243.495621-1-chris@chris-wilson.co.uk
|
|
msm needs 5.5-rc4, go to the latest.
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
Currently, we skip error capture upon forced preemption. We apply forced
preemption when there is a higher priority request that should be
running but is being blocked, and we skip inline error capture so that
the preemption request is not further delayed by a user controlled
capture -- extending the denial of service.
However, preemption reset is also used for heartbeats and regular GPU
hangs. By skipping the error capture, we remove the ability to debug GPU
hangs.
In order to capture the error without delaying the preemption request
further, we can do an out-of-line capture by removing the guilty request
from the execution queue and scheduling a worker to dump that request.
When removing a request, we need to remove the entire context and all
descendants from the execution queue, so that they do not jump past.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/738
Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116184754.2860848-3-chris@chris-wilson.co.uk
|
|
In order to support out-of-line error capture, we need to remove the
active request from HW and put it to one side while a worker compresses
and stores all the details associated with that request. (As that
compression may take an arbitrary user-controlled amount of time, we
want to let the engine continue running on other workloads while the
hanging request is dumped.) Not only do we need to remove the active
request, but we also have to remove its context and all requests that
were dependent on it (both in flight, queued and future submission).
Finally once the capture is complete, we need to be able to resubmit the
request and its dependents and allow them to execute.
v2: Replace stack recursion with a simple list.
v3: Check all the parents, not just the first, when searching for a
stuck ancestor!
References: https://gitlab.freedesktop.org/drm/intel/issues/738
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116184754.2860848-2-chris@chris-wilson.co.uk
|
|
If we keep track of when the i915_request.sched.link is on the HW
runlist, or in the priority queue we can simplify our interactions with
the request (such as during rescheduling). This also simplifies the next
patch where we introduce a new in-between list, for requests that are
ready but neither on the run list or in the queue.
v2: Update i915_sched_node.link explanation for current usage where it
is a link on both the queue and on the runlists.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116184754.2860848-1-chris@chris-wilson.co.uk
|