aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/i915/i915_request.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-17drm/i915: Bump signaler priority on adding a waiterChris Wilson1-9/+0
The handling of the no-preemption priority level imposes the restriction that we need to maintain the implied ordering even though preemption is disabled. Otherwise we may end up with an AB-BA deadlock across multiple engine due to a real preemption event reordering the no-preemption WAITs. To resolve this issue we currently promote all requests to WAIT on unsubmission, however this interferes with the timeslicing requirement that we do not apply any implicit promotion that will defeat the round-robin timeslice list. (If we automatically promote the active request it will go back to the head of the queue and not the tail!) So we need implicit promotion to prevent reordering around semaphores where we are not allowed to preempt, and we must avoid implicit promotion on unsubmission. So instead of at unsubmit, if we apply that implicit promotion on adding the dependency, we avoid the semaphore deadlock and we also reduce the gains made by the promotion for user space waiting. Furthermore, by keeping the earlier dependencies at a higher level, we reduce the search space for timeslicing without altering runtime scheduling too badly (no dependencies at all will be assigned a higher priority for rrul). v2: Limit the bump to external edges (as originally intended) i.e. between contexts and out to the user. Testcase: igt/gem_concurrent_blit Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-3-chris@chris-wilson.co.uk
2019-05-17drm/i915: Truly bump ready tasks ahead of busywaitsChris Wilson1-19/+12
In commit b7404c7ecb38 ("drm/i915: Bump ready tasks ahead of busywaits"), I tried cutting a corner in order to not install a signal for each of our dependencies, and only listened to requests on which we were intending to busywait. The compromise that was made was that instead of then being able to promote the request with a full NOSEMAPHORE like its non-busywaiting brethren, as we had not ensured we had cleared the semaphore chain, we settled for only using the NEWCLIENT boost. With an over saturated system with multiple NEWCLIENTS in flight at any time, this was found to be an inadequate promotion and left us with a much poorer scheduling order than prior to using semaphores. The outcome of this patch, is that all requests have NOSEMAPHORE priority when they have no dependencies and are ready to run and not busywait, restoring the pre-semaphore ordering on saturated systems. We can demonstrate the effect of poor scheduling order by oversaturating the system using gem_wsim on a system with multiple vcs engines (i.e running the same workloads across more clients than required for peak throughput, e.g. media_load_balance_17i7.wsim -c4 -b context): x v5.1 (normalized) + tip * fix +------------------------------------------------------------------------+ | x | | x | | x | | x | | %x | | %%x | | %%x | | %%x | | %%x | | %%x | | %%x | | %%x | | %%x | | %%x | | %%x | | %#x | | %#x | | %#x | | %#x | | %#x | | + %#xx | | + %#xx | | + %%#xx | | + %%#xx | | + %%#xx | | + %%#xx | | + %%##x | | +++ %%##x | | +++ %%##x | | +++ %%##x | | ++++ %%##x | | ++++ %%##x | | ++++ %%##xx | | ++++ %###xx | | ++++ %###xx | | ++++ %###xx | | ++++ %###xx | | ++++ + %#O#xx | | ++++ + %#O#xx | | ++++++ + %#O#xx | | ++++++++++ %OOOxxx| | ++++++++++ + %#OOO#xx| | + ++++++++++++ ++ +++++ + ++ @@OOOO#xx| | |A_| | ||__________M_______A____________________| | | |A_| | +------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 120 0.99456 1.00628 0.999985 1.0001545 0.0024387139 + 120 0.873021 1.00037 0.884134 0.90148752 0.039190862 Difference at 99.5% confidence -0.098667 +/- 0.0110762 -9.86517% +/- 1.10745% (Student's t, pooled s = 0.0277657) % 120 0.990207 1.00165 0.9970265 0.99699748 0.0021024 Difference at 99.5% confidence -0.003157 +/- 0.000908245 -0.315651% +/- 0.0908105% (Student's t, pooled s = 0.00227678) Fixes: b7404c7ecb38 ("drm/i915: Bump ready tasks ahead of busywaits") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Cc: Dmitry Ermilov <dmitry.ermilov@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-2-chris@chris-wilson.co.uk
2019-05-17drm/i915: Mark semaphores as complete on unsubmit out if payload was startedChris Wilson1-0/+6
Avoid charging us for the presumed busywait if the request was preempted after successfully using semaphores to reduce inter-engine latency. v2: Bump the priority to reflect the lack of semaphores now required. References: ca6e56f654e7 ("drm/i915: Disable semaphore busywaits on saturated systems") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-1-chris@chris-wilson.co.uk
2019-05-08drm/i915: Seal races between async GPU cancellation, retirement and signalingChris Wilson1-0/+1
Currently there is an underlying assumption that i915_request_unsubmit() is synchronous wrt the GPU -- that is the request is no longer in flight as we remove it. In the near future that may change, and this may upset our signaling as we can process an interrupt for that request while it is no longer in flight. CPU0 CPU1 intel_engine_breadcrumbs_irq (queue request completion) i915_request_cancel_signaling ... ... i915_request_enable_signaling dma_fence_signal Hence in the time it took us to drop the lock to signal the request, a preemption event may have occurred and re-queued the request. In the process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and so reused the rq->signal_link that was in use on CPU0, leading to bad pointer chasing in intel_engine_breadcrumbs_irq. A related issue was that if someone started listening for a signal on a completed but no longer in-flight request, we missed the opportunity to immediately signal that request. Furthermore, as intel_contexts may be immediately released during request retirement, in order to be entirely sure that intel_engine_breadcrumbs_irq may no longer dereference the intel_context (ce->signals and ce->signal_link), we must wait for irq spinlock. In order to prevent the race, we use a bit in the fence.flags to signal the transfer onto the signal list inside intel_engine_breadcrumbs_irq. For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then quickly signals to any outside observer that the fence is indeed signaled. v2: Sketch out potential dma-fence API for manual signaling v3: And the test_and_set_bit() Fixes: 52c0fdb25c7c ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190508112452.18942-1-chris@chris-wilson.co.uk
2019-05-07drm/i915: Only reschedule the submission tasklet if preemption is possibleChris Wilson1-2/+0
If we couple the scheduler more tightly with the execlists policy, we can apply the preemption policy to the question of whether we need to kick the tasklet at all for this priority bump. v2: Rephrase it as a core i915 policy and not an execlists foible. v3: Pull the kick together. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190507122544.12698-1-chris@chris-wilson.co.uk
2019-05-07drm/i915: Acquire the signaler's timeline HWSP lastChris Wilson1-4/+4
Acquiring the signaler's timeline takes an active reference to their HWSP that we would like to avoid if possible, so take it after performing all of our allocations required to set up the fencing. The acquisition also provides the final check that the target has not already signaled allowing us to avoid the semaphore at the last moment. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190503140239.32668-1-chris@chris-wilson.co.uk
2019-05-04drm/i915: Disable semaphore busywaits on saturated systemsChris Wilson1-1/+39
Asking the GPU to busywait on a memory address, perhaps not unexpectedly in hindsight for a shared system, leads to bus contention that affects CPU programs trying to concurrently access memory. This can manifest as a drop in transcode throughput on highly over-saturated workloads. The only clue offered by perf, is that the bus-cycles (perf stat -e bus-cycles) jumped by 50% when enabling semaphores. This corresponds with extra CPU active cycles being attributed to intel_idle's mwait. This patch introduces a heuristic to try and detect when more than one client is submitting to the GPU pushing it into an oversaturated state. As we already keep track of when the semaphores are signaled, we can inspect their state on submitting the busywait batch and if we planned to use a semaphore but were too late, conclude that the GPU is overloaded and not try to use semaphores in future requests. In practice, this means we optimistically try to use semaphores for the first frame of a transcode job split over multiple engines, and fail if there are multiple clients active and continue not to use semaphores for the subsequent frames in the sequence. Periodically, we try to optimistically switch semaphores back on whenever the client waits to catch up with the transcode results. With 1 client, on Broxton J3455, with the relative fps normalized by %cpu: x no semaphores + drm-tip * patched +------------------------------------------------------------------------+ | * | | *+ | | **+ | | **+ x | | x * +**+ x | | x x * * +***x xx | | x x * * *+***x *x | | x x* + * * *****x *x x | | + x xx+x* + *** * ********* x * | | + x xx+x* * *** +** ********* xx * | | * + ++++* + x*x****+*+* ***+*************+x* * | |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++ *| | |__________A_____M_____| | | |_______________A____M_________| | | |____________A___M________| | +------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 120 2.60475 3.50941 3.31123 3.2143953 0.21117399 + 120 2.3826 3.57077 3.25101 3.1414161 0.28146407 Difference at 95.0% confidence -0.0729792 +/- 0.0629585 -2.27039% +/- 1.95864% (Student's t, pooled s = 0.248814) * 120 2.35536 3.66713 3.2849 3.2059917 0.24618565 No difference proven at 95.0% confidence With 10 clients over-saturating the pipeline: x no semaphores + drm-tip * patched +------------------------------------------------------------------------+ | ++ ** | | ++ ** | | ++ ** | | ++ ** | | ++ xx *** | | ++ xx *** | | ++ xxx*** | | ++ xxx*** | | +++ xxx*** | | +++ xx**** | | +++ xx**** | | +++ xx**** | | +++ xx**** | | ++++ xx**** | | +++++ xx**** | | +++++ x x****** | | ++++++ xxx******* | | ++++++ xxx******* | | ++++++ xxx******* | | ++++++ xx******** | | ++++++ xxxx******** | | ++++++ xxxx******** | | ++++++++ xxxxx********* | |+ + + + ++++++++ xxx*xx**********x* *| | |__A__| | | |__AM__| | | |__A_| | +------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 120 2.47855 2.8972 2.72376 2.7193402 0.074604933 + 120 1.17367 1.77459 1.71977 1.6966782 0.085850697 Difference at 95.0% confidence -1.02266 +/- 0.0203502 -37.607% +/- 0.748352% (Student's t, pooled s = 0.0804246) * 120 2.57868 3.00821 2.80142 2.7923878 0.058646477 Difference at 95.0% confidence 0.0730476 +/- 0.0169791 2.68622% +/- 0.624383% (Student's t, pooled s = 0.0671018) Indicating that we've recovered the regression from enabling semaphores on this saturated setup, with a hint towards an overall improvement. Very similar, but of smaller magnitude, results are observed on both Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big core, in this particular test. One observation to make here is that for a greedy client trying to maximise its own throughput, using semaphores is the right choice. It is only the holistic system-wide view that semaphores of one client impacts another and reduces the overall throughput where we would choose to disable semaphores. The most noticeable negactive impact this has is on the no-op microbenchmarks, which are also very notable for having no cpu bus load. In particular, this increases the runtime and energy consumption of gem_exec_whisper. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Cc: Dmitry Ermilov <dmitry.ermilov@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk
2019-05-03drm/i915: Delay semaphore submission until the start of the signalerChris Wilson1-0/+19
Currently we submit the semaphore busywait as soon as the signaler is submitted to HW. However, we may submit the signaler as the tail of a batch of requests, and even not as the first context in the HW list, i.e. the busywait may start spinning far in advance of the signaler even starting. If we wait until the request before the signaler is completed before submitting the busywait, we prevent the busywait from starting too early, if the signaler is not first in submission port. To handle the case where the signaler is at the start of the second (or later) submission port, we will need to delay the execution callback until we know the context is promoted to port0. A challenge for later. Fixes: e88619646971 ("drm/i915: Use HW semaphores for inter-engine synchroni sation on gen8+") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190501114541.10077-9-chris@chris-wilson.co.uk
2019-04-26drm/i915: Move i915_request_alloc into selftests/Chris Wilson1-38/+0
Having transitioned GEM over to using intel_context as its primary means of tracking the GEM context and engine combined and using i915_request_create(), we can move the older i915_request_alloc() helper function into selftests/ where the remaining users are confined. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-9-chris@chris-wilson.co.uk
2019-04-26drm/i915: Switch back to an array of logical per-engine HW contextsChris Wilson1-12/+3
We switched to a tree of per-engine HW context to accommodate the introduction of virtual engines. However, we plan to also support multiple instances of the same engine within the GEM context, defeating our use of the engine as a key to looking up the HW context. Just allocate a logical per-engine instance and always use an index into the ctx->engines[]. Later on, this ctx->engines[] may be replaced by a user specified map. v2: Add for_each_gem_engine() helper to iterator within the engines lock v3: intel_context_create_request() helper v4: s/unsigned long/unsigned int/ 4 billion engines is quite enough. v5: Push iterator locking to caller Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-7-chris@chris-wilson.co.uk
2019-04-26drm/i915: Export intel_context_instance()Chris Wilson1-1/+10
We want to pass in a intel_context into intel_context_pin() and that requires us to first be able to lookup the intel_context! Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-2-chris@chris-wilson.co.uk
2019-04-25drm/i915: Explicitly pin the logical context for execbufChris Wilson1-9/+0
In order to separate the reservation phase of building a request from its emission phase, we need to pull some of the request alloc activities from deep inside i915_request to the surface, GEM_EXECBUFFER. v2: Be frivolous, use a local drm_i915_private. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190425050143.811-1-chris@chris-wilson.co.uk
2019-04-24drm/i915: Invert the GEM wakeref hierarchyChris Wilson1-5/+5
In the current scheme, on submitting a request we take a single global GEM wakeref, which trickles down to wake up all GT power domains. This is undesirable as we would like to be able to localise our power management to the available power domains and to remove the global GEM operations from the heart of the driver. (The intent there is to push global GEM decisions to the boundary as used by the GEM user interface.) Now during request construction, each request is responsible via its logical context to acquire a wakeref on each power domain it intends to utilize. Currently, each request takes a wakeref on the engine(s) and the engines themselves take a chipset wakeref. This gives us a transition on each engine which we can extend if we want to insert more powermangement control (such as soft rc6). The global GEM operations that currently require a struct_mutex are reduced to listening to pm events from the chipset GT wakeref. As we reduce the struct_mutex requirement, these listeners should evaporate. Perhaps the biggest immediate change is that this removes the struct_mutex requirement around GT power management, allowing us greater flexibility in request construction. Another important knock-on effect, is that by tracking engine usage, we can insert a switch back to the kernel context on that engine immediately, avoiding any extra delay or inserting global synchronisation barriers. This makes tracking when an engine and its associated contexts are idle much easier -- important for when we forgo our assumed execution ordering and need idle barriers to unpin used contexts. In the process, it means we remove a large chunk of code whose only purpose was to switch back to the kernel context. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Imre Deak <imre.deak@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-5-chris@chris-wilson.co.uk
2019-04-24drm/i915: Pass intel_context to i915_request_create()Chris Wilson1-105/+142
Start acquiring the logical intel_context and using that as our primary means for request allocation. This is the initial step to allow us to avoid requiring struct_mutex for request allocation along the perma-pinned kernel context, but it also provides a foundation for breaking up the complex request allocation to handle different scenarios inside execbuf. For the purpose of emitting a request from inside retirement (see the next patch for engine power management), we also need to lift control over the timeline mutex to the caller. v2: Note that the request carries the active reference upon construction. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-4-chris@chris-wilson.co.uk
2019-04-24drm/i915: Introduce context->enter() and context->exit()Chris Wilson1-18/+4
We wish to start segregating the power management into different control domains, both with respect to the hardware and the user interface. The first step is that at the lowest level flow of requests, we want to process a context event (and not a global GEM operation). In this patch, we introduce the context callbacks that in future patches will be redirected to per-engine interfaces leading to global operations as required. The intent is that this will be guarded by the timeline->mutex, except that retiring has not quite finished transitioning over from being guarded by struct_mutex. So at the moment it is protected by struct_mutex with a reminded to switch. v2: Rename default handlers to intel_context_enter_engine. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-3-chris@chris-wilson.co.uk
2019-04-24drm/i915: Move GraphicsTechnology files under gt/Chris Wilson1-1/+0
Start partitioning off the code that talks to the hardware (GT) from the uapi layers and move the device facing code under gt/ One casualty is s/intel_ringbuffer.h/intel_engine.h/ with the plan to subdivide that header and body further (and split out the submission code from the ringbuffer and logical context handling). This patch aims to be simple motion so git can fixup inflight patches with little mess. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190424174839.7141-1-chris@chris-wilson.co.uk
2019-04-19drm/i915: Expose the busyspin durations for i915_wait_requestChris Wilson1-2/+25
An interesting discussion regarding "hybrid interrupt polling" for NVMe came to the conclusion that the ideal busyspin before sleeping was half of the expected request latency (and better if it was already halfway through that request). This suggested that we too should look again at our tradeoff between spinning and waiting. Currently, our spin simply tries to hide the cost of enabling the interrupt, which is good to avoid penalising nop requests (i.e. test throughput) and not much else. Studying real world workloads suggests that a spin of upto 500us can dramatically boost performance, but the suggestion is that this is not from avoiding interrupt latency per-se, but from secondary effects of sleeping such as allowing the CPU reduce cstate and context switch away. In a truly hybrid interrupt polling scheme, we would aim to sleep until just before the request completed and then wake up in advance of the interrupt and do a quick poll to handle completion. This is tricky for ourselves at the moment as we are not recording request times, and since we allow preemption, our requests are not on as a nicely ordered timeline as IO. However, the idea is interesting, for it will certainly help us decide when busyspinning is worthwhile. v2: Expose the spin setting via Kconfig options for easier adjustment and testing. v3: Don't get caught sneaking in a change to the busyspin parameters. v4: Explain more about the "hybrid interrupt polling" scheme that we want to migrate towards. Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com> References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Sagar Kamble <sagar.a.kamble@intel.com> Cc: Eero Tamminen <eero.t.tamminen@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Ben Widawsky <ben@bwidawsk.net> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: MichaƂ Winiarski <michal.winiarski@intel.com> Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190419182625.11186-1-chris@chris-wilson.co.uk
2019-04-11drm/i915: Call i915_sw_fence_fini on request cleanupChris Wilson1-0/+1
As i915_requests are put into an RCU-freelist, they may get reused before debugobjects notice them as being freed. On cleanup, explicitly call i915_sw_fence_fini() so that the debugobject is properly tracked. Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Fixes: b7404c7ecb38 ("drm/i915: Bump ready tasks ahead of busywaits") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190411122445.20060-1-chris@chris-wilson.co.uk
2019-04-11drm/i915: Bump ready tasks ahead of busywaitsChris Wilson1-0/+40
Consider two tasks that are running in parallel on a pair of engines (vcs0, vcs1), but then must complete on a shared engine (rcs0). To maximise throughput, we want to run the first ready task on rcs0 (i.e. the first task that completes on either of vcs0 or vcs1). When using semaphores, however, we will instead queue onto rcs in submission order. To resolve this incorrect ordering, we want to re-evaluate the priority queue when each of the request is ready. Normally this happens because we only insert into the priority queue requests that are ready, but with semaphores we are inserting ahead of their readiness and to compensate we penalize those tasks with reduced priority (so that tasks that do not need to busywait should naturally be run first). However, given a series of tasks that each use semaphores, the queue degrades into submission fifo rather than readiness fifo, and so to counter this we give a small boost to semaphore users as their dependent tasks are completed (and so we no longer require any busywait prior to running the user task as they are then ready themselves). v2: Fixup irqsave for schedule_lock (Tvrtko) Testcase: igt/gem_exec_schedule/semaphore-codependency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Cc: Dmitry Ermilov <dmitry.ermilov@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190409152922.23894-1-chris@chris-wilson.co.uk
2019-04-08drm/i915: Consolidate the timeline->barrierChris Wilson1-9/+0
The timeline is strictly ordered, so by inserting the timeline->barrier request into the timeline->last_request it naturally provides the same barrier. Consolidate the pair of barriers into one as they serve the same purpose. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190408091728.20207-4-chris@chris-wilson.co.uk
2019-04-08drm/i915: extract intel_pm.h from intel_drv.hJani Nikula1-1/+2
It used to be handy that we only had a couple of headers, but over time intel_drv.h has become unwieldy. Extract declarations to a separate header file corresponding to the implementation module, clarifying the modularity of the driver. Ensure the new header is self-contained, and do so with minimal further includes, using forward declarations as needed. Include the new header only where needed, and sort the modified include directives while at it and as needed. No functional changes. v2: gen6_rps_reset_ei() is in i915_irq.c not intel_pm.c. Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/adc6463b95eef3440fba9826793f7d1c5f3b0b4a.1554461791.git.jani.nikula@intel.com
2019-04-05drm/i915: Use lockdep_pin_lock() over the construction of the requestChris Wilson1-0/+5
During request construction, we take the timeline->mutex to ensure exclusive access to the ringbuffer (for command emission) and the timeline itself (for command ordering). The timeline->mutex should not be dropped by callers until we release it in i915_request_add(). lockdep provides a pin/unpin lock facility to detect accidental unlocks inside critical sections, so put it to use for request construction. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190403082132.327-1-chris@chris-wilson.co.uk
2019-04-02drm/i915: Only emit one semaphore per requestChris Wilson1-2/+9
Ideally we only need one semaphore per ring to accommodate waiting on multiple engines in parallel. However, since we do not know which fences we will finally be waiting on, we emit a semaphore for every fence. It turns out to be quite easy to trick ourselves into exhausting our ringbuffer causing an error, just by feeding in a batch that depends on several thousand contexts. Since we never can be waiting on more than one semaphore in parallel (other than perhaps the desire to busywait on multiple engines), just pick the first fence for our semaphore. If we pick the wrong fence to busywait on, we just miss an opportunity to reduce latency. An adaption might be to use sched.flags as either a semaphore counter, or to track the first busywait on each engine, converting it back to a single use bit prior to closing the request. v2: Track first semaphore used per-engine (this caters for our basic igt that semaphores are working). Reported-by: Mika Kuoppala <mika.kuoppala@intel.com> Testcase: igt/gem_exec_fence/long-history Fixes: e88619646971 ("drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190401162641.10963-3-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2019-03-22drm/i915: Allow contexts to share a single timeline across all enginesChris Wilson1-25/+55
Previously, our view has been always to run the engines independently within a context. (Multiple engines happened before we had contexts and timelines, so they always operated independently and that behaviour persisted into contexts.) However, at the user level the context often represents a single timeline (e.g. GL contexts) and userspace must ensure that the individual engines are serialised to present that ordering to the client (or forgot about this detail entirely and hope no one notices - a fair ploy if the client can only directly control one engine themselves ;) In the next patch, we will want to construct a set of engines that operate as one, that have a single timeline interwoven between them, to present a single virtual engine to the user. (They submit to the virtual engine, then we decide which engine to execute on based.) To that end, we want to be able to create contexts which have a single timeline (fence context) shared between all engines, rather than multiple timelines. v2: Move the specialised timeline ordering to its own function. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190322092325.5883-4-chris@chris-wilson.co.uk
2019-03-21drm/i915: Stop storing the context name as the timeline nameChris Wilson1-5/+2
The timeline->name is only used for convenience in pretty printing the i915_request.fence->ops->get_timeline_name() and it is just as convenient to pull it from the gem_context directly. The few instances of its use inside GEM_TRACE() has proven more of a nuisance than helpful, so not worth saving imo. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190321140711.11190-4-chris@chris-wilson.co.uk
2019-03-18drm/i915: Hold a ref to the ring while retiringChris Wilson1-1/+5
As the final request on a ring may hold the reference to this ring (via retiring the last pinned context), we may find ourselves chasing a dangling pointer on completion of the list. A quick solution is to hold a reference to the ring itself as we retire along it so that we only free it after we stop dereferencing it. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190318095204.9913-4-chris@chris-wilson.co.uk
2019-03-08drm/i915: Reduce presumption of request ordering for barriersChris Wilson1-0/+1
Currently we assume that we know the order in which requests run and so can determine if we need to reissue a switch-to-kernel-context prior to idling. That assumption does not hold for the future, so instead of tracking which barriers have been used, simply determine if we have ever switched away from the kernel context by using the engine and before idling ensure that all engines that have been used since the last idle are synchronously switched back to the kernel context for safety (and else of shrinking memory while idle). v2: Use intel_engine_mask_t and ALL_ENGINES Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190308093657.8640-3-chris@chris-wilson.co.uk
2019-03-06drm/i915: Use i915_global_register()Chris Wilson1-14/+22
Rather than manually add every new global into each hook, use i915_global_register() function and keep a list of registered globals to invoke instead. However, I haven't found a way for random drivers to add an .init table to avoid having to manually add ourselves to i915_globals_init() each time. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190305213830.18094-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2019-03-01drm/i915: Prioritise non-busywait semaphore workloadsChris Wilson1-0/+16
We don't want to busywait on the GPU if we have other work to do. If we give non-busywaiting workloads higher (initial) priority than workloads that require a busywait, we will prioritise work that is ready to run immediately. We then also have to be careful that we don't give earlier semaphores an accidental boost because later work doesn't wait on other rings, hence we keep a history of semaphore usage of the dependency chain. v2: Stop rolling the bits into a chain and just use a flag in case this request or any of our dependencies use a semaphore. The rolling around was contagious as Tvrtko was heard to fall off his chair. Testcase: igt/gem_exec_schedule/semaphore Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-4-chris@chris-wilson.co.uk
2019-03-01drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+Chris Wilson1-2/+136
Having introduced per-context seqno, we now have a means to identity progress across the system without feel of rollback as befell the global_seqno. That is we can program a MI_SEMAPHORE_WAIT operation in advance of submission safe in the knowledge that our target seqno and address is stable. However, since we are telling the GPU to busy-spin on the target address until it matches the signaling seqno, we only want to do so when we are sure that busy-spin will be completed quickly. To achieve this we only submit the request to HW once the signaler is itself executing (modulo preemption causing us to wait longer), and we only do so for default and above priority requests (so that idle priority tasks never themselves hog the GPU waiting for others). As might be reasonably expected, HW semaphores excel in inter-engine synchronisation microbenchmarks (where the 3x reduced latency / increased throughput more than offset the power cost of spinning on a second ring) and have significant improvement (can be up to ~10%, most see no change) for single clients that utilize multiple engines (typically media players and transcoders), without regressing multiple clients that can saturate the system or changing the power envelope dramatically. v3: Drop the older NEQ branch, now we pin the signaler's HWSP anyway. v4: Tell the world and include it as part of scheduler caps. Testcase: igt/gem_exec_whisper Testcase: igt/benchmarks/gem_wsim Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-3-chris@chris-wilson.co.uk
2019-03-01drm/i915: Keep timeline HWSP allocated until idle across the systemChris Wilson1-15/+16
In preparation for enabling HW semaphores, we need to keep in flight timeline HWSP alive until its use across entire system has completed, as any other timeline active on the GPU may still refer back to the already retired timeline. We both have to delay recycling available cachelines and unpinning old HWSP until the next idle point. An easy option would be to simply keep all used HWSP until the system as a whole was idle, i.e. we could release them all at once on parking. However, on a busy system, we may never see a global idle point, essentially meaning the resource will be leaked until we are forced to do a GC pass. We already employ a fine-grained idle detection mechanism for vma, which we can reuse here so that each cacheline can be freed immediately after the last request using it is retired. v3: Keep track of the activity of each cacheline. v4: cacheline_free() on canceling the seqno tracking v5: Finally with a testcase to exercise wraparound v6: Pack cacheline into empty bits of page-aligned vaddr v7: Use i915_utils to hide the pointer casting around bit manipulation Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-2-chris@chris-wilson.co.uk
2019-03-01drm/i915: Introduce i915_timeline.mutexChris Wilson1-1/+5
A simple mutex used for guarding the flow of requests in and out of the timeline. In the short-term, it will be used only to guard the addition of requests into the timeline, taken on alloc and released on commit so that only one caller can construct a request into the timeline (important as the seqno and ring pointers must be serialised). This will be used by observers to ensure that the seqno/hwsp is stable. Later, when we have reduced retiring to only operate on a single timeline at a time, we can then use the mutex as the sole guard required for retiring. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190301110547.14758-2-chris@chris-wilson.co.uk
2019-02-28drm/i915/execlists: Suppress mere WAIT preemptionChris Wilson1-0/+15
WAIT is occasionally suppressed by virtue of preempted requests being promoted to NEWCLIENT if they have not all ready received that boost. Make this consistent for all WAIT boosts that they are not allowed to preempt executing contexts and are merely granted the right to be at the front of the queue for the next execution slot. This is in keeping with the desire that the WAIT boost be a minor tweak that does not give excessive promotion to its user and open ourselves to trivial abuse. The problem with the inconsistent WAIT preemption becomes more apparent as the preemption is propagated across the engines, where one engine may preempt and the other not, and we be relying on the exact execution order being consistent across engines (e.g. using HW semaphores to coordinate parallel execution). v2: Also protect GuC submission from false preemption loops. v3: Build bug safeguards and better debug messages for st. v4: Do the priority bumping in unsubmit (i.e. on preemption/reset unwind), applying it earlier during submit causes out-of-order execution combined with execute fences. v5: Call sw_fence_fini for our dummy request (Matthew) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190228220639.3173-1-chris@chris-wilson.co.uk
2019-02-28drm/i915: Make request allocation caches globalChris Wilson1-8/+45
As kmem_caches share the same properties (size, allocation/free behaviour) for all potential devices, we can use global caches. While this potential has worse fragmentation behaviour (one can argue that different devices would have different activity lifetimes, but you can also argue that activity is temporal across the system) it is the default behaviour of the system at large to amalgamate matching caches. The benefit for us is much reduced pointer dancing along the frequent allocation paths. v2: Defer shrinking until after a global grace period for futureproofing multiple consumers of the slab caches, similar to the current strategy for avoiding shrinking too early. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190228102035.5857-1-chris@chris-wilson.co.uk
2019-02-26drm/i915: Remove i915_request.global_seqnoChris Wilson1-29/+5
Having weaned the interrupt handling off using a single global execution queue, we no longer need to emit a global_seqno. Note that we still have a few assumptions about execution order along engine timelines, but this removes the most obvious artefact! Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190226094922.31617-3-chris@chris-wilson.co.uk
2019-02-26drm/i915: Remove access to global seqno in the HWSPChris Wilson1-17/+10
Stop accessing the HWSP to read the global seqno, and stop tracking the mirror in the engine's execution timeline -- it is unused. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190226094922.31617-2-chris@chris-wilson.co.uk
2019-02-20drm/i915: Beware temporary wedging when determining -EIOChris Wilson1-2/+3
At a few points in our uABI, we check to see if the driver is wedged and report -EIO back to the user in that case. However, as we perform the check and reset asynchronously (where once before they were both serialised by the struct_mutex), we may instead see the temporary wedging used to cancel inflight rendering to avoid a deadlock during reset (caused by either us timing out in our reset handler, i915_wedge_on_timeout or with malice aforethought in intel_reset_prepare for a stuck modeset). If we suspect this is the case, that is we see a wedged driver *and* reset in progress, then wait until the reset is resolved before reporting upon the wedged status. v2: might_sleep() (Mika) Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109580 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190220145637.23503-1-chris@chris-wilson.co.uk
2019-02-19drm/i915: Use time based guilty context banningChris Wilson1-2/+0
Currently, we accumulate each time a context hangs the GPU, offset against the number of requests it submits, and if that score exceeds a certain threshold, we ban that context from submitting any more requests (cancelling any work in flight). In contrast, we use a simple timer on the file, that if we see more than a 9 hangs faster than 60s apart in total across all of its contexts, we will ban the client from creating any more contexts. This leads to a confusing situation where the file may be banned before the context, so lets use a simple timer scheme for each. If the context submits 3 hanging requests within a 120s period, declare it forbidden to ever send more requests. This has the advantage of not being easy to repair by simply sending empty requests, but has the disadvantage that if the context is idle then it is forgiven. However, if the context is idle, it is not disrupting the system, but a hog can evade the request counting and cause much more severe disruption to the system. Updating ban_score from request retirement is dubious as the retirement is purposely not in sync with request submission (i.e. we try and batch retirement to reduce overhead and avoid latency on submission), which leads to surprising situations where we can forgive a hang immediately due to a backlog of requests from before the hang being retired afterwards. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-2-chris@chris-wilson.co.uk
2019-02-15drm/i915: Defer application of request banning to submissionChris Wilson1-0/+3
As we currently do not check on submission whether the context is banned in a timely manner it is possible for some requests to escape cancellation after their parent context is banned. By moving the ban into the request submission under the engine->timeline.lock, we serialise it with the reset and setting of the context ban. References: eb8d0f5af4ec ("drm/i915: Remove GPU reset dependence on struct_mutex") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190213182737.12695-1-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2019-02-13drm/i915: Apply rps waitboosting for dma_fence_wait_timeout()Chris Wilson1-2/+19
As time goes by, usage of generic ioctls such as drm_syncobj and sync_file are on the increase bypassing i915-specific ioctls like GEM_WAIT. Currently, we only apply waitboosting to our driver ioctls as we track the file/client and account the waitboosting to them. However, since commit 7b92c1bd0540 ("drm/i915: Avoid keeping waitboost active for signaling threads"), we no longer have been applying the client ratelimiting on waitboosts and so that information has only been used for debug tracking. Push the application of waitboosting down to the common i915_request_wait, and apply it to all foreign fence waits as well. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Eero Tamminen <eero.t.tamminen@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190213092504.25709-1-chris@chris-wilson.co.uk
2019-02-05drm/i915: Pull i915_gem_active into the i915_active familyChris Wilson1-24/+11
Looking forward, we need to break the struct_mutex dependency on i915_gem_active. In the meantime, external use of i915_gem_active is quite beguiling, little do new users suspect that it implies a barrier as each request it tracks must be ordered wrt the previous one. As one of many, it can be used to track activity across multiple timelines, a shared fence, which fits our unordered request submission much better. We need to steer external users away from the singular, exclusive fence imposed by i915_gem_active to i915_active instead. As part of that process, we move i915_gem_active out of i915_request.c into i915_active.c to start separating the two concepts, and rename it to i915_active_request (both to tie it to the concept of tracking just one request, and to give it a longer, less appealing name). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190205130005.2807-5-chris@chris-wilson.co.uk
2019-02-05drm/i915: Add timeline barrier supportTvrtko Ursulin1-0/+17
Timeline barrier allows serialization between different timelines. After calling i915_timeline_set_barrier with a request, all following submissions on this timeline will be set up as depending on this request, or barrier. Once the barrier has been completed it automatically gets cleared and things continue as normal. This facility will be used by the upcoming context SSEU code. v2: * Assert barrier has been retired on timeline_fini. (Chris Wilson) * Fix mock_timeline. v3: * Improved comment language. (Chris Wilson) v4: * Maintain ordering with previous barriers set on the timeline. v5: * Rebase. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190205095032.22673-3-tvrtko.ursulin@linux.intel.com
2019-02-04drm/i915: Trim NEWCLIENT boostingChris Wilson1-1/+1
Limit the NEWCLIENT boost to only give its small priority boost to fresh clients only that have no dependencies. The idea for using NEWCLIENT boosting, commit b16c765122f9 ("drm/i915: Priority boost for new clients"), is that short-lived streams are often interactive and require lower latency -- and that by executing those ahead of the long running hogs, the short-lived clients do little to interfere with the system throughput by virtue of their short-lived nature. However, we were only considering the client's own timeline for determining whether or not it was a fresh stream. This allowed for compositors to wake up before their vblank and bump all of its client streams. However, in testing with media-bench this results in chaining all cooperating contexts together preventing us from being able to reorder contexts to reduce bubbles (pipeline stalls), overall increasing latency, and reducing system throughput. The exact opposite of our intent. The compromise of applying the NEWCLIENT boost to strictly fresh clients (that do not wait upon anything else) should maintain the "real-time response under load" characteristics of FQ_CODEL, without locking together the long chains of dependencies across the system. References: b16c765122f9 ("drm/i915: Priority boost for new clients") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190204150101.30759-1-chris@chris-wilson.co.uk
2019-01-29drm/i915: Replace global breadcrumbs with per-context interrupt trackingChris Wilson1-93/+49
A few years ago, see commit 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd"), the issue of handling multiple clients waiting in parallel was brought to our attention. The requirement was that every client should be woken immediately upon its request being signaled, without incurring any cpu overhead. To handle certain fragility of our hw meant that we could not do a simple check inside the irq handler (some generations required almost unbounded delays before we could be sure of seqno coherency) and so request completion checking required delegation. Before commit 688e6c725816, the solution was simple. Every client waiting on a request would be woken on every interrupt and each would do a heavyweight check to see if their request was complete. Commit 688e6c725816 introduced an rbtree so that only the earliest waiter on the global timeline would woken, and would wake the next and so on. (Along with various complications to handle requests being reordered along the global timeline, and also a requirement for kthread to provide a delegate for fence signaling that had no process context.) The global rbtree depends on knowing the execution timeline (and global seqno). Without knowing that order, we must instead check all contexts queued to the HW to see which may have advanced. We trim that list by only checking queued contexts that are being waited on, but still we keep a list of all active contexts and their active signalers that we inspect from inside the irq handler. By moving the waiters onto the fence signal list, we can combine the client wakeup with the dma_fence signaling (a dramatic reduction in complexity, but does require the HW being coherent, the seqno must be visible from the cpu before the interrupt is raised - we keep a timer backup just in case). Having previously fixed all the issues with irq-seqno serialisation (by inserting delays onto the GPU after each request instead of random delays on the CPU after each interrupt), we can rely on the seqno state to perfom direct wakeups from the interrupt handler. This allows us to preserve our single context switch behaviour of the current routine, with the only downside that we lose the RT priority sorting of wakeups. In general, direct wakeup latency of multiple clients is about the same (about 10% better in most cases) with a reduction in total CPU time spent in the waiter (about 20-50% depending on gen). Average herd behaviour is improved, but at the cost of not delegating wakeups on task_prio. v2: Capture fence signaling state for error state and add comments to warm even the most cold of hearts. v3: Check if the request is still active before busywaiting v4: Reduce the amount of pointer misdirection with list_for_each_safe and using a local i915_request variable inside the loops v5: Add a missing pluralisation to a purely informative selftest message. References: 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk
2019-01-29drm/i915: Identify active requestsChris Wilson1-5/+5
To allow requests to forgo a common execution timeline, one question we need to be able to answer is "is this request running?". To track whether a request has started on HW, we can emit a breadcrumb at the beginning of the request and check its timeline's HWSP to see if the breadcrumb has advanced past the start of this request. (This is in contrast to the global timeline where we need only ask if we are on the global timeline and if the timeline has advanced past the end of the previous request.) There is still confusion from a preempted request, which has already started but relinquished the HW to a high priority request. For the common case, this discrepancy should be negligible. However, for identification of hung requests, knowing which one was running at the time of the hang will be much more important. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190129185452.20989-2-chris@chris-wilson.co.uk
2019-01-28drm/i915: Track the context's seqno in its own timeline HWSPChris Wilson1-1/+2
Now that we have allocated ourselves a cacheline to store a breadcrumb, we can emit a write from the GPU into the timeline's HWSP of the per-context seqno as we complete each request. This drops the mirroring of the per-engine HWSP and allows each context to operate independently. We do not need to unwind the per-context timeline, and so requests are always consistent with the timeline breadcrumb, greatly simplifying the completion checks as we no longer need to be concerned about the global_seqno changing mid check. One complication though is that we have to be wary that the request may outlive the HWSP and so avoid touching the potentially danging pointer after we have retired the fence. We also have to guard our access of the HWSP with RCU, the release of the obj->mm.pages should already be RCU-safe. At this point, we are emitting both per-context and global seqno and still using the single per-engine execution timeline for resolving interrupts. v2: s/fake_complete/mark_complete/ Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-5-chris@chris-wilson.co.uk
2019-01-28drm/i915: Introduce concept of per-timeline (context) HWSPChris Wilson1-5/+11
Supplement the per-engine HWSP with a per-timeline HWSP. That is a per-request pointer through which we can check a local seqno, abstracting away the presumption of a global seqno. In this first step, we point each request back into the engine's HWSP so everything continues to work with the global timeline. v2: s/i915_request_hwsp/hwsp_seqno/ to emphasis that this is the current HW value and that we are accessing it via i915_request merely as a convenience. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-1-chris@chris-wilson.co.uk
2019-01-25drm/i915: Remove GPU reset dependence on struct_mutexChris Wilson1-47/+0
Now that the submission backends are controlled via their own spinlocks, with a wave of a magic wand we can lift the struct_mutex requirement around GPU reset. That is we allow the submission frontend (userspace) to keep on submitting while we process the GPU reset as we can suspend the backend independently. The major change is around the backoff/handoff strategy for performing the reset. With no mutex deadlock, we no longer have to coordinate with any waiter, and just perform the reset immediately. Testcase: igt/gem_mmap_gtt/hang # regresses Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190125132230.22221-3-chris@chris-wilson.co.uk
2019-01-25drm/i915: Remove manual breadcumb countingChris Wilson1-2/+2
Now that we know we measure the size of the engine->emit_breadcrumb() correctly, we can remove the previous manual counting. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190125120005.25191-1-chris@chris-wilson.co.uk
2019-01-22Merge drm/drm-next into drm-intel-next-queuedRodrigo Vivi1-6/+6
We need avi infoframe stuff who got merged via drm-misc Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>