diff options
| author | 2019-06-23 21:36:31 +0000 | |
|---|---|---|
| committer | 2019-06-23 21:36:31 +0000 | |
| commit | 23f101f37937a1bd4a29726cab2f76e0fb038b35 (patch) | |
| tree | f7da7d6b32c2e07114da399150bfa88d72187012 /gnu/llvm/docs/CommandGuide/llvm-mca.rst | |
| parent | sort previous; ok deraadt (diff) | |
| download | wireguard-openbsd-23f101f37937a1bd4a29726cab2f76e0fb038b35.tar.xz wireguard-openbsd-23f101f37937a1bd4a29726cab2f76e0fb038b35.zip | |
Import LLVM 8.0.0 release including clang, lld and lldb.
Diffstat (limited to 'gnu/llvm/docs/CommandGuide/llvm-mca.rst')
| -rw-r--r-- | gnu/llvm/docs/CommandGuide/llvm-mca.rst | 266 |
1 files changed, 145 insertions, 121 deletions
diff --git a/gnu/llvm/docs/CommandGuide/llvm-mca.rst b/gnu/llvm/docs/CommandGuide/llvm-mca.rst index e44eb2f8ce9..bc50794e0cb 100644 --- a/gnu/llvm/docs/CommandGuide/llvm-mca.rst +++ b/gnu/llvm/docs/CommandGuide/llvm-mca.rst @@ -21,43 +21,12 @@ The main goal of this tool is not just to predict the performance of the code when run on the target, but also help with diagnosing potential performance issues. -Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle -(IPC), as well as hardware resource pressure. The analysis and reporting style -were inspired by the IACA tool from Intel. +Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions +Per Cycle (IPC), as well as hardware resource pressure. The analysis and +reporting style were inspired by the IACA tool from Intel. -:program:`llvm-mca` allows the usage of special code comments to mark regions of -the assembly code to be analyzed. A comment starting with substring -``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with -substring ``LLVM-MCA-END`` marks the end of a code region. For example: - -.. code-block:: none - - # LLVM-MCA-BEGIN My Code Region - ... - # LLVM-MCA-END - -Multiple regions can be specified provided that they do not overlap. A code -region can have an optional description. If no user-defined region is specified, -then :program:`llvm-mca` assumes a default region which contains every -instruction in the input file. Every region is analyzed in isolation, and the -final performance report is the union of all the reports generated for every -code region. - -Inline assembly directives may be used from source code to annotate the -assembly text: - -.. code-block:: c++ - - int foo(int a, int b) { - __asm volatile("# LLVM-MCA-BEGIN foo"); - a += 42; - __asm volatile("# LLVM-MCA-END"); - a *= b; - return a; - } - -So for example, you can compile code with clang, output assembly, and pipe it -directly into llvm-mca for analysis: +For example, you can compile code with clang, output assembly, and pipe it +directly into :program:`llvm-mca` for analysis: .. code-block:: bash @@ -207,6 +176,40 @@ EXIT STATUS :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed to standard error, and the tool returns 1. +USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS +--------------------------------------------- +:program:`llvm-mca` allows for the optional usage of special code comments to +mark regions of the assembly code to be analyzed. A comment starting with +substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment +starting with substring ``LLVM-MCA-END`` marks the end of a code region. For +example: + +.. code-block:: none + + # LLVM-MCA-BEGIN My Code Region + ... + # LLVM-MCA-END + +Multiple regions can be specified provided that they do not overlap. A code +region can have an optional description. If no user-defined region is specified, +then :program:`llvm-mca` assumes a default region which contains every +instruction in the input file. Every region is analyzed in isolation, and the +final performance report is the union of all the reports generated for every +code region. + +Inline assembly directives may be used from source code to annotate the +assembly text: + +.. code-block:: c++ + + int foo(int a, int b) { + __asm volatile("# LLVM-MCA-BEGIN foo"); + a += 42; + __asm volatile("# LLVM-MCA-END"); + a *= b; + return a; + } + HOW LLVM-MCA WORKS ------------------ @@ -235,7 +238,10 @@ the following command using the example located at Iterations: 300 Instructions: 900 Total Cycles: 610 + Total uOps: 900 + Dispatch Width: 2 + uOps Per Cycle: 1.48 IPC: 1.48 Block RThroughput: 2.0 @@ -282,35 +288,45 @@ the following command using the example located at - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4 According to this report, the dot-product kernel has been executed 300 times, -for a total of 900 dynamically executed instructions. +for a total of 900 simulated instructions. The total number of simulated micro +opcodes (uOps) is also 900. The report is structured in three main sections. The first section collects a few performance numbers; the goal of this section is to give a very quick -overview of the performance throughput. In this example, the two important -performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal +overview of the performance throughput. Important performance indicators are +**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal Throughput). IPC is computed dividing the total number of simulated instructions by the total -number of cycles. A delta between Dispatch Width and IPC is an indicator of a -performance issue. In the absence of loop-carried data dependencies, the +number of cycles. In the absence of loop-carried data dependencies, the observed IPC tends to a theoretical maximum which can be computed by dividing the number of instructions of a single iteration by the *Block RThroughput*. -IPC is bounded from above by the dispatch width. That is because the dispatch -width limits the maximum size of a dispatch group. IPC is also limited by the -amount of hardware parallelism. The availability of hardware resources affects -the resource pressure distribution, and it limits the number of instructions -that can be executed in parallel every cycle. A delta between Dispatch -Width and the theoretical maximum IPC is an indicator of a performance -bottleneck caused by the lack of hardware resources. In general, the lower the -Block RThroughput, the better. - -In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since -there are no loop-carried dependencies, the observed IPC is expected to approach -1.50 when the number of iterations tends to infinity. The delta between the -Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of -a performance bottleneck caused by the lack of hardware resources, and the -*Resource pressure view* can help to identify the problematic resource usage. +Field 'uOps Per Cycle' is computed dividing the total number of simulated micro +opcodes by the total number of cycles. A delta between Dispatch Width and this +field is an indicator of a performance issue. In the absence of loop-carried +data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical +maximum throughput which can be computed by dividing the number of uOps of a +single iteration by the *Block RThroughput*. + +Field *uOps Per Cycle* is bounded from above by the dispatch width. That is +because the dispatch width limits the maximum size of a dispatch group. Both IPC +and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The +availability of hardware resources affects the resource pressure distribution, +and it limits the number of instructions that can be executed in parallel every +cycle. A delta between Dispatch Width and the theoretical maximum uOps per +Cycle (computed by dividing the number of uOps of a single iteration by the +*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the +lack of hardware resources. +In general, the lower the Block RThroughput, the better. + +In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there +are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to +approach 1.50 when the number of iterations tends to infinity. The delta between +the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is +an indicator of a performance bottleneck caused by the lack of hardware +resources, and the *Resource pressure view* can help to identify the problematic +resource usage. The second section of the report shows the latency and reciprocal throughput of every instruction in the sequence. That section also reports @@ -454,21 +470,22 @@ The ``-all-stats`` command line option enables extra statistics and performance counters for the dispatch logic, the reorder buffer, the retire control unit, and the register file. -Below is an example of ``-all-stats`` output generated by MCA for the -dot-product example discussed in the previous sections. +Below is an example of ``-all-stats`` output generated by :program:`llvm-mca` +for 300 iterations of the dot-product example discussed in the previous +sections. .. code-block:: none Dynamic Dispatch Stall Cycles: RAT - Register unavailable: 0 RCU - Retire tokens unavailable: 0 - SCHEDQ - Scheduler full: 272 + SCHEDQ - Scheduler full: 272 (44.6%) LQ - Load queue full: 0 SQ - Store queue full: 0 GROUP - Static restrictions on the dispatch group: 0 - Dispatch Logic - number of cycles where we saw N instructions dispatched: + Dispatch Logic - number of cycles where we saw N micro opcodes dispatched: [# dispatched], [# cycles] 0, 24 (3.9%) 1, 272 (44.6%) @@ -481,11 +498,16 @@ dot-product example discussed in the previous sections. 1, 306 (50.2%) 2, 297 (48.7%) - Scheduler's queue usage: - JALU01, 0/20 - JFPU01, 18/18 - JLSAGU, 0/12 + [1] Resource name. + [2] Average number of used buffer entries. + [3] Maximum number of used buffer entries. + [4] Total number of buffer entries. + + [1] [2] [3] [4] + JALU01 0 0 20 + JFPU01 17 18 18 + JLSAGU 0 0 12 Retire Control Unit - number of cycles where we saw N instructions retired: @@ -494,6 +516,10 @@ dot-product example discussed in the previous sections. 1, 102 (16.7%) 2, 399 (65.4%) + Total ROB Entries: 64 + Max Used ROB Entries: 35 ( 54.7% ) + Average Used ROB Entries per cy: 32 ( 50.0% ) + Register File statistics: Total number of mappings created: 900 @@ -511,23 +537,21 @@ dot-product example discussed in the previous sections. If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch -logic is unable to dispatch a group of two instructions because the scheduler's -queue is full. +logic is unable to dispatch a full group because the scheduler's queue is full. -Looking at the *Dispatch Logic* table, we see that the pipeline was only able -to dispatch two instructions 51.5% of the time. The dispatch group was limited -to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The +Looking at the *Dispatch Logic* table, we see that the pipeline was only able to +dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to +one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The dispatch statistics are displayed by either using the command option ``-all-stats`` or ``-dispatch-stats``. The next table, *Schedulers*, presents a histogram displaying a count, representing the number of instructions issued on some number of cycles. In -this case, of the 610 simulated cycles, single -instructions were issued 306 times (50.2%) and there were 7 cycles where -no instructions were issued. +this case, of the 610 simulated cycles, single instructions were issued 306 +times (50.2%) and there were 7 cycles where no instructions were issued. -The *Scheduler's queue usage* table shows that the maximum number of buffer -entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01 +The *Scheduler's queue usage* table shows that the average and maximum number of +buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01 reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements three schedulers: @@ -543,28 +567,28 @@ A full scheduler queue is either caused by data dependency chains or by a sub-optimal usage of hardware resources. Sometimes, resource pressure can be mitigated by rewriting the kernel using different instructions that consume different scheduler resources. Schedulers with a small queue are less resilient -to bottlenecks caused by the presence of long data dependencies. -The scheduler statistics are displayed by -using the command option ``-all-stats`` or ``-scheduler-stats``. +to bottlenecks caused by the presence of long data dependencies. The scheduler +statistics are displayed by using the command option ``-all-stats`` or +``-scheduler-stats``. The next table, *Retire Control Unit*, presents a histogram displaying a count, representing the number of instructions retired on some number of cycles. In -this case, of the 610 simulated cycles, two instructions were retired during -the same cycle 399 times (65.4%) and there were 109 cycles where no -instructions were retired. The retire statistics are displayed by using the -command option ``-all-stats`` or ``-retire-stats``. +this case, of the 610 simulated cycles, two instructions were retired during the +same cycle 399 times (65.4%) and there were 109 cycles where no instructions +were retired. The retire statistics are displayed by using the command option +``-all-stats`` or ``-retire-stats``. The last table presented is *Register File statistics*. Each physical register file (PRF) used by the pipeline is presented in this table. In the case of AMD -Jaguar, there are two register files, one for floating-point registers -(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of -the 900 instructions processed, there were 900 mappings created. Since this -dot-product example utilized only floating point registers, the JFPuPRF was -responsible for creating the 900 mappings. However, we see that the pipeline -only used a maximum of 35 of 72 available register slots at any given time. We -can conclude that the floating point PRF was the only register file used for -the example, and that it was never resource constrained. The register file -statistics are displayed by using the command option ``-all-stats`` or +Jaguar, there are two register files, one for floating-point registers (JFpuPRF) +and one for integer registers (JIntegerPRF). The table shows that of the 900 +instructions processed, there were 900 mappings created. Since this dot-product +example utilized only floating point registers, the JFPuPRF was responsible for +creating the 900 mappings. However, we see that the pipeline only used a +maximum of 35 of 72 available register slots at any given time. We can conclude +that the floating point PRF was the only register file used for the example, and +that it was never resource constrained. The register file statistics are +displayed by using the command option ``-all-stats`` or ``-register-file-stats``. In this example, we can conclude that the IPC is mostly limited by data @@ -572,8 +596,8 @@ dependencies, and not by resource pressure. Instruction Flow ^^^^^^^^^^^^^^^^ -This section describes the instruction flow through MCA's default out-of-order -pipeline, as well as the functional units involved in the process. +This section describes the instruction flow through the default pipeline of +:program:`llvm-mca`, as well as the functional units involved in the process. The default pipeline implements the following sequence of stages used to process instructions. @@ -585,9 +609,9 @@ process instructions. The default pipeline only models the out-of-order portion of a processor. Therefore, the instruction fetch and decode stages are not modeled. Performance -bottlenecks in the frontend are not diagnosed. MCA assumes that instructions -have all been decoded and placed into a queue. Also, MCA does not model branch -prediction. +bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that +instructions have all been decoded and placed into a queue before the simulation +start. Also, :program:`llvm-mca` does not model branch prediction. Instruction Dispatch """""""""""""""""""" @@ -607,19 +631,19 @@ An instruction can be dispatched if: * The schedulers are not full. Scheduling models can optionally specify which register files are available on -the processor. MCA uses that information to initialize register file -descriptors. Users can limit the number of physical registers that are +the processor. :program:`llvm-mca` uses that information to initialize register +file descriptors. Users can limit the number of physical registers that are globally available for register renaming by using the command option -``-register-file-size``. A value of zero for this option means *unbounded*. -By knowing how many registers are available for renaming, MCA can predict -dispatch stalls caused by the lack of registers. +``-register-file-size``. A value of zero for this option means *unbounded*. By +knowing how many registers are available for renaming, the tool can predict +dispatch stalls caused by the lack of physical registers. The number of reorder buffer entries consumed by an instruction depends on the -number of micro-opcodes specified by the target scheduling model. MCA's -reorder buffer's purpose is to track the progress of instructions that are -"in-flight," and to retire instructions in program order. The number of -entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by -the target scheduling model. +number of micro-opcodes specified for that instruction by the target scheduling +model. The reorder buffer is responsible for tracking the progress of +instructions that are "in-flight", and retiring them in program order. The +number of entries in the reorder buffer defaults to the value specified by field +`MicroOpBufferSize` in the target scheduling model. Instructions that are dispatched to the schedulers consume scheduler buffer entries. :program:`llvm-mca` queries the scheduling model to determine the set @@ -646,32 +670,32 @@ available units from the group; by default, the resource manager uses a round-robin selector to guarantee that resource usage is uniformly distributed between all units of a group. -:program:`llvm-mca`'s scheduler implements three instruction queues: +:program:`llvm-mca`'s scheduler internally groups instructions into three sets: -* WaitQueue: a queue of instructions whose operands are not ready. -* ReadyQueue: a queue of instructions ready to execute. -* IssuedQueue: a queue of instructions executing. +* WaitSet: a set of instructions whose operands are not ready. +* ReadySet: a set of instructions ready to execute. +* IssuedSet: a set of instructions executing. -Depending on the operand availability, instructions that are dispatched to the -scheduler are either placed into the WaitQueue or into the ReadyQueue. +Depending on the operands availability, instructions that are dispatched to the +scheduler are either placed into the WaitSet or into the ReadySet. -Every cycle, the scheduler checks if instructions can be moved from the -WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be -issued to the underlying pipelines. The algorithm prioritizes older instructions -over younger instructions. +Every cycle, the scheduler checks if instructions can be moved from the WaitSet +to the ReadySet, and if instructions from the ReadySet can be issued to the +underlying pipelines. The algorithm prioritizes older instructions over younger +instructions. Write-Back and Retire Stage """"""""""""""""""""""""""" -Issued instructions are moved from the ReadyQueue to the IssuedQueue. There, +Issued instructions are moved from the ReadySet to the IssuedSet. There, instructions wait until they reach the write-back stage. At that point, they get removed from the queue and the retire control unit is notified. -When instructions are executed, the retire control unit flags the -instruction as "ready to retire." +When instructions are executed, the retire control unit flags the instruction as +"ready to retire." -Instructions are retired in program order. The register file is notified of -the retirement so that it can free the physical registers that were allocated -for the instruction during the register renaming stage. +Instructions are retired in program order. The register file is notified of the +retirement so that it can free the physical registers that were allocated for +the instruction during the register renaming stage. Load/Store Unit and Memory Consistency Model """""""""""""""""""""""""""""""""""""""""""" |
