author:    patrick <patrick@openbsd.org>  2019-06-23 21:36:31 +0000
committer: patrick <patrick@openbsd.org>  2019-06-23 21:36:31 +0000
commit:    23f101f37937a1bd4a29726cab2f76e0fb038b35 (patch)
tree:      f7da7d6b32c2e07114da399150bfa88d72187012 /gnu/llvm/docs/CommandGuide/llvm-mca.rst
parent:    sort previous; ok deraadt (diff)
Import LLVM 8.0.0 release including clang, lld and lldb.
Diffstat (limited to 'gnu/llvm/docs/CommandGuide/llvm-mca.rst')
-rw-r--r--  gnu/llvm/docs/CommandGuide/llvm-mca.rst | 266
1 file changed, 145 insertions(+), 121 deletions(-)
diff --git a/gnu/llvm/docs/CommandGuide/llvm-mca.rst b/gnu/llvm/docs/CommandGuide/llvm-mca.rst
index e44eb2f8ce9..bc50794e0cb 100644
--- a/gnu/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/gnu/llvm/docs/CommandGuide/llvm-mca.rst
@@ -21,43 +21,12 @@ The main goal of this tool is not just to predict the performance of the code
when run on the target, but also help with diagnosing potential performance
issues.
-Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle
-(IPC), as well as hardware resource pressure. The analysis and reporting style
-were inspired by the IACA tool from Intel.
+Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
+Per Cycle (IPC), as well as hardware resource pressure. The analysis and
+reporting style were inspired by the IACA tool from Intel.
-:program:`llvm-mca` allows the usage of special code comments to mark regions of
-the assembly code to be analyzed. A comment starting with substring
-``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with
-substring ``LLVM-MCA-END`` marks the end of a code region. For example:
-
-.. code-block:: none
-
- # LLVM-MCA-BEGIN My Code Region
- ...
- # LLVM-MCA-END
-
-Multiple regions can be specified provided that they do not overlap. A code
-region can have an optional description. If no user-defined region is specified,
-then :program:`llvm-mca` assumes a default region which contains every
-instruction in the input file. Every region is analyzed in isolation, and the
-final performance report is the union of all the reports generated for every
-code region.
-
-Inline assembly directives may be used from source code to annotate the
-assembly text:
-
-.. code-block:: c++
-
- int foo(int a, int b) {
- __asm volatile("# LLVM-MCA-BEGIN foo");
- a += 42;
- __asm volatile("# LLVM-MCA-END");
- a *= b;
- return a;
- }
-
-So for example, you can compile code with clang, output assembly, and pipe it
-directly into llvm-mca for analysis:
+For example, you can compile code with clang, output assembly, and pipe it
+directly into :program:`llvm-mca` for analysis:
.. code-block:: bash
@@ -207,6 +176,40 @@ EXIT STATUS
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.
+USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
+---------------------------------------------
+:program:`llvm-mca` allows the optional use of special code comments to mark
+regions of the assembly code to be analyzed. A comment starting with the
+substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
+starting with the substring ``LLVM-MCA-END`` marks the end of a code region.
+For example:
+
+.. code-block:: none
+
+ # LLVM-MCA-BEGIN My Code Region
+ ...
+ # LLVM-MCA-END
+
+Multiple regions can be specified provided that they do not overlap. A code
+region can have an optional description. If no user-defined region is specified,
+then :program:`llvm-mca` assumes a default region which contains every
+instruction in the input file. Every region is analyzed in isolation, and the
+final performance report is the union of all the reports generated for every
+code region.
+
+Inline assembly directives may be used from source code to annotate the
+assembly text:
+
+.. code-block:: c++
+
+ int foo(int a, int b) {
+ __asm volatile("# LLVM-MCA-BEGIN foo");
+ a += 42;
+ __asm volatile("# LLVM-MCA-END");
+ a *= b;
+ return a;
+ }
+
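The marker scanning described above can be sketched in a few lines of Python. This is only an illustration of how ``LLVM-MCA-BEGIN``/``LLVM-MCA-END`` comments delimit regions, not llvm-mca's actual parser; the helper name and the ``"default"`` fallback region name are made up here.

```python
def extract_regions(asm_text):
    """Collect instructions between LLVM-MCA-BEGIN and LLVM-MCA-END markers."""
    regions = {}
    current = None
    for line in asm_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# LLVM-MCA-BEGIN"):
            # Anything after the marker is the optional region description.
            current = stripped[len("# LLVM-MCA-BEGIN"):].strip() or "default"
            regions[current] = []
        elif stripped.startswith("# LLVM-MCA-END"):
            current = None
        elif current is not None and stripped:
            regions[current].append(stripped)
    return regions

asm = """
# LLVM-MCA-BEGIN My Code Region
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
# LLVM-MCA-END
"""
print(extract_regions(asm))
```

A region keeps its optional description as its name; instructions outside any marker pair are ignored in this sketch (llvm-mca instead falls back to a default region covering the whole input).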
HOW LLVM-MCA WORKS
------------------
@@ -235,7 +238,10 @@ the following command using the example located at
Iterations: 300
Instructions: 900
Total Cycles: 610
+ Total uOps: 900
+
Dispatch Width: 2
+ uOps Per Cycle: 1.48
IPC: 1.48
Block RThroughput: 2.0
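Using the values from the report above, the headline throughput figures reduce to simple arithmetic. A quick sketch (llvm-mca computes these internally; the numbers here are just the ones quoted in the report):

```python
# Values taken from the summary above.
instructions = 900   # 3 instructions per iteration, 300 iterations
uops = 900           # each instruction decodes to a single micro opcode here
cycles = 610

ipc = instructions / cycles
uops_per_cycle = uops / cycles
print(f"IPC: {ipc:.2f}")                        # IPC: 1.48
print(f"uOps Per Cycle: {uops_per_cycle:.2f}")  # uOps Per Cycle: 1.48
```

IPC and uOps Per Cycle coincide in this kernel only because every instruction is a single micro opcode.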
@@ -282,35 +288,45 @@ the following command using the example located at
- - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4
According to this report, the dot-product kernel has been executed 300 times,
-for a total of 900 dynamically executed instructions.
+for a total of 900 simulated instructions. The total number of simulated micro
+opcodes (uOps) is also 900.
The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
-overview of the performance throughput. In this example, the two important
-performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal
+overview of the performance throughput. Important performance indicators are
+**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
IPC is computed by dividing the total number of simulated instructions by the total
-number of cycles. A delta between Dispatch Width and IPC is an indicator of a
-performance issue. In the absence of loop-carried data dependencies, the
+number of cycles. In the absence of loop-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the *Block RThroughput*.
-IPC is bounded from above by the dispatch width. That is because the dispatch
-width limits the maximum size of a dispatch group. IPC is also limited by the
-amount of hardware parallelism. The availability of hardware resources affects
-the resource pressure distribution, and it limits the number of instructions
-that can be executed in parallel every cycle. A delta between Dispatch
-Width and the theoretical maximum IPC is an indicator of a performance
-bottleneck caused by the lack of hardware resources. In general, the lower the
-Block RThroughput, the better.
-
-In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since
-there are no loop-carried dependencies, the observed IPC is expected to approach
-1.50 when the number of iterations tends to infinity. The delta between the
-Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of
-a performance bottleneck caused by the lack of hardware resources, and the
-*Resource pressure view* can help to identify the problematic resource usage.
+Field *uOps Per Cycle* is computed by dividing the total number of simulated
+micro opcodes by the total number of cycles. A delta between Dispatch Width and
+this field is an indicator of a performance issue. In the absence of
+loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
+theoretical maximum throughput which can be computed by dividing the number of
+uOps of a single iteration by the *Block RThroughput*.
+
+Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
+because the dispatch width limits the maximum size of a dispatch group. Both IPC
+and *uOps Per Cycle* are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
+cycle. A delta between the Dispatch Width and the theoretical maximum uOps per
+Cycle (computed by dividing the number of uOps of a single iteration by the
+*Block RThroughput*) is an indicator of a performance bottleneck caused by the
+lack of hardware resources. In general, the lower the Block RThroughput, the
+better.
+
+In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
+are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+approach 1.50 when the number of iterations tends to infinity. The delta between
+the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
+an indicator of a performance bottleneck caused by the lack of hardware
+resources, and the *Resource pressure view* can help to identify the problematic
+resource usage.
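The delta discussed above follows directly from the reported numbers. A minimal sketch, using only values quoted in the report:

```python
# Values from the dot-product report above.
uops_per_iteration = 900 / 300     # 3 uOps per iteration
block_rthroughput = 2.0
dispatch_width = 2

theoretical_max = uops_per_iteration / block_rthroughput
print(theoretical_max)                   # 1.5
print(dispatch_width - theoretical_max)  # 0.5 -- the bottleneck indicator
```

A nonzero delta says the kernel cannot saturate the dispatch width, which is what sends us to the Resource pressure view.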
The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
@@ -454,21 +470,22 @@ The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.
-Below is an example of ``-all-stats`` output generated by MCA for the
-dot-product example discussed in the previous sections.
+Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
+for 300 iterations of the dot-product example discussed in the previous
+sections.
.. code-block:: none
Dynamic Dispatch Stall Cycles:
RAT - Register unavailable: 0
RCU - Retire tokens unavailable: 0
- SCHEDQ - Scheduler full: 272
+ SCHEDQ - Scheduler full: 272 (44.6%)
LQ - Load queue full: 0
SQ - Store queue full: 0
GROUP - Static restrictions on the dispatch group: 0
- Dispatch Logic - number of cycles where we saw N instructions dispatched:
+ Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
[# dispatched], [# cycles]
0, 24 (3.9%)
1, 272 (44.6%)
@@ -481,11 +498,16 @@ dot-product example discussed in the previous sections.
1, 306 (50.2%)
2, 297 (48.7%)
-
Scheduler's queue usage:
- JALU01, 0/20
- JFPU01, 18/18
- JLSAGU, 0/12
+ [1] Resource name.
+ [2] Average number of used buffer entries.
+ [3] Maximum number of used buffer entries.
+ [4] Total number of buffer entries.
+
+ [1] [2] [3] [4]
+ JALU01 0 0 20
+ JFPU01 17 18 18
+ JLSAGU 0 0 12
Retire Control Unit - number of cycles where we saw N instructions retired:
@@ -494,6 +516,10 @@ dot-product example discussed in the previous sections.
1, 102 (16.7%)
2, 399 (65.4%)
+ Total ROB Entries: 64
+ Max Used ROB Entries: 35 ( 54.7% )
+ Average Used ROB Entries per cy: 32 ( 50.0% )
+
Register File statistics:
Total number of mappings created: 900
@@ -511,23 +537,21 @@ dot-product example discussed in the previous sections.
If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
-logic is unable to dispatch a group of two instructions because the scheduler's
-queue is full.
+logic is unable to dispatch a full group because the scheduler's queue is full.
-Looking at the *Dispatch Logic* table, we see that the pipeline was only able
-to dispatch two instructions 51.5% of the time. The dispatch group was limited
-to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
+dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to
+one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by either using the command option
``-all-stats`` or ``-dispatch-stats``.
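The correspondence between the SCHEDQ counter and the one-uOp row of the dispatch histogram is just a percentage of total cycles; as a sketch:

```python
# Values from the tables above.
total_cycles = 610
schedq_full_cycles = 272   # cycles where the scheduler queue limited dispatch

pct = 100.0 * schedq_full_cycles / total_cycles
print(f"{pct:.1f}%")       # 44.6% -- matches both rows in the tables above
```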
The next table, *Schedulers*, presents a histogram showing, for each instruction
count, the number of cycles in which that many instructions were issued. In
-this case, of the 610 simulated cycles, single
-instructions were issued 306 times (50.2%) and there were 7 cycles where
-no instructions were issued.
+this case, of the 610 simulated cycles, single instructions were issued 306
+times (50.2%) and there were 7 cycles where no instructions were issued.
-The *Scheduler's queue usage* table shows that the maximum number of buffer
-entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
+The *Scheduler's queue usage* table shows the average and maximum number of
+buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:
@@ -543,28 +567,28 @@ A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
-to bottlenecks caused by the presence of long data dependencies.
-The scheduler statistics are displayed by
-using the command option ``-all-stats`` or ``-scheduler-stats``.
+to bottlenecks caused by the presence of long data dependencies. The scheduler
+statistics are displayed by using the command option ``-all-stats`` or
+``-scheduler-stats``.
The next table, *Retire Control Unit*, presents a histogram showing, for each
instruction count, the number of cycles in which that many instructions were
retired. In
-this case, of the 610 simulated cycles, two instructions were retired during
-the same cycle 399 times (65.4%) and there were 109 cycles where no
-instructions were retired. The retire statistics are displayed by using the
-command option ``-all-stats`` or ``-retire-stats``.
+this case, of the 610 simulated cycles, two instructions were retired during the
+same cycle 399 times (65.4%) and there were 109 cycles where no instructions
+were retired. The retire statistics are displayed by using the command option
+``-all-stats`` or ``-retire-stats``.
The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
-Jaguar, there are two register files, one for floating-point registers
-(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
-the 900 instructions processed, there were 900 mappings created. Since this
-dot-product example utilized only floating point registers, the JFPuPRF was
-responsible for creating the 900 mappings. However, we see that the pipeline
-only used a maximum of 35 of 72 available register slots at any given time. We
-can conclude that the floating point PRF was the only register file used for
-the example, and that it was never resource constrained. The register file
-statistics are displayed by using the command option ``-all-stats`` or
+Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
+and one for integer registers (JIntegerPRF). The table shows that of the 900
+instructions processed, there were 900 mappings created. Since this dot-product
+example utilized only floating point registers, the JFPuPRF was responsible for
+creating the 900 mappings. However, we see that the pipeline only used a
+maximum of 35 of 72 available register slots at any given time. We can conclude
+that the floating point PRF was the only register file used for the example, and
+that it was never resource constrained. The register file statistics are
+displayed by using the command option ``-all-stats`` or
``-register-file-stats``.
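The conclusion that the floating-point PRF was never resource constrained can be checked from the reported figures. A sketch using the values quoted above:

```python
# Values from the register file statistics above.
mappings_created = 900   # one mapping per simulated instruction here
max_used_slots = 35
available_slots = 72     # floating-point PRF slots reported for AMD Jaguar

peak = 100.0 * max_used_slots / available_slots
print(f"peak PRF usage: {peak:.1f}%")
# Peak usage stays well below 100%, so renaming never stalled dispatch.
assert max_used_slots < available_slots
```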
In this example, we can conclude that the IPC is mostly limited by data
@@ -572,8 +596,8 @@ dependencies, and not by resource pressure.
Instruction Flow
^^^^^^^^^^^^^^^^
-This section describes the instruction flow through MCA's default out-of-order
-pipeline, as well as the functional units involved in the process.
+This section describes the instruction flow through the default pipeline of
+:program:`llvm-mca`, as well as the functional units involved in the process.
The default pipeline implements the following sequence of stages used to
process instructions.
@@ -585,9 +609,9 @@ process instructions.
The default pipeline only models the out-of-order portion of a processor.
Therefore, the instruction fetch and decode stages are not modeled. Performance
-bottlenecks in the frontend are not diagnosed. MCA assumes that instructions
-have all been decoded and placed into a queue. Also, MCA does not model branch
-prediction.
+bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
+instructions have all been decoded and placed into a queue before the simulation
+starts. Also, :program:`llvm-mca` does not model branch prediction.
Instruction Dispatch
""""""""""""""""""""
@@ -607,19 +631,19 @@ An instruction can be dispatched if:
* The schedulers are not full.
Scheduling models can optionally specify which register files are available on
-the processor. MCA uses that information to initialize register file
-descriptors. Users can limit the number of physical registers that are
+the processor. :program:`llvm-mca` uses that information to initialize register
+file descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
-``-register-file-size``. A value of zero for this option means *unbounded*.
-By knowing how many registers are available for renaming, MCA can predict
-dispatch stalls caused by the lack of registers.
+``-register-file-size``. A value of zero for this option means *unbounded*. By
+knowing how many registers are available for renaming, the tool can predict
+dispatch stalls caused by the lack of physical registers.
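The stall prediction described above can be sketched with a bounded rename pool. The pool size, the instruction stream, and the helper name are made up for illustration; this is not llvm-mca's implementation, and real pools are replenished as instructions retire.

```python
# Hypothetical sketch: counting dispatch stalls from a bounded rename pool.
def count_rename_stalls(writes_per_instr, pool_size):
    """Each destination register consumes one physical register.
    Nothing retires in this sketch, so the pool only drains (worst case)."""
    free = pool_size
    stalls = 0
    for writes in writes_per_instr:
        if writes > free:
            stalls += 1        # RAT stall: register unavailable
        else:
            free -= writes
    return stalls

# 10 instructions, one destination register each, only 8 physical registers.
print(count_rename_stalls([1] * 10, 8))   # prints 2
```

With ``-register-file-size=0`` (unbounded), no such stall would ever be predicted.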
The number of reorder buffer entries consumed by an instruction depends on the
-number of micro-opcodes specified by the target scheduling model. MCA's
-reorder buffer's purpose is to track the progress of instructions that are
-"in-flight," and to retire instructions in program order. The number of
-entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
-the target scheduling model.
+number of micro-opcodes specified for that instruction by the target scheduling
+model. The reorder buffer is responsible for tracking the progress of
+instructions that are "in-flight", and retiring them in program order. The
+number of entries in the reorder buffer defaults to the value specified by field
+`MicroOpBufferSize` in the target scheduling model.
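The in-order retirement behaviour described above can be sketched as a small buffer that only retires from its head. The class, its entry bookkeeping, and the tiny instruction stream are illustrative only, not llvm-mca's implementation.

```python
# Hypothetical sketch: a reorder buffer tracking in-flight instructions
# and retiring them in program order.
from collections import deque

class ReorderBuffer:
    def __init__(self, num_entries):
        self.num_entries = num_entries   # e.g. the model's MicroOpBufferSize
        self.entries = deque()           # [name, uops, executed] per instruction

    def dispatch(self, name, uops):
        used = sum(e[1] for e in self.entries)
        if used + uops > self.num_entries:
            return False                 # dispatch stalls: ROB is full
        self.entries.append([name, uops, False])
        return True

    def mark_executed(self, name):
        for e in self.entries:
            if e[0] == name:
                e[2] = True

    def retire(self):
        """Retire only from the head, preserving program order."""
        retired = []
        while self.entries and self.entries[0][2]:
            retired.append(self.entries.popleft()[0])
        return retired

rob = ReorderBuffer(4)
for name in ("a", "b", "c"):
    rob.dispatch(name, 1)
rob.mark_executed("b")      # "b" finished first...
print(rob.retire())         # ...but prints []: "a" is still in flight
rob.mark_executed("a")
print(rob.retire())         # prints ['a', 'b']
```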
Instructions that are dispatched to the schedulers consume scheduler buffer
entries. :program:`llvm-mca` queries the scheduling model to determine the set
@@ -646,32 +670,32 @@ available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.
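The default round-robin policy can be sketched in a couple of lines. This is a simplification: the real resource manager selects among the *available* units of a group, while this sketch ignores availability; the class and unit names are illustrative.

```python
# Hypothetical sketch: round-robin selection among the units of a group.
from itertools import cycle

class ResourceGroup:
    def __init__(self, units):
        self._rr = cycle(units)   # e.g. ["JFPU0", "JFPU1"] on AMD Jaguar

    def select(self):
        return next(self._rr)

group = ResourceGroup(["JFPU0", "JFPU1"])
picks = [group.select() for _ in range(4)]
print(picks)   # ['JFPU0', 'JFPU1', 'JFPU0', 'JFPU1'] -- uniformly distributed
```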
-:program:`llvm-mca`'s scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler internally groups instructions into three sets:
-* WaitQueue: a queue of instructions whose operands are not ready.
-* ReadyQueue: a queue of instructions ready to execute.
-* IssuedQueue: a queue of instructions executing.
+* WaitSet: a set of instructions whose operands are not ready.
+* ReadySet: a set of instructions ready to execute.
+* IssuedSet: a set of instructions executing.
-Depending on the operand availability, instructions that are dispatched to the
-scheduler are either placed into the WaitQueue or into the ReadyQueue.
+Depending on operand availability, instructions that are dispatched to the
+scheduler are either placed into the WaitSet or into the ReadySet.
-Every cycle, the scheduler checks if instructions can be moved from the
-WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued to the underlying pipelines. The algorithm prioritizes older instructions
-over younger instructions.
+Every cycle, the scheduler checks if instructions can be moved from the WaitSet
+to the ReadySet, and if instructions from the ReadySet can be issued to the
+underlying pipelines. The algorithm prioritizes older instructions over younger
+instructions.
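The per-cycle hand-off between the three sets can be sketched as follows. The dependency model (each waiting instruction names its producers) is an illustration, not llvm-mca's internal representation.

```python
# Hypothetical sketch of the WaitSet / ReadySet / IssuedSet hand-off.
def schedule_cycle(wait, ready, issued, completed):
    """Move instructions between sets for one simulated cycle."""
    # WaitSet -> ReadySet: operands become ready once producers completed.
    for instr in list(wait):
        if all(dep in completed for dep in wait[instr]):
            ready.append(instr)
            del wait[instr]
    # ReadySet -> pipelines: issue the oldest ready instruction first.
    if ready:
        issued.append(ready.pop(0))

wait = {"mul": ["load"]}        # 'mul' depends on 'load'
ready, issued, completed = ["load"], [], set()

schedule_cycle(wait, ready, issued, completed)
print(issued)                   # prints ['load']; 'mul' is still waiting
completed.add("load")
schedule_cycle(wait, ready, issued, completed)
print(issued)                   # prints ['load', 'mul']
```

Popping index 0 of the ready list models the older-before-younger priority mentioned above.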
Write-Back and Retire Stage
"""""""""""""""""""""""""""
-Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
+Issued instructions are moved from the ReadySet to the IssuedSet. There,
instructions wait until they reach the write-back stage. At that point, they
are removed from the set, and the retire control unit is notified.
-When instructions are executed, the retire control unit flags the
-instruction as "ready to retire."
+When instructions are executed, the retire control unit flags the instruction as
+"ready to retire."
-Instructions are retired in program order. The register file is notified of
-the retirement so that it can free the physical registers that were allocated
-for the instruction during the register renaming stage.
+Instructions are retired in program order. The register file is notified of the
+retirement so that it can free the physical registers that were allocated for
+the instruction during the register renaming stage.
Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""