Import LLVM 8.0.0 release including clang, lld and lldb.

author: patrick <patrick@openbsd.org> 2019-06-23 21:36:31 +0000
committer: patrick <patrick@openbsd.org> 2019-06-23 21:36:31 +0000
commit: 23f101f37937a1bd4a29726cab2f76e0fb038b35 (patch)
tree: f7da7d6b32c2e07114da399150bfa88d72187012 /gnu/llvm/docs/CommandGuide/llvm-mca.rst
parent: sort previous; ok deraadt (diff)
download: wireguard-openbsd-23f101f37937a1bd4a29726cab2f76e0fb038b35.tar.xz
wireguard-openbsd-23f101f37937a1bd4a29726cab2f76e0fb038b35.zip
1 files changed, 145 insertions, 121 deletions
diff --git a/gnu/llvm/docs/CommandGuide/llvm-mca.rst b/gnu/llvm/docs/CommandGuide/llvm-mca.rst
index e44eb2f8ce9..bc50794e0cb 100644
--- a/gnu/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/gnu/llvm/docs/CommandGuide/llvm-mca.rst
@@ -21,43 +21,12 @@ The main goal of this tool is not just to predict the performance of the code
 when run on the target, but also help with diagnosing potential performance
 issues.
 
-Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle
-(IPC), as well as hardware resource pressure. The analysis and reporting style
-were inspired by the IACA tool from Intel.
+Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
+Per Cycle (IPC), as well as hardware resource pressure. The analysis and
+reporting style were inspired by the IACA tool from Intel.
 
-:program:`llvm-mca` allows the usage of special code comments to mark regions of
-the assembly code to be analyzed.  A comment starting with substring
-``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with
-substring ``LLVM-MCA-END`` marks the end of a code region.  For example:
-
-.. code-block:: none
-
-  # LLVM-MCA-BEGIN My Code Region
-    ...
-  # LLVM-MCA-END
-
-Multiple regions can be specified provided that they do not overlap.  A code
-region can have an optional description. If no user-defined region is specified,
-then :program:`llvm-mca` assumes a default region which contains every
-instruction in the input file.  Every region is analyzed in isolation, and the
-final performance report is the union of all the reports generated for every
-code region.
-
-Inline assembly directives may be used from source code to annotate the 
-assembly text:
-
-.. code-block:: c++
-
-  int foo(int a, int b) {
-    __asm volatile("# LLVM-MCA-BEGIN foo");
-    a += 42;
-    __asm volatile("# LLVM-MCA-END");
-    a *= b;
-    return a;
-  }
-
-So for example, you can compile code with clang, output assembly, and pipe it
-directly into llvm-mca for analysis:
+For example, you can compile code with clang, output assembly, and pipe it
+directly into :program:`llvm-mca` for analysis:
 
 .. code-block:: bash
 
@@ -207,6 +176,40 @@ EXIT STATUS
 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 to standard error, and the tool returns 1.
 
+USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
+---------------------------------------------
+:program:`llvm-mca` allows for the optional usage of special code comments to
+mark regions of the assembly code to be analyzed.  A comment starting with
+substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
+starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For
+example:
+
+.. code-block:: none
+
+  # LLVM-MCA-BEGIN My Code Region
+    ...
+  # LLVM-MCA-END
+
+Multiple regions can be specified provided that they do not overlap.  A code
+region can have an optional description. If no user-defined region is specified,
+then :program:`llvm-mca` assumes a default region which contains every
+instruction in the input file.  Every region is analyzed in isolation, and the
+final performance report is the union of all the reports generated for every
+code region.
+
+Inline assembly directives may be used from source code to annotate the
+assembly text:
+
+.. code-block:: c++
+
+  int foo(int a, int b) {
+    __asm volatile("# LLVM-MCA-BEGIN foo");
+    a += 42;
+    __asm volatile("# LLVM-MCA-END");
+    a *= b;
+    return a;
+  }
+
 HOW LLVM-MCA WORKS
 ------------------
 
@@ -235,7 +238,10 @@ the following command using the example located at
   Iterations:        300
   Instructions:      900
   Total Cycles:      610
+  Total uOps:        900
+
   Dispatch Width:    2
+  uOps Per Cycle:    1.48
   IPC:               1.48
   Block RThroughput: 2.0
 
@@ -282,35 +288,45 @@ the following command using the example located at
    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4
 
 According to this report, the dot-product kernel has been executed 300 times,
-for a total of 900 dynamically executed instructions.
+for a total of 900 simulated instructions. The total number of simulated micro
+opcodes (uOps) is also 900.
 
 The report is structured in three main sections.  The first section collects a
 few performance numbers; the goal of this section is to give a very quick
-overview of the performance throughput. In this example, the two important
-performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal
+overview of the performance throughput. Important performance indicators are
+**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
 Throughput).
 
 IPC is computed dividing the total number of simulated instructions by the total
-number of cycles.  A delta between Dispatch Width and IPC is an indicator of a
-performance issue. In the absence of loop-carried data dependencies, the
+number of cycles. In the absence of loop-carried data dependencies, the
 observed IPC tends to a theoretical maximum which can be computed by dividing
 the number of instructions of a single iteration by the *Block RThroughput*.
 
-IPC is bounded from above by the dispatch width. That is because the dispatch
-width limits the maximum size of a dispatch group. IPC is also limited by the
-amount of hardware parallelism. The availability of hardware resources affects
-the resource pressure distribution, and it limits the number of instructions
-that can be executed in parallel every cycle.  A delta between Dispatch
-Width and the theoretical maximum IPC is an indicator of a performance
-bottleneck caused by the lack of hardware resources. In general, the lower the
-Block RThroughput, the better.
-
-In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since
-there are no loop-carried dependencies, the observed IPC is expected to approach
-1.50 when the number of iterations tends to infinity. The delta between the
-Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of
-a performance bottleneck caused by the lack of hardware resources, and the
-*Resource pressure view* can help to identify the problematic resource usage.
+Field *uOps Per Cycle* is computed by dividing the total number of simulated
+micro opcodes by the total number of cycles. A delta between Dispatch Width and
+this field is an indicator of a performance issue. In the absence of
+loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
+theoretical maximum throughput which can be computed by dividing the number of
+uOps of a single iteration by the *Block RThroughput*.
+
+Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
+because the dispatch width limits the maximum size of a dispatch group. Both IPC
+and *uOps Per Cycle* are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
+cycle.  A delta between Dispatch Width and the theoretical maximum uOps Per
+Cycle (computed by dividing the number of uOps of a single iteration by the
+*Block RThroughput*) is an indicator of a performance bottleneck caused by the
+lack of hardware resources.
+In general, the lower the Block RThroughput, the better.
+
+In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
+are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+approach 1.50 when the number of iterations tends to infinity. The delta between
+the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
+an indicator of a performance bottleneck caused by the lack of hardware
+resources, and the *Resource pressure view* can help to identify the problematic
+resource usage.
 
 The second section of the report shows the latency and reciprocal
 throughput of every instruction in the sequence. That section also reports
@@ -454,21 +470,22 @@ The ``-all-stats`` command line option enables extra statistics and performance
 counters for the dispatch logic, the reorder buffer, the retire control unit,
 and the register file.
 
-Below is an example of ``-all-stats`` output generated by MCA for the
-dot-product example discussed in the previous sections.
+Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
+for 300 iterations of the dot-product example discussed in the previous
+sections.
 
 .. code-block:: none
 
   Dynamic Dispatch Stall Cycles:
   RAT     - Register unavailable:                      0
   RCU     - Retire tokens unavailable:                 0
-  SCHEDQ  - Scheduler full:                            272
+  SCHEDQ  - Scheduler full:                            272  (44.6%)
   LQ      - Load queue full:                           0
   SQ      - Store queue full:                          0
   GROUP   - Static restrictions on the dispatch group: 0
 
 
-  Dispatch Logic - number of cycles where we saw N instructions dispatched:
+  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
   [# dispatched], [# cycles]
    0,              24  (3.9%)
    1,              272  (44.6%)
@@ -481,11 +498,16 @@ dot-product example discussed in the previous sections.
    1,          306  (50.2%)
    2,          297  (48.7%)
 
-
   Scheduler's queue usage:
-  JALU01,  0/20
-  JFPU01,  18/18
-  JLSAGU,  0/12
+  [1] Resource name.
+  [2] Average number of used buffer entries.
+  [3] Maximum number of used buffer entries.
+  [4] Total number of buffer entries.
+
+   [1]            [2]        [3]        [4]
+  JALU01           0          0          20
+  JFPU01           17         18         18
+  JLSAGU           0          0          12
 
 
   Retire Control Unit - number of cycles where we saw N instructions retired:
@@ -494,6 +516,10 @@ dot-product example discussed in the previous sections.
    1,           102  (16.7%)
    2,           399  (65.4%)
 
+  Total ROB Entries:                64
+  Max Used ROB Entries:             35  ( 54.7% )
+  Average Used ROB Entries per cy:  32  ( 50.0% )
+
 
   Register File statistics:
   Total number of mappings created:    900
@@ -511,23 +537,21 @@ dot-product example discussed in the previous sections.
 
 If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
 SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
-logic is unable to dispatch a group of two instructions because the scheduler's
-queue is full.
+logic is unable to dispatch a full group because the scheduler's queue is full.
 
-Looking at the *Dispatch Logic* table, we see that the pipeline was only able
-to dispatch two instructions 51.5% of the time.  The dispatch group was limited
-to one instruction 44.6% of the cycles, which corresponds to 272 cycles.  The
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
+dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
+one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
 dispatch statistics are displayed by either using the command option
 ``-all-stats`` or ``-dispatch-stats``.
 
 The next table, *Schedulers*, presents a histogram displaying a count,
 representing the number of instructions issued on some number of cycles.  In
-this case, of the 610 simulated cycles, single
-instructions were issued 306 times (50.2%) and there were 7 cycles where
-no instructions were issued.
+this case, of the 610 simulated cycles, single instructions were issued 306
+times (50.2%) and there were 7 cycles where no instructions were issued.
 
-The *Scheduler's queue usage* table shows that the maximum number of buffer
-entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
+The *Scheduler's queue usage* table shows the average and maximum number of
+buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
 reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
 three schedulers:
 
@@ -543,28 +567,28 @@ A full scheduler queue is either caused by data dependency chains or by a
 sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
 mitigated by rewriting the kernel using different instructions that consume
 different scheduler resources.  Schedulers with a small queue are less resilient
-to bottlenecks caused by the presence of long data dependencies.
-The scheduler statistics are displayed by
-using the command option ``-all-stats`` or ``-scheduler-stats``.
+to bottlenecks caused by the presence of long data dependencies.  The scheduler
+statistics are displayed by using the command option ``-all-stats`` or
+``-scheduler-stats``.
 
 The next table, *Retire Control Unit*, presents a histogram displaying a count,
 representing the number of instructions retired on some number of cycles.  In
-this case, of the 610 simulated cycles, two instructions were retired during
-the same cycle 399 times (65.4%) and there were 109 cycles where no
-instructions were retired.  The retire statistics are displayed by using the
-command option ``-all-stats`` or ``-retire-stats``.
+this case, of the 610 simulated cycles, two instructions were retired during the
+same cycle 399 times (65.4%) and there were 109 cycles where no instructions
+were retired.  The retire statistics are displayed by using the command option
+``-all-stats`` or ``-retire-stats``.
 
 The last table presented is *Register File statistics*.  Each physical register
 file (PRF) used by the pipeline is presented in this table.  In the case of AMD
-Jaguar, there are two register files, one for floating-point registers
-(JFpuPRF) and one for integer registers (JIntegerPRF).  The table shows that of
-the 900 instructions processed, there were 900 mappings created.  Since this
-dot-product example utilized only floating point registers, the JFPuPRF was
-responsible for creating the 900 mappings.  However, we see that the pipeline
-only used a maximum of 35 of 72 available register slots at any given time. We
-can conclude that the floating point PRF was the only register file used for
-the example, and that it was never resource constrained.  The register file
-statistics are displayed by using the command option ``-all-stats`` or
+Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
+and one for integer registers (JIntegerPRF).  The table shows that of the 900
+instructions processed, there were 900 mappings created.  Since this dot-product
+example utilized only floating point registers, the JFpuPRF was responsible for
+creating the 900 mappings.  However, we see that the pipeline only used a
+maximum of 35 of 72 available register slots at any given time. We can conclude
+that the floating point PRF was the only register file used for the example, and
+that it was never resource constrained.  The register file statistics are
+displayed by using the command option ``-all-stats`` or
 ``-register-file-stats``.
 
 In this example, we can conclude that the IPC is mostly limited by data
@@ -572,8 +596,8 @@ dependencies, and not by resource pressure.
 
 Instruction Flow
 ^^^^^^^^^^^^^^^^
-This section describes the instruction flow through MCA's default out-of-order
-pipeline, as well as the functional units involved in the process.
+This section describes the instruction flow through the default pipeline of
+:program:`llvm-mca`, as well as the functional units involved in the process.
 
 The default pipeline implements the following sequence of stages used to
 process instructions.
@@ -585,9 +609,9 @@ process instructions.
 
 The default pipeline only models the out-of-order portion of a processor.
 Therefore, the instruction fetch and decode stages are not modeled. Performance
-bottlenecks in the frontend are not diagnosed.  MCA assumes that instructions
-have all been decoded and placed into a queue.  Also, MCA does not model branch
-prediction.
+bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
+instructions have all been decoded and placed into a queue before the simulation
+starts.  Also, :program:`llvm-mca` does not model branch prediction.
 
 Instruction Dispatch
 """"""""""""""""""""
@@ -607,19 +631,19 @@ An instruction can be dispatched if:
 * The schedulers are not full.
 
 Scheduling models can optionally specify which register files are available on
-the processor. MCA uses that information to initialize register file
-descriptors.  Users can limit the number of physical registers that are
+the processor. :program:`llvm-mca` uses that information to initialize register
+file descriptors.  Users can limit the number of physical registers that are
 globally available for register renaming by using the command option
-``-register-file-size``.  A value of zero for this option means *unbounded*.
-By knowing how many registers are available for renaming, MCA can predict
-dispatch stalls caused by the lack of registers.
+``-register-file-size``.  A value of zero for this option means *unbounded*. By
+knowing how many registers are available for renaming, the tool can predict
+dispatch stalls caused by the lack of physical registers.
 
 The number of reorder buffer entries consumed by an instruction depends on the
-number of micro-opcodes specified by the target scheduling model.  MCA's
-reorder buffer's purpose is to track the progress of instructions that are
-"in-flight," and to retire instructions in program order.  The number of
-entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
-the target scheduling model.
+number of micro-opcodes specified for that instruction by the target scheduling
+model.  The reorder buffer is responsible for tracking the progress of
+instructions that are "in-flight", and retiring them in program order.  The
+number of entries in the reorder buffer defaults to the value specified by field
+`MicroOpBufferSize` in the target scheduling model.
 
 Instructions that are dispatched to the schedulers consume scheduler buffer
 entries. :program:`llvm-mca` queries the scheduling model to determine the set
@@ -646,32 +670,32 @@ available units from the group; by default, the resource manager uses a
 round-robin selector to guarantee that resource usage is uniformly distributed
 between all units of a group.
 
-:program:`llvm-mca`'s scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler internally groups instructions into three sets:
 
-* WaitQueue: a queue of instructions whose operands are not ready.
-* ReadyQueue: a queue of instructions ready to execute.
-* IssuedQueue: a queue of instructions executing.
+* WaitSet: a set of instructions whose operands are not ready.
+* ReadySet: a set of instructions ready to execute.
+* IssuedSet: a set of instructions executing.
 
-Depending on the operand availability, instructions that are dispatched to the
-scheduler are either placed into the WaitQueue or into the ReadyQueue.
+Depending on operand availability, instructions that are dispatched to the
+scheduler are either placed into the WaitSet or into the ReadySet.
 
-Every cycle, the scheduler checks if instructions can be moved from the
-WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued to the underlying pipelines. The algorithm prioritizes older instructions
-over younger instructions.
+Every cycle, the scheduler checks if instructions can be moved from the WaitSet
+to the ReadySet, and if instructions from the ReadySet can be issued to the
+underlying pipelines. The algorithm prioritizes older instructions over younger
+instructions.
 
 Write-Back and Retire Stage
 """""""""""""""""""""""""""
-Issued instructions are moved from the ReadyQueue to the IssuedQueue.  There,
+Issued instructions are moved from the ReadySet to the IssuedSet.  There,
 instructions wait until they reach the write-back stage.  At that point, they
 get removed from the queue and the retire control unit is notified.
 
-When instructions are executed, the retire control unit flags the
-instruction as "ready to retire."
+When an instruction is executed, the retire control unit flags it as
+"ready to retire."
 
-Instructions are retired in program order.  The register file is notified of
-the retirement so that it can free the physical registers that were allocated
-for the instruction during the register renaming stage.
+Instructions are retired in program order.  The register file is notified of the
+retirement so that it can free the physical registers that were allocated for
+the instruction during the register renaming stage.
 
 Load/Store Unit and Memory Consistency Model
 """"""""""""""""""""""""""""""""""""""""""""
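
Note on the WaitSet/ReadySet/IssuedSet description in the last hunk: the
movement of instructions between the three sets can be illustrated with a toy
model. This is a sketch under stated assumptions, not the llvm-mca
implementation: the ``Instr`` class, the ``simulate`` function, the
``issue_width`` parameter and the latencies are all hypothetical, every
dependency is assumed to name an instruction in the input, and every latency is
assumed to be at least one cycle. Hardware resources, the reorder buffer and
the register file are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    index: int            # program order; a lower index means an older instruction
    latency: int          # cycles spent executing (made-up value, must be >= 1)
    deps: tuple = ()      # indices of instructions whose results are required
    cycles_left: int = 0  # remaining execution cycles once issued


def simulate(instrs, issue_width=1):
    """Return a dict mapping instruction index to its completion cycle."""
    done = {}
    # Dispatch: instructions with pending operands wait, the others are ready.
    wait_set = [i for i in instrs if i.deps]
    ready_set = [i for i in instrs if not i.deps]
    issued_set = []
    cycle = 0
    while wait_set or ready_set or issued_set:
        cycle += 1
        # Move instructions whose operands are now available to the ReadySet.
        for ins in list(wait_set):
            if all(d in done for d in ins.deps):
                wait_set.remove(ins)
                ready_set.append(ins)
        # Issue from the ReadySet, prioritizing older instructions.
        ready_set.sort(key=lambda i: i.index)
        for ins in ready_set[:issue_width]:
            ins.cycles_left = ins.latency
            issued_set.append(ins)
        del ready_set[:issue_width]
        # Execute: instructions leave the IssuedSet when they finish.
        for ins in list(issued_set):
            ins.cycles_left -= 1
            if ins.cycles_left == 0:
                issued_set.remove(ins)
                done[ins.index] = cycle
    return done


completion = simulate([
    Instr(index=0, latency=2),
    Instr(index=1, latency=1, deps=(0,)),  # must wait for instruction 0
    Instr(index=2, latency=1),
])
print(completion)
```

In this run instruction 1 stays in the WaitSet until instruction 0 completes,
while the younger but independent instruction 2 is issued ahead of it.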