1 files changed, 0 insertions, 769 deletions
diff --git a/gnu/llvm/docs/CommandGuide/llvm-mca.rst b/gnu/llvm/docs/CommandGuide/llvm-mca.rst
deleted file mode 100644
index bc50794e0cb..00000000000
--- a/gnu/llvm/docs/CommandGuide/llvm-mca.rst
+++ /dev/null
@@ -1,769 +0,0 @@
-llvm-mca - LLVM Machine Code Analyzer
-=====================================
-
-SYNOPSIS
---------
-
-:program:`llvm-mca` [*options*] [input]
-
-DESCRIPTION
------------
-
-:program:`llvm-mca` is a performance analysis tool that uses information
-available in LLVM (e.g. scheduling models) to statically measure the performance
-of machine code on a specific CPU.
-
-Performance is measured in terms of throughput as well as processor resource
-consumption. The tool currently works for processors with an out-of-order
-backend, for which there is a scheduling model available in LLVM.
-
-The main goal of this tool is not just to predict the performance of the code
-when run on the target, but also to help diagnose potential performance
-issues.
-
-Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
-Per Cycle (IPC), as well as hardware resource pressure. The analysis and
-reporting style were inspired by the IACA tool from Intel.
-
-For example, you can compile code with clang, output assembly, and pipe it
-directly into :program:`llvm-mca` for analysis:
-
-.. code-block:: bash
-
-  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2
-
-Or for Intel syntax:
-
-.. code-block:: bash
-
-  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2
-
-OPTIONS
--------
-
-If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
-input. Otherwise, it will read from the specified filename.
-
-If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
-to standard output if the input is from standard input.  If the :option:`-o`
-option specifies "``-``", then the output will also be sent to standard output.
-
-
-.. option:: -help
-
- Print a summary of command line options.
-
-.. option:: -mtriple=<target triple>
-
- Specify a target triple string.
-
-.. option:: -march=<arch>
-
- Specify the architecture for which to analyze the code. It defaults to the
- host default target.
-
-.. option:: -mcpu=<cpuname>
-
-  Specify the processor for which to analyze the code.  By default, the cpu name
-  is autodetected from the host.
-
-.. option:: -output-asm-variant=<variant id>
-
- Specify the output assembly variant for the report generated by the tool.
- On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
- format, and a value of 1 selects the Intel assembly format, for the code
- printed out by the tool in the analysis report.
-
-.. option:: -dispatch=<width>
-
- Specify a different dispatch width for the processor. The dispatch width
- defaults to field 'IssueWidth' in the processor scheduling model.  If width is
- zero, then the default dispatch width is used.
-
-.. option:: -register-file-size=<size>
-
- Specify the size of the register file. When specified, this flag limits how
- many physical registers are available for register renaming purposes. A value
- of zero for this flag means "unlimited number of physical registers".
-
-.. option:: -iterations=<number of iterations>
-
- Specify the number of iterations to run. If this flag is set to 0, then the
- tool sets the number of iterations to a default value (i.e. 100).
-
-.. option:: -noalias=<bool>
-
-  If set, the tool assumes that loads and stores don't alias. This is the
-  default behavior.
-
-.. option:: -lqueue=<load queue size>
-
-  Specify the size of the load queue in the load/store unit emulated by the tool.
-  By default, the tool assumes an unbounded number of entries in the load queue.
-  A value of zero for this flag is ignored, and the default load queue size is
-  used instead.
-
-.. option:: -squeue=<store queue size>
-
-  Specify the size of the store queue in the load/store unit emulated by the
-  tool. By default, the tool assumes an unbounded number of entries in the store
-  queue. A value of zero for this flag is ignored, and the default store queue
-  size is used instead.
-
-.. option:: -timeline
-
-  Enable the timeline view.
-
-.. option:: -timeline-max-iterations=<iterations>
-
-  Limit the number of iterations to print in the timeline view. By default, the
-  timeline view prints information for up to 10 iterations.
-
-.. option:: -timeline-max-cycles=<cycles>
-
-  Limit the number of cycles in the timeline view. By default, the number of
-  cycles is set to 80.
-
-.. option:: -resource-pressure
-
-  Enable the resource pressure view. This is enabled by default.
-
-.. option:: -register-file-stats
-
-  Enable register file usage statistics.
-
-.. option:: -dispatch-stats
-
-  Enable extra dispatch statistics. This view collects and analyzes instruction
-  dispatch events, as well as static/dynamic dispatch stall events. This view
-  is disabled by default.
-
-.. option:: -scheduler-stats
-
-  Enable extra scheduler statistics. This view collects and analyzes instruction
-  issue events. This view is disabled by default.
-
-.. option:: -retire-stats
-
-  Enable extra retire control unit statistics. This view is disabled by default.
-
-.. option:: -instruction-info
-
-  Enable the instruction info view. This is enabled by default.
-
-.. option:: -all-stats
-
-  Print all hardware statistics. This enables extra statistics related to the
-  dispatch logic, the hardware schedulers, the register file(s), and the retire
-  control unit. This option is disabled by default.
-
-.. option:: -all-views
-
-  Enable all the views.
-
-.. option:: -instruction-tables
-
-  Print resource pressure information based on the static information
-  available from the processor model. This differs from the resource pressure
-  view because it doesn't require that the code is simulated. It instead prints
-  the theoretical uniform distribution of resource pressure for every
-  instruction in sequence.
-
-
-EXIT STATUS
------------
-
-:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
-to standard error, and the tool returns 1.
-
-USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
----------------------------------------------
-:program:`llvm-mca` allows for the optional usage of special code comments to
-mark regions of the assembly code to be analyzed.  A comment starting with
-substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment
-starting with substring ``LLVM-MCA-END`` marks the end of a code region.  For
-example:
-
-.. code-block:: none
-
-  # LLVM-MCA-BEGIN My Code Region
-    ...
-  # LLVM-MCA-END
-
-Multiple regions can be specified provided that they do not overlap.  A code
-region can have an optional description. If no user-defined region is specified,
-then :program:`llvm-mca` assumes a default region which contains every
-instruction in the input file.  Every region is analyzed in isolation, and the
-final performance report is the union of all the reports generated for every
-code region.
-
-Inline assembly directives may be used from source code to annotate the
-assembly text:
-
-.. code-block:: c++
-
-  int foo(int a, int b) {
-    __asm volatile("# LLVM-MCA-BEGIN foo");
-    a += 42;
-    __asm volatile("# LLVM-MCA-END");
-    a *= b;
-    return a;
-  }
-
-HOW LLVM-MCA WORKS
-------------------
-
-:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
-into a sequence of MCInst with the help of the existing LLVM target assembly
-parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
-to generate a performance report.
-
-The Pipeline module simulates the execution of the machine code sequence in a
-loop of iterations (default is 100). During this process, the pipeline collects
-a number of execution related statistics. At the end of this process, the
-pipeline generates and prints a report from the collected statistics.
-
-Here is an example of a performance report generated by the tool for a
-dot-product of two packed float vectors of four elements. The analysis is
-conducted for target x86, cpu btver2.  The report can be reproduced with the
-following command, using the example located at
-``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
-
-.. code-block:: bash
-
-  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
-
-.. code-block:: none
-
-  Iterations:        300
-  Instructions:      900
-  Total Cycles:      610
-  Total uOps:        900
-
-  Dispatch Width:    2
-  uOps Per Cycle:    1.48
-  IPC:               1.48
-  Block RThroughput: 2.0
-
-
-  Instruction Info:
-  [1]: #uOps
-  [2]: Latency
-  [3]: RThroughput
-  [4]: MayLoad
-  [5]: MayStore
-  [6]: HasSideEffects (U)
-
-  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
-   1      2     1.00                        vmulps	%xmm0, %xmm1, %xmm2
-   1      3     1.00                        vhaddps	%xmm2, %xmm2, %xmm3
-   1      3     1.00                        vhaddps	%xmm3, %xmm3, %xmm4
-
-
-  Resources:
-  [0]   - JALU0
-  [1]   - JALU1
-  [2]   - JDiv
-  [3]   - JFPA
-  [4]   - JFPM
-  [5]   - JFPU0
-  [6]   - JFPU1
-  [7]   - JLAGU
-  [8]   - JMul
-  [9]   - JSAGU
-  [10]  - JSTC
-  [11]  - JVALU0
-  [12]  - JVALU1
-  [13]  - JVIMUL
-
-
-  Resource pressure per iteration:
-  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
-   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -
-
-  Resource pressure by instruction:
-  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
-   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps	%xmm0, %xmm1, %xmm2
-   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm2, %xmm2, %xmm3
-   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4
-
-According to this report, the dot-product kernel has been executed 300 times,
-for a total of 900 simulated instructions. The total number of simulated micro
-opcodes (uOps) is also 900.
-
-The report is structured in three main sections.  The first section collects a
-few performance numbers; the goal of this section is to give a very quick
-overview of the performance throughput. Important performance indicators are
-**IPC**, **uOps Per Cycle**, and  **Block RThroughput** (Block Reciprocal
-Throughput).
-
-IPC is computed by dividing the total number of simulated instructions by the
-total number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
-
-Field *uOps Per Cycle* is computed by dividing the total number of simulated
-micro opcodes by the total number of cycles. A delta between Dispatch Width and
-this field is an indicator of a performance issue. In the absence of
-loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
-theoretical maximum throughput which can be computed by dividing the number of
-uOps of a single iteration by the *Block RThroughput*.
-
-Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
-because the dispatch width limits the maximum size of a dispatch group. Both IPC
-and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
-availability of hardware resources affects the resource pressure distribution,
-and it limits the number of instructions that can be executed in parallel every
-cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
-Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RThroughput*) is an indicator of a performance bottleneck caused by the
-lack of hardware resources.
-In general, the lower the Block RThroughput, the better.
-
-In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
-approach 1.50 when the number of iterations tends to infinity. The delta between
-the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
-an indicator of a performance bottleneck caused by the lack of hardware
-resources, and the *Resource pressure view* can help to identify the problematic
-resource usage.
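-
-The arithmetic behind these indicators can be reproduced with a short Python
-sketch.  The input values are copied from the report above; the variable names
-are illustrative only and are not part of :program:`llvm-mca`.
-
-.. code-block:: python
-
-  # Values copied from the summary section of the example report.
-  iterations = 300
-  instructions = 900
-  total_cycles = 610
-  total_uops = 900
-  dispatch_width = 2
-  block_rthroughput = 2.0
-
-  # Observed throughput: totals divided by the number of simulated cycles.
-  ipc = instructions / total_cycles
-  uops_per_cycle = total_uops / total_cycles
-
-  # Theoretical maximum: uOps of a single iteration divided by Block RThroughput.
-  uops_per_iteration = total_uops / iterations
-  max_uops_per_cycle = uops_per_iteration / block_rthroughput
-
-  print(f"IPC:                     {ipc:.2f}")
-  print(f"uOps Per Cycle:          {uops_per_cycle:.2f}")
-  print(f"Max uOps Per Cycle:      {max_uops_per_cycle:.2f}")
-  print(f"Delta to Dispatch Width: {dispatch_width - max_uops_per_cycle:.2f}")
-
-A non-zero delta on the last line corresponds to the resource bottleneck
-described above.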
-
-The second section of the report shows the latency and reciprocal
-throughput of every instruction in the sequence. That section also reports
-extra information related to the number of micro opcodes, and opcode properties
-(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
-
-The third section is the *Resource pressure view*.  This view reports
-the average number of resource cycles consumed every iteration by instructions
-for every processor resource unit available on the target.  Information is
-structured in two tables. The first table reports the number of resource cycles
-spent on average every iteration. The second table correlates the resource
-cycles to the machine instruction in the sequence. For example, in every
-iteration the vmulps instruction executes on resource unit [6]
-(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
-per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
-only be issued to pipeline JFPU1, while horizontal floating-point additions can
-only be issued to pipeline JFPU0.
-
-The resource pressure view helps with identifying bottlenecks caused by high
-usage of specific hardware resources.  Situations with resource pressure mainly
-concentrated on a few resources should, in general, be avoided.  Ideally,
-pressure should be uniformly distributed between multiple resources.
-
-Timeline View
-^^^^^^^^^^^^^
-The timeline view produces a detailed report of each instruction's state
-transitions through an instruction pipeline.  This view is enabled by the
-command line option ``-timeline``.  As instructions transition through the
-various stages of the pipeline, their states are depicted in the view report.
-These states are represented by the following characters:
-
-* D : Instruction dispatched.
-* e : Instruction executing.
-* E : Instruction executed.
-* R : Instruction retired.
-* = : Instruction already dispatched, waiting to be executed.
-* \- : Instruction executed, waiting to be retired.
-
-Below is the timeline view for a subset of the dot-product example located in
-``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
-:program:`llvm-mca` using the following command:
-
-.. code-block:: bash
-
-  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
-
-.. code-block:: none
-
-  Timeline view:
-                      012345
-  Index     0123456789
-
-  [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2
-  [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3
-  [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4
-  [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2
-  [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3
-  [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4
-  [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2
-  [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3
-  [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4
-
-
-  Average Wait times (based on the timeline view):
-  [0]: Executions
-  [1]: Average time spent waiting in a scheduler's queue
-  [2]: Average time spent waiting in a scheduler's queue while ready
-  [3]: Average time elapsed from WB until retire stage
-
-        [0]    [1]    [2]    [3]
-  0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
-  1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
-  2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
-
-The timeline view is interesting because it shows instruction state changes
-during execution.  It also gives an idea of how the tool processes instructions
-executed on the target, and how their timing information might be calculated.
-
-The timeline view is structured in two tables.  The first table shows
-instructions changing state over time (measured in cycles); the second table
-(named *Average Wait times*) reports useful timing statistics, which should
-help diagnose performance bottlenecks caused by long data dependencies and
-sub-optimal usage of hardware resources.
-
-An instruction in the timeline view is identified by a pair of indices, where
-the first index identifies an iteration, and the second index is the
-instruction index (i.e., where it appears in the code sequence).  Since this
-example was generated using 3 iterations (``-iterations=3``), the iteration
-indices range from 0 to 2 inclusive.
-
-Excluding the first and last column, the remaining columns are in cycles.
-Cycles are numbered sequentially starting from 0.
-
-From the example output above, we know the following:
-
-* Instruction [1,0] was dispatched at cycle 1.
-* Instruction [1,0] started executing at cycle 2.
-* Instruction [1,0] reached the write back stage at cycle 4.
-* Instruction [1,0] was retired at cycle 10.
-
-Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
-scheduler's queue for the operands to become available. By the time vmulps is
-dispatched, operands are already available, and pipeline JFPU1 is ready to
-serve another instruction.  So the instruction can be immediately issued on the
-JFPU1 pipeline. That is demonstrated by the fact that the instruction only
-spent 1cy in the scheduler's queue.
-
-There is a gap of 5 cycles between the write-back stage and the retire event.
-That is because instructions must retire in program order, so [1,0] has to wait
-for [0,2] to be retired first (i.e., it has to wait until cycle 10).
-
-In the example, all instructions are in a RAW (Read After Write) dependency
-chain.  Register %xmm2 written by vmulps is immediately used by the first
-vhaddps, and register %xmm3 written by the first vhaddps is used by the second
-vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
-Parallelism).
-
-In the dot-product example, there are anti-dependencies introduced by
-instructions from different iterations.  However, those dependencies can be
-removed at register renaming stage (at the cost of allocating register aliases,
-and therefore consuming physical registers).
-
-Table *Average Wait times* helps diagnose performance issues that are caused by
-the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP.  Note that :program:`llvm-mca`, by default, assumes at
-least 1cy between the dispatch event and the issue event.
-
-When the performance is limited by data dependencies and/or long latency
-instructions, the number of cycles spent while in the *ready* state is expected
-to be very small when compared with the total number of cycles spent in the
-scheduler's queue.  The difference between the two counters is a good indicator
-of how large of an impact data dependencies had on the execution of the
-instructions.  When performance is mostly limited by the lack of hardware
-resources, the delta between the two counters is small.  However, the number of
-cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
-especially when compared to other low latency instructions.
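-
-As an illustration, columns [1] and [3] of the *Average Wait times* table can
-be recomputed from the timeline strings shown earlier.  The sketch below is
-not how :program:`llvm-mca` computes these values internally.  It assumes that
-the time spent in the scheduler's queue equals the number of ``=`` characters
-plus the 1cy between dispatch and issue, and that the time between write-back
-and retire equals the number of ``-`` characters.  The function name is
-hypothetical.
-
-.. code-block:: python
-
-  from collections import defaultdict
-
-  # (instruction index, timeline string) pairs copied from the example above.
-  rows = [
-      (0, "DeeER"), (1, "D==eeeER"), (2, "D====eeeER"),
-      (0, "DeeE-----R"), (1, "D=eeeE---R"), (2, "D====eeeER"),
-      (0, "DeeE-----R"), (1, "D====eeeER"), (2, "D======eeeER"),
-  ]
-
-  def average_waits(rows):
-      """Return {index: (avg cycles in queue, avg cycles from WB to retire)}."""
-      totals = defaultdict(lambda: [0, 0, 0])  # executions, queue, retire wait
-      for index, timeline in rows:
-          entry = totals[index]
-          entry[0] += 1
-          entry[1] += timeline.count("=") + 1  # assumed: +1cy dispatch to issue
-          entry[2] += timeline.count("-")
-      return {i: (q / n, r / n) for i, (n, q, r) in totals.items()}
-
-  for index, (queue, retire) in sorted(average_waits(rows).items()):
-      print(f"{index}.  queue={queue:.1f}  wb-to-retire={retire:.1f}")
-
-Under these assumptions the results agree with columns [1] and [3] of the
-table above.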
-
-Extra Statistics to Further Diagnose Performance Issues
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The ``-all-stats`` command line option enables extra statistics and performance
-counters for the dispatch logic, the reorder buffer, the retire control unit,
-and the register file.
-
-Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
-for 300 iterations of the dot-product example discussed in the previous
-sections.
-
-.. code-block:: none
-
-  Dynamic Dispatch Stall Cycles:
-  RAT     - Register unavailable:                      0
-  RCU     - Retire tokens unavailable:                 0
-  SCHEDQ  - Scheduler full:                            272  (44.6%)
-  LQ      - Load queue full:                           0
-  SQ      - Store queue full:                          0
-  GROUP   - Static restrictions on the dispatch group: 0
-
-
-  Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
-  [# dispatched], [# cycles]
-   0,              24  (3.9%)
-   1,              272  (44.6%)
-   2,              314  (51.5%)
-
-
-  Schedulers - number of cycles where we saw N instructions issued:
-  [# issued], [# cycles]
-   0,          7  (1.1%)
-   1,          306  (50.2%)
-   2,          297  (48.7%)
-
-  Scheduler's queue usage:
-  [1] Resource name.
-  [2] Average number of used buffer entries.
-  [3] Maximum number of used buffer entries.
-  [4] Total number of buffer entries.
-
-   [1]            [2]        [3]        [4]
-  JALU01           0          0          20
-  JFPU01           17         18         18
-  JLSAGU           0          0          12
-
-
-  Retire Control Unit - number of cycles where we saw N instructions retired:
-  [# retired], [# cycles]
-   0,           109  (17.9%)
-   1,           102  (16.7%)
-   2,           399  (65.4%)
-
-  Total ROB Entries:                64
-  Max Used ROB Entries:             35  ( 54.7% )
-  Average Used ROB Entries per cy:  32  ( 50.0% )
-
-
-  Register File statistics:
-  Total number of mappings created:    900
-  Max number of mappings used:         35
-
-  *  Register File #1 -- JFpuPRF:
-     Number of physical registers:     72
-     Total number of mappings created: 900
-     Max number of mappings used:      35
-
-  *  Register File #2 -- JIntegerPRF:
-     Number of physical registers:     64
-     Total number of mappings created: 0
-     Max number of mappings used:      0
-
-If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
-SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
-logic is unable to dispatch a full group because the scheduler's queue is full.
-
-Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
-dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
-one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
-dispatch statistics are displayed by either using the command option
-``-all-stats`` or ``-dispatch-stats``.
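-
-The figures in the *Dispatch Logic* table can be cross-checked against the
-summary section of the report.  The following Python sketch uses only numbers
-copied from the tables above; the variable names are illustrative.
-
-.. code-block:: python
-
-  # {micro opcodes dispatched in a cycle: number of cycles}, from the table.
-  dispatch_histogram = {0: 24, 1: 272, 2: 314}
-
-  total_cycles = sum(dispatch_histogram.values())
-  total_uops = sum(n * cycles for n, cycles in dispatch_histogram.items())
-
-  for n, cycles in dispatch_histogram.items():
-      print(f"{n} uOps: {cycles} cycles ({100 * cycles / total_cycles:.1f}%)")
-
-  print(f"Total cycles: {total_cycles}")
-  print(f"Total uOps:   {total_uops}")
-  print(f"Average uOps dispatched per cycle: {total_uops / total_cycles:.2f}")
-
-The totals match the 610 cycles and 900 uOps reported in the summary, and the
-average matches the reported *uOps Per Cycle*.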
-
-The next table, *Schedulers*, presents a histogram of the number of
-instructions issued per cycle.  In this case, of the 610 simulated cycles, a
-single instruction was issued in 306 cycles (50.2%), and there were 7 cycles
-in which no instructions were issued.
-
-The *Scheduler's queue usage* table shows the average and maximum number of
-buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
-reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
-three schedulers:
-
-* JALU01 - A scheduler for ALU instructions.
-* JFPU01 - A scheduler for floating point operations.
-* JLSAGU - A scheduler for address generation.
-
-The dot-product is a kernel of three floating point instructions (a vector
-multiply followed by two horizontal adds).  That explains why only the floating
-point scheduler appears to be used.
-
-A full scheduler queue is either caused by data dependency chains or by a
-sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
-mitigated by rewriting the kernel using different instructions that consume
-different scheduler resources.  Schedulers with a small queue are less resilient
-to bottlenecks caused by the presence of long data dependencies.  The scheduler
-statistics are displayed by using the command option ``-all-stats`` or
-``-scheduler-stats``.
-
-The next table, *Retire Control Unit*, presents a histogram of the number of
-instructions retired per cycle.  In this case, of the 610 simulated cycles,
-two instructions were retired during the same cycle 399 times (65.4%), and
-there were 109 cycles in which no instructions were retired.  The retire
-statistics are displayed by using the command option ``-all-stats`` or
-``-retire-stats``.
-
-The last table presented is *Register File statistics*.  Each physical register
-file (PRF) used by the pipeline is presented in this table.  In the case of AMD
-Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
-and one for integer registers (JIntegerPRF).  The table shows that of the 900
-instructions processed, there were 900 mappings created.  Since this dot-product
-example utilized only floating point registers, the JFpuPRF was responsible for
-creating the 900 mappings.  However, we see that the pipeline only used a
-maximum of 35 of 72 available register slots at any given time. We can conclude
-that the floating point PRF was the only register file used for the example, and
-that it was never resource constrained.  The register file statistics are
-displayed by using the command option ``-all-stats`` or
-``-register-file-stats``.
-
-In this example, we can conclude that the IPC is mostly limited by data
-dependencies, and not by resource pressure.
-
-Instruction Flow
-^^^^^^^^^^^^^^^^
-This section describes the instruction flow through the default pipeline of
-:program:`llvm-mca`, as well as the functional units involved in the process.
-
-The default pipeline implements the following sequence of stages used to
-process instructions.
-
-* Dispatch (Instruction is dispatched to the schedulers).
-* Issue (Instruction is issued to the processor pipelines).
-* Write Back (Instruction is executed, and results are written back).
-* Retire (Instruction is retired; writes are architecturally committed).
-
-The default pipeline only models the out-of-order portion of a processor.
-Therefore, the instruction fetch and decode stages are not modeled. Performance
-bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
-instructions have all been decoded and placed into a queue before the simulation
-starts.  Also, :program:`llvm-mca` does not model branch prediction.
-
-Instruction Dispatch
-""""""""""""""""""""
-During the dispatch stage, instructions are picked in program order from a
-queue of already decoded instructions, and dispatched in groups to the
-simulated hardware schedulers.
-
-The size of a dispatch group depends on the availability of the simulated
-hardware resources.  The processor dispatch width defaults to the value
-of the ``IssueWidth`` in LLVM's scheduling model.
-
-An instruction can be dispatched if:
-
-* The size of the dispatch group is smaller than the processor's dispatch width.
-* There are enough entries in the reorder buffer.
-* There are enough physical registers to do register renaming.
-* The schedulers are not full.
-
-Scheduling models can optionally specify which register files are available on
-the processor. :program:`llvm-mca` uses that information to initialize register
-file descriptors.  Users can limit the number of physical registers that are
-globally available for register renaming by using the command option
-``-register-file-size``.  A value of zero for this option means *unbounded*. By
-knowing how many registers are available for renaming, the tool can predict
-dispatch stalls caused by the lack of physical registers.
-
-The number of reorder buffer entries consumed by an instruction depends on the
-number of micro-opcodes specified for that instruction by the target scheduling
-model.  The reorder buffer is responsible for tracking the progress of
-instructions that are "in-flight", and retiring them in program order.  The
-number of entries in the reorder buffer defaults to the value specified by field
-``MicroOpBufferSize`` in the target scheduling model.
-
-Instructions that are dispatched to the schedulers consume scheduler buffer
-entries. :program:`llvm-mca` queries the scheduling model to determine the set
-of buffered resources consumed by an instruction.  Buffered resources are
-treated like scheduler resources.
-
-Instruction Issue
-"""""""""""""""""
-Each processor scheduler implements a buffer of instructions.  An instruction
-has to wait in the scheduler's buffer until input register operands become
-available.  Only at that point does the instruction become eligible for
-execution and may be issued (potentially out-of-order) for execution.
-Instruction latencies are computed by :program:`llvm-mca` with the help of the
-scheduling model.
-
-:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
-schedulers.  The scheduler is responsible for tracking data dependencies, and
-dynamically selecting which processor resources are consumed by instructions.
-It delegates the management of processor resource units and resource groups to a
-resource manager.  The resource manager is responsible for selecting resource
-units that are consumed by instructions.  For example, if an instruction
-consumes 1cy of a resource group, the resource manager selects one of the
-available units from the group; by default, the resource manager uses a
-round-robin selector to guarantee that resource usage is uniformly distributed
-between all units of a group.
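-
-The round-robin selection described above can be sketched in Python as
-follows.  This is a simplified illustration and not the resource manager
-implemented by :program:`llvm-mca`.  The class name, the method name, and the
-``busy`` argument are hypothetical; the unit names are taken from the
-resource list of the example report.
-
-.. code-block:: python
-
-  class ResourceGroup:
-      """A group of resource units served in round-robin order."""
-
-      def __init__(self, units):
-          self.units = list(units)
-          self.next_index = 0  # unit to try first on the next selection
-
-      def select(self, busy=()):
-          """Return the next unit that is not busy, or None if all are busy."""
-          count = len(self.units)
-          for offset in range(count):
-              index = (self.next_index + offset) % count
-              unit = self.units[index]
-              if unit not in busy:
-                  # Start the next search after the unit that was just used.
-                  self.next_index = (index + 1) % count
-                  return unit
-          return None
-
-  group = ResourceGroup(["JFPU0", "JFPU1"])
-  print([group.select() for _ in range(4)])
-
-When no unit is busy, consecutive selections alternate between the units of
-the group, which spreads resource usage uniformly.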
-
-:program:`llvm-mca`'s scheduler internally groups instructions into three sets:
-
-* WaitSet: a set of instructions whose operands are not ready.
-* ReadySet: a set of instructions ready to execute.
-* IssuedSet: a set of instructions executing.
-
-Depending on operand availability, instructions that are dispatched to the
-scheduler are either placed into the WaitSet or into the ReadySet.
-
-Every cycle, the scheduler checks if instructions can be moved from the WaitSet
-to the ReadySet, and if instructions from the ReadySet can be issued to the
-underlying pipelines. The algorithm prioritizes older instructions over younger
-instructions.
-
-Write-Back and Retire Stage
-"""""""""""""""""""""""""""
-Issued instructions are moved from the ReadySet to the IssuedSet.  There,
-instructions wait until they reach the write-back stage.  At that point, they
-get removed from the queue and the retire control unit is notified.
-
-When an instruction is executed, the retire control unit flags it as
-"ready to retire."
-
-Instructions are retired in program order.  The register file is notified of the
-retirement so that it can free the physical registers that were allocated for
-the instruction during the register renaming stage.
-
-Load/Store Unit and Memory Consistency Model
-""""""""""""""""""""""""""""""""""""""""""""
-To simulate the out-of-order execution of memory operations,
-:program:`llvm-mca` uses a load/store unit (LSUnit) that models the
-speculative execution of loads and stores.
-
-Each load (or store) consumes an entry in the load (or store) queue. Users can
-specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
-load and store queues respectively. The queues are unbounded by default.
-
-The LSUnit implements a relaxed consistency model for memory loads and stores.
-The rules are:
-
-1. A younger load is allowed to pass an older load only if there are no
-   intervening stores or barriers between the two loads.
-2. A younger load is allowed to pass an older store provided that the load does
-   not alias with the store.
-3. A younger store is not allowed to pass an older store.
-4. A younger store is not allowed to pass an older load.
-
-By default, the LSUnit optimistically assumes that loads do not alias with
-store operations (``-noalias=true``).  Under this assumption, younger loads are
-always allowed to pass older stores.  Essentially, the LSUnit does not attempt
-to run any alias analysis to predict when loads and stores do not alias with
-each other.
-
-Note that, in the case of write-combining memory, rule 3 could be relaxed to
-allow reordering of non-aliasing store operations.  That being said, at the
-moment, there is no way to further relax the memory model (``-noalias`` is the
-only option).  Essentially, there is no option to specify a different memory
-type (e.g., write-back, write-combining, write-through, etc.) and consequently
-to weaken, or strengthen, the memory model.
-
-Other limitations are:
-
-* The LSUnit does not know when store-to-load forwarding may occur.
-* The LSUnit does not know anything about cache hierarchy and memory types.
-* The LSUnit does not know how to identify serializing operations and memory
-  fences.
-
-The LSUnit does not attempt to predict if a load or store hits or misses the L1
-cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
-loads, the scheduling model provides an "optimistic" load-to-use latency (which
-usually matches the load-to-use latency for when there is a hit in the L1D).
-
-:program:`llvm-mca` does not know about serializing operations or memory-barrier
-like instructions.  The LSUnit conservatively assumes that an instruction which
-has both "MayLoad" and unmodeled side effects behaves like a "soft"
-load-barrier.  That means, it serializes loads without forcing a flush of the
-load queue.  Similarly, instructions that "MayStore" and have unmodeled side
-effects are treated like store barriers.  A full memory barrier is a "MayLoad"
-and "MayStore" instruction with unmodeled side effects.  This is inaccurate, but
-it is the best that we can do at the moment with the current information
-available in LLVM.
-
-A load/store barrier consumes one entry of the load/store queue.  A load/store
-barrier enforces ordering of loads/stores.  A younger load cannot pass a load
-barrier.  Also, a younger store cannot pass a store barrier.  A younger load
-has to wait for the memory/load barrier to execute.  A load/store barrier is
-"executed" when it becomes the oldest entry in the load/store queue(s). That
-also means, by construction, all of the older loads/stores have been executed.
-
-In conclusion, the full set of load/store consistency rules is:
-
-#. A store may not pass a previous store.
-#. A store may not pass a previous load (regardless of ``-noalias``).
-#. A store has to wait until an older store barrier is fully executed.
-#. A load may pass a previous load.
-#. A load may not pass a previous store unless ``-noalias`` is set.
-#. A load has to wait until an older load barrier is fully executed.
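-
-The six rules above can be condensed into a single predicate.  The following
-Python sketch is an illustration only and is not taken from the LSUnit
-implementation.  The function name and the string labels for the operation
-kinds are hypothetical, and combinations not covered by the six rules are
-rejected instead of guessed.
-
-.. code-block:: python
-
-  def may_pass(younger, older, noalias=True):
-      """Return True if the younger operation may pass the older one.
-
-      Operation kinds: "load", "store", "load_barrier", "store_barrier".
-      """
-      if younger == "store":
-          # Rules 1-3: a store never passes an older store, an older load,
-          # or an older store barrier.
-          if older in ("store", "load", "store_barrier"):
-              return False
-      elif younger == "load":
-          if older == "load":
-              return True      # Rule 4
-          if older == "store":
-              return noalias   # Rule 5
-          if older == "load_barrier":
-              return False     # Rule 6
-      raise ValueError(f"combination not covered: {younger} after {older}")
-
-  print(may_pass("load", "store"))                 # default: -noalias=true
-  print(may_pass("load", "store", noalias=False))
-  print(may_pass("store", "load"))
-
-With the default ``-noalias=true`` assumption, only the first call reports
-that the younger operation may pass the older one.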