Diffstat (limited to 'Documentation/trace')
-rw-r--r-- Documentation/trace/boottime-trace.rst | 127
-rw-r--r-- Documentation/trace/coresight/coresight-config.rst | 294
-rw-r--r-- Documentation/trace/coresight/coresight-cpu-debug.rst | 3
-rw-r--r-- Documentation/trace/coresight/coresight-ect.rst | 226
-rw-r--r-- Documentation/trace/coresight/coresight-etm4x-reference.rst | 33
-rw-r--r-- Documentation/trace/coresight/coresight-perf.rst | 158
-rw-r--r-- Documentation/trace/coresight/coresight-trbe.rst | 38
-rw-r--r-- Documentation/trace/coresight/coresight.rst | 205
-rw-r--r-- Documentation/trace/events-kmem.rst | 2
-rw-r--r-- Documentation/trace/events-msr.rst | 2
-rw-r--r-- Documentation/trace/events-power.rst | 21
-rw-r--r-- Documentation/trace/events.rst | 133
-rw-r--r-- Documentation/trace/fprobe.rst | 174
-rw-r--r-- Documentation/trace/ftrace-design.rst | 8
-rw-r--r-- Documentation/trace/ftrace-uses.rst | 102
-rw-r--r-- Documentation/trace/ftrace.rst | 145
-rw-r--r-- Documentation/trace/hisi-ptt.rst | 298
-rw-r--r-- Documentation/trace/histogram-design.rst | 2115
-rw-r--r-- Documentation/trace/histogram.rst | 153
-rw-r--r-- Documentation/trace/hwlat_detector.rst | 13
-rw-r--r-- Documentation/trace/index.rst | 9
-rw-r--r-- Documentation/trace/intel_th.rst | 2
-rw-r--r-- Documentation/trace/kprobes.rst | 789
-rw-r--r-- Documentation/trace/kprobetrace.rst | 10
-rw-r--r-- Documentation/trace/mmiotrace.rst | 2
-rw-r--r-- Documentation/trace/osnoise-tracer.rst | 152
-rw-r--r-- Documentation/trace/postprocess/decode_msr.py | 2
-rw-r--r-- Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 2
-rw-r--r-- Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 2
-rw-r--r-- Documentation/trace/ring-buffer-design.rst (renamed from Documentation/trace/ring-buffer-design.txt) | 812
-rw-r--r-- Documentation/trace/rv/da_monitor_instrumentation.rst | 171
-rw-r--r-- Documentation/trace/rv/da_monitor_synthesis.rst | 147
-rw-r--r-- Documentation/trace/rv/deterministic_automata.rst | 184
-rw-r--r-- Documentation/trace/rv/index.rst | 14
-rw-r--r-- Documentation/trace/rv/monitor_wip.rst | 55
-rw-r--r-- Documentation/trace/rv/monitor_wwnr.rst | 45
-rw-r--r-- Documentation/trace/rv/runtime-verification.rst | 231
-rw-r--r-- Documentation/trace/stm.rst | 4
-rw-r--r-- Documentation/trace/timerlat-tracer.rst | 182
-rw-r--r-- Documentation/trace/tracepoints.rst | 27
-rw-r--r-- Documentation/trace/uprobetracer.rst | 8
-rw-r--r-- Documentation/trace/user_events.rst | 238
42 files changed, 6755 insertions, 583 deletions
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst
index dcb390075ca1..d594597201fd 100644
--- a/Documentation/trace/boottime-trace.rst
+++ b/Documentation/trace/boottime-trace.rst
@@ -61,6 +61,10 @@ These options can be used for each instance including global ftrace node.
ftrace.[instance.INSTANCE.]options = OPT1[, OPT2[...]]
Enable given ftrace options.
+ftrace.[instance.INSTANCE.]tracing_on = 0|1
+ Enable/Disable tracing on this instance when starting boot-time tracing.
+ (you can enable it by the "traceon" event trigger action)
+
ftrace.[instance.INSTANCE.]trace_clock = CLOCK
Set given CLOCK to ftrace's trace_clock.
@@ -95,6 +99,12 @@ These options are setting per-event options.
ftrace.[instance.INSTANCE.]event.GROUP.EVENT.enable
Enable GROUP:EVENT tracing.
+ftrace.[instance.INSTANCE.]event.GROUP.enable
+ Enable all event tracing within GROUP.
+
+ftrace.[instance.INSTANCE.]event.enable
+ Enable all event tracing.
+
ftrace.[instance.INSTANCE.]event.GROUP.EVENT.filter = FILTER
Set FILTER rule to the GROUP:EVENT.
@@ -115,6 +125,83 @@ Note that kprobe and synthetic event definitions can be written under
instance node, but those are also visible from other instances. So please
take care for event name conflict.
+Ftrace Histogram Options
+------------------------
+
+Since a histogram action is too long to write as a string in the per-event
+'actions' option, tree-style options are provided under a per-event 'hist'
+subkey for the histogram actions. For the details of each parameter,
+please read the event histogram document (Documentation/trace/histogram.rst).
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]keys = KEY1[, KEY2[...]]
+ Set histogram key parameters. (Mandatory)
+ 'N' is a digit string used when there are multiple histograms. You can
+ omit it if there is only one histogram on the event.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]values = VAL1[, VAL2[...]]
+ Set histogram value parameters.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]sort = SORT1[, SORT2[...]]
+ Set histogram sort parameter options.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]size = NR_ENTRIES
+ Set histogram size (number of entries).
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]name = NAME
+ Set histogram name.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]var.VARIABLE = EXPR
+ Define a new VARIABLE by EXPR expression.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<pause|continue|clear>
+ Set histogram control parameter. You can set one of them.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmatch.[M.]event = GROUP.EVENT
+ Set histogram 'onmatch' handler matching event parameter.
+ 'M' is a digit string used when there are multiple 'onmatch' handlers. You can
+ omit it if there is only one 'onmatch' handler on this histogram.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmatch.[M.]trace = EVENT[, ARG1[...]]
+ Set histogram 'trace' action for 'onmatch'.
+ EVENT must be a synthetic event name, and ARG1... are parameters
+ for that event. Mandatory if 'onmatch.event' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmax.[M.]var = VAR
+ Set histogram 'onmax' handler variable parameter.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onchange.[M.]var = VAR
+ Set histogram 'onchange' handler variable parameter.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<onmax|onchange>.[M.]save = ARG1[, ARG2[...]]
+ Set histogram 'save' action parameters for 'onmax' or 'onchange' handler.
+ This option or below 'snapshot' option is mandatory if 'onmax.var' or
+ 'onchange.var' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<onmax|onchange>.[M.]snapshot
+ Set histogram 'snapshot' action for 'onmax' or 'onchange' handler.
+ This option or above 'save' option is mandatory if 'onmax.var' or
+ 'onchange.var' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.filter = FILTER_EXPR
+ Set histogram filter expression. You don't need 'if' in the FILTER_EXPR.
+
+Note that this 'hist' option can conflict with the per-event 'actions'
+option if the 'actions' option has a histogram action.
+
+
+When to Start
+=============
+
+All boot-time tracing options starting with ``ftrace`` will be enabled at the
+end of core_initcall. This means you can trace the events from postcore_initcall.
+Most of the subsystems and architecture dependent drivers will be initialized
+after that (arch_initcall or subsys_initcall). Thus, you can trace those with
+boot-time tracing.
+If you want to trace events before core_initcall, you can use the options
+starting with ``kernel``. Some of them will be enabled earlier than the initcall
+processing (for example, ``kernel.ftrace=function`` and ``kernel.trace_event``
+will start before the initcalls).
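+
+As an illustration, such early tracing can be requested with a ``kernel`` node
+in the boot config; a sketch (the traced event name here is only an example)::
+
+        kernel {
+                ftrace = function
+                trace_event = "sched:sched_switch"
+        }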
+
Examples
========
@@ -135,13 +222,23 @@ below::
}
synthetic.initcall_latency {
fields = "unsigned long func", "u64 lat"
- actions = "hist:keys=func.sym,lat:vals=lat:sort=lat"
+ hist {
+ keys = func.sym, lat
+ values = lat
+ sort = lat
+ }
}
- initcall.initcall_start {
- actions = "hist:keys=func:ts0=common_timestamp.usecs"
+ initcall.initcall_start.hist {
+ keys = func
+ var.ts0 = common_timestamp.usecs
}
- initcall.initcall_finish {
- actions = "hist:keys=func:lat=common_timestamp.usecs-$ts0:onmatch(initcall.initcall_start).initcall_latency(func,$lat)"
+ initcall.initcall_finish.hist {
+ keys = func
+ var.lat = common_timestamp.usecs - $ts0
+ onmatch {
+ event = initcall.initcall_start
+ trace = initcall_latency, func, $lat
+ }
}
}
@@ -164,6 +261,26 @@ is for tracing functions starting with "user\_", and others tracing
The instance node also accepts event nodes so that each instance
can customize its event tracing.
+With the trigger action and kprobes, you can run the function_graph tracer only
+while a specific function is executing. For example, this will trace all function
+calls within pci_proc_init()::
+
+ ftrace {
+ tracing_on = 0
+ tracer = function_graph
+ event.kprobes {
+ start_event {
+ probes = "pci_proc_init"
+ actions = "traceon"
+ }
+ end_event {
+ probes = "pci_proc_init%return"
+ actions = "traceoff"
+ }
+ }
+ }
+
+
This boot-time tracing also supports ftrace kernel parameters via boot
config.
For example, following kernel parameters::
diff --git a/Documentation/trace/coresight/coresight-config.rst b/Documentation/trace/coresight/coresight-config.rst
new file mode 100644
index 000000000000..6d5ffa6f7347
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-config.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+CoreSight System Configuration Manager
+======================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: October 2020
+
+Introduction
+============
+
+The CoreSight System Configuration manager is an API that allows the
+programming of the CoreSight system with pre-defined configurations that
+can then be easily enabled from sysfs or perf.
+
+Many CoreSight components can be programmed in complex ways - especially ETMs.
+In addition, components can interact across the CoreSight system, often via
+the cross trigger components such as CTI and CTM. These system settings can
+be defined and enabled as named configurations.
+
+
+Basic Concepts
+==============
+
+This section introduces the basic concepts of a CoreSight system configuration.
+
+
+Features
+--------
+
+A feature is a named set of programming for a CoreSight device. The programming
+is device dependent, and can be defined in terms of absolute register values,
+resource usage and parameter values.
+
+The feature is defined using a descriptor. This descriptor is loaded onto
+a matching device, either when the feature is loaded into the system, or when the
+CoreSight device is registered with the configuration manager.
+
+The load process involves interpreting the descriptor into a set of register
+accesses in the driver - the resource usage and parameter descriptions
+translated into appropriate register accesses. This interpretation makes it easy
+and efficient for the feature to be programmed onto the device when required.
+
+The feature will not be active on the device until the feature is enabled, and
+the device itself is enabled. When the device is enabled then enabled features
+will be programmed into the device hardware.
+
+A feature is enabled as part of a configuration being enabled on the system.
+
+
+Parameter Value
+~~~~~~~~~~~~~~~
+
+A parameter value is a named value that may be set by the user prior to the
+feature being enabled that can adjust the behaviour of the operation programmed
+by the feature.
+
+For example, this could be a count value in a programmed operation that repeats
+at a given rate. When the feature is enabled then the current value of the
+parameter is used in programming the device.
+
+The feature descriptor defines a default value for a parameter, which is used
+if the user does not supply a new value.
+
+Users can update parameter values using the configfs API for the CoreSight
+system - which is described below.
+
+The current value of the parameter is loaded into the device when the feature
+is enabled on that device.
+
+
+Configurations
+--------------
+
+A configuration defines a set of features that are to be used in a trace
+session where the configuration is selected. For any trace session only one
+configuration may be selected.
+
+The features defined may be on any type of device that is registered
+to support system configuration. A configuration may select features to be
+enabled on a class of devices - i.e. any ETMv4, or specific devices, e.g. a
+specific CTI on the system.
+
+As with the feature, a descriptor is used to define the configuration.
+This will define the features that must be enabled as part of the configuration
+as well as any preset values that can be used to override default parameter
+values.
+
+
+Preset Values
+~~~~~~~~~~~~~
+
+Preset values are easily selectable sets of parameter values for the features
+that the configuration uses. The number of values in a single preset set equals
+the total number of parameters in the features used by the configuration.
+
+e.g. a configuration consists of 3 features, one has 2 parameters, one has
+a single parameter, and another has no parameters. A single preset set will
+therefore have 3 values.
+
+Presets are optionally defined by the configuration, up to 15 can be defined.
+If no preset is selected, then the parameter values defined in the feature
+are used as normal.
+
+
+Operation
+~~~~~~~~~
+
+The following steps take place in the operation of a configuration.
+
+1) In this example, the configuration is 'autofdo', which has an
+ associated feature 'strobing' that works on ETMv4 CoreSight Devices.
+
+2) The configuration is enabled. For example 'perf' may select the
+ configuration as part of its command line::
+
+ perf record -e cs_etm/autofdo/ myapp
+
+ which will enable the 'autofdo' configuration.
+
+3) perf starts tracing on the system. As each ETMv4 that perf uses for
+ trace is enabled, the configuration manager will check if the ETMv4
+ has a feature that relates to the currently active configuration.
+ In this case 'strobing' is enabled & programmed into the ETMv4.
+
+4) When the ETMv4 is disabled, any registers marked as needing to be
+ saved will be read back.
+
+5) At the end of the perf session, the configuration will be disabled.
+
+
+Viewing Configurations and Features
+===================================
+
+The set of configurations and features that are currently loaded into the
+system can be viewed using the configfs API.
+
+Mount configfs as normal and the 'cs-syscfg' subsystem will appear::
+
+ $ ls /config
+ cs-syscfg stp-policy
+
+This has two sub-directories::
+
+ $ cd cs-syscfg/
+ $ ls
+ configurations features
+
+The system has the configuration 'autofdo' built in. It may be examined as
+follows::
+
+ $ cd configurations/
+ $ ls
+ autofdo
+ $ cd autofdo/
+ $ ls
+ description feature_refs preset1 preset3 preset5 preset7 preset9
+ enable preset preset2 preset4 preset6 preset8
+ $ cat description
+ Setup ETMs with strobing for autofdo
+ $ cat feature_refs
+ strobing
+
+Each preset declared has a 'preset<n>' subdirectory declared. The values for
+the preset can be examined::
+
+ $ cat preset1/values
+ strobing.window = 0x1388 strobing.period = 0x2
+ $ cat preset2/values
+ strobing.window = 0x1388 strobing.period = 0x4
+
+The 'enable' and 'preset' files allow the control of a configuration when
+using CoreSight with sysfs.
+
+The features referenced by the configuration can be examined in the features
+directory::
+
+ $ cd ../../features/strobing/
+ $ ls
+ description matches nr_params params
+ $ cat description
+ Generate periodic trace capture windows.
+ parameter 'window': a number of CPU cycles (W)
+ parameter 'period': trace enabled for W cycles every period x W cycles
+ $ cat matches
+ SRC_ETMV4
+ $ cat nr_params
+ 2
+
+Move to the params directory to examine and adjust parameters::
+
+ cd params
+ $ ls
+ period window
+ $ cd period
+ $ ls
+ value
+ $ cat value
+ 0x2710
+ # echo 15000 > value
+ # cat value
+ 0x3a98
+
+Parameters adjusted in this way are reflected in all device instances that have
+loaded the feature.
+
+
+Using Configurations in perf
+============================
+
+The configurations loaded into the CoreSight configuration management are
+also declared in the perf 'cs_etm' event infrastructure so that they can
+be selected when running trace under perf::
+
+ $ ls /sys/devices/cs_etm
+ cpu0 cpu2 events nr_addr_filters power subsystem uevent
+ cpu1 cpu3 format perf_event_mux_interval_ms sinks type
+
+The key directory here is 'events' - a generic perf directory which allows
+selection on the perf command line. As with the sinks entries, this provides
+a hash of the configuration name.
+
+The entry in the 'events' directory uses perf's built-in syntax generator
+to substitute the syntax for the name when evaluating the command::
+
+ $ ls events/
+ autofdo
+ $ cat events/autofdo
+ configid=0xa7c3dddd
+
+The 'autofdo' configuration may be selected on the perf command line::
+
+ $ perf record -e cs_etm/autofdo/u --per-thread <application>
+
+A preset to override the current parameter values can also be selected::
+
+ $ perf record -e cs_etm/autofdo,preset=1/u --per-thread <application>
+
+When configurations are selected in this way, then the trace sink used is
+automatically selected.
+
+Using Configurations in sysfs
+=============================
+
+Coresight can be controlled using sysfs. When this is in use, a configuration
+can be made active for the devices that are used in the sysfs session.
+
+In a configuration there are 'enable' and 'preset' files.
+
+To enable a configuration for use with sysfs::
+
+ $ cd configurations/autofdo
+ $ echo 1 > enable
+
+This will then use any default parameter values in the features - which can be
+adjusted as described above.
+
+To use a preset<n> set of parameter values::
+
+ $ echo 3 > preset
+
+This will select preset3 for the configuration.
+The valid values for preset are 0 - to deselect presets, and any value of
+<n> where a preset<n> sub-directory is present.
+
+Note that the active sysfs configuration is a global parameter, therefore
+only a single configuration can be active for sysfs at any one time.
+Attempting to enable a second configuration will result in an error.
+Additionally, attempting to disable the configuration while in use will
+also result in an error.
+
+The use of the active configuration by sysfs is independent of the configuration
+used in perf.
+
+
+Creating and Loading Custom Configurations
+==========================================
+
+Custom configurations and / or features can be dynamically loaded into the
+system by using a loadable module.
+
+An example of a custom configuration is found in ./samples/coresight.
+
+This creates a new configuration that uses the existing built in
+strobing feature, but provides a different set of presets.
+
+When the module is loaded, then the configuration appears in the configfs
+file system and is selectable in the same way as the built in configuration
+described above.
+
+Configurations can use previously loaded features. The system will ensure
+that it is not possible to unload a feature that is currently in use, by
+enforcing the unload order as the strict reverse of the load order.
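+
+A minimal sketch of such a module is shown below. It assumes the
+``cscfg_load_config_sets()`` / ``cscfg_unload_config_sets()`` interface used by
+the in-tree sample; the descriptor definitions (``my_config`` here) and the
+include path are placeholders - see samples/coresight for the real descriptor
+definitions::
+
+    #include <linux/module.h>
+    #include "coresight-syscfg.h"
+
+    /* NULL terminated arrays of descriptors, defined as in samples/coresight */
+    static struct cscfg_config_desc *my_config_descs[] = { &my_config, NULL };
+    static struct cscfg_feature_desc *my_feature_descs[] = { NULL };
+
+    static int __init my_cfg_init(void)
+    {
+        /* load the configurations and features into the manager */
+        return cscfg_load_config_sets(my_config_descs, my_feature_descs);
+    }
+    module_init(my_cfg_init);
+
+    static void __exit my_cfg_exit(void)
+    {
+        /* unload - enforced as the strict reverse of the load order */
+        cscfg_unload_config_sets(my_config_descs, my_feature_descs);
+    }
+    module_exit(my_cfg_exit);
+
+    MODULE_LICENSE("GPL");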
diff --git a/Documentation/trace/coresight/coresight-cpu-debug.rst b/Documentation/trace/coresight/coresight-cpu-debug.rst
index 993dd294b81b..836b35532667 100644
--- a/Documentation/trace/coresight/coresight-cpu-debug.rst
+++ b/Documentation/trace/coresight/coresight-cpu-debug.rst
@@ -117,7 +117,8 @@ divide into below cases:
Device Tree Bindings
--------------------
-See Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt for details.
+See Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml for
+details.
How to use the module
diff --git a/Documentation/trace/coresight/coresight-ect.rst b/Documentation/trace/coresight/coresight-ect.rst
new file mode 100644
index 000000000000..a68732c5c6d6
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-ect.rst
@@ -0,0 +1,226 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+CoreSight Embedded Cross Trigger (CTI & CTM).
+=============================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: November 2019
+
+Hardware Description
+--------------------
+
+The CoreSight Cross Trigger Interface (CTI) is a hardware device that takes
+individual input and output hardware signals known as triggers to and from
+devices and interconnects them via the Cross Trigger Matrix (CTM) to other
+devices via numbered channels, in order to propagate events between devices.
+
+e.g.::
+
+ 0000000 in_trigs :::::::
+ 0 C 0----------->: : +======>(other CTI channel IO)
+ 0 P 0<-----------: : v
+ 0 U 0 out_trigs : : Channels ***** :::::::
+ 0000000 : CTI :<=========>*CTM*<====>: CTI :---+
+ ####### in_trigs : : (id 0-3) ***** ::::::: v
+ # ETM #----------->: : ^ #######
+ # #<-----------: : +---# ETR #
+ ####### out_trigs ::::::: #######
+
+The CTI driver enables the programming of the CTI to attach triggers to
+channels. When an input trigger becomes active, the attached channel will
+become active. Any output trigger attached to that channel will also
+become active. The active channel is propagated to other CTIs via the CTM,
+activating connected output triggers there, unless filtered by the CTI
+channel gate.
+
+It is also possible to activate a channel using system software directly
+programming registers in the CTI.
+
+The CTIs are registered by the system to be associated with CPUs and/or other
+CoreSight devices on the trace data path. When these devices are enabled the
+attached CTIs will also be enabled. By default/on power up the CTIs have
+no programmed trigger/channel attachments, so will not affect the system
+until explicitly programmed.
+
+The hardware trigger connections between CTIs and devices are implementation
+defined, unless the CPU/ETM combination is a v8 architecture, in which case
+the connections have an architecturally defined standard layout.
+
+The hardware trigger signals can also be connected to non-CoreSight devices
+(e.g. UART), or be propagated off chip as hardware IO lines.
+
+All the CTI devices are associated with a CTM. On many systems there will be a
+single effective CTM (one CTM, or multiple CTMs all interconnected), but it is
+possible that systems can have nets of CTIs+CTM that are not interconnected by
+a CTM to each other. On these systems a CTM index is declared to associate
+CTI devices that are interconnected via a given CTM.
+
+Sysfs files and directories
+---------------------------
+
+The CTI devices appear on the existing CoreSight bus alongside the other
+CoreSight devices::
+
+ >$ ls /sys/bus/coresight/devices
+ cti_cpu0 cti_cpu2 cti_sys0 etm0 etm2 funnel0 replicator0 tmc_etr0
+ cti_cpu1 cti_cpu3 cti_sys1 etm1 etm3 funnel1 tmc_etf0 tpiu0
+
+The ``cti_cpu<N>`` named CTIs are associated with a CPU, and any ETM used by
+that core. The ``cti_sys<N>`` CTIs are general system infrastructure CTIs that
+can be associated with other CoreSight devices, or other system hardware
+capable of generating or using trigger signals::
+
+ >$ ls /sys/bus/coresight/devices/etm0/cti_cpu0
+ channels ctmid enable nr_trigger_cons mgmt power powered regs
+ connections subsystem triggers0 triggers1 uevent
+
+*Key file items are:-*
+ * ``enable``: enables/disables the CTI. Read to determine current state.
+ If this shows as enabled (1), but ``powered`` shows unpowered (0), then
+ the enable indicates a request to enable when the device is powered.
+ * ``ctmid`` : associated CTM - only relevant if system has multiple CTI+CTM
+ clusters that are not interconnected.
+ * ``nr_trigger_cons`` : total connections - triggers<N> directories.
+ * ``powered`` : Read to determine if the CTI is currently powered.
+
+*Sub-directories:-*
+ * ``triggers<N>``: contains list of triggers for an individual connection.
+ * ``channels``: Contains the channel API - CTI main programming interface.
+ * ``regs``: Gives access to the raw programmable CTI regs.
+ * ``mgmt``: the standard CoreSight management registers.
+ * ``connections``: Links to connected *CoreSight* devices. The number of
+ links can be 0 to ``nr_trigger_cons``. Actual number given by ``nr_links``
+ in this directory.
+
+
+triggers<N> directories
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Individual trigger connection information. This describes trigger signals for
+CoreSight and non-CoreSight connections.
+
+Each triggers directory has a set of parameters describing the triggers for
+the connection.
+
+ * ``name`` : name of connection
+ * ``in_signals`` : input trigger signal indexes used in this connection.
+ * ``in_types`` : functional types for in signals.
+ * ``out_signals`` : output trigger signals for this connection.
+ * ``out_types`` : functional types for out signals.
+
+e.g::
+
+ >$ ls ./cti_cpu0/triggers0/
+ in_signals in_types name out_signals out_types
+ >$ cat ./cti_cpu0/triggers0/name
+ cpu0
+ >$ cat ./cti_cpu0/triggers0/out_signals
+ 0-2
+ >$ cat ./cti_cpu0/triggers0/out_types
+ pe_edbgreq pe_dbgrestart pe_ctiirq
+ >$ cat ./cti_cpu0/triggers0/in_signals
+ 0-1
+ >$ cat ./cti_cpu0/triggers0/in_types
+ pe_dbgtrigger pe_pmuirq
+
+If a connection has zero signals in either the 'in' or 'out' triggers then
+those parameters will be omitted.
+
+Channels API Directory
+~~~~~~~~~~~~~~~~~~~~~~
+
+This provides an easy way to attach triggers to channels, without needing
+the multiple register operations that are required if manipulating the
+'regs' sub-directory elements directly.
+
+A number of files provide this API::
+
+ >$ ls ./cti_sys0/channels/
+ chan_clear chan_inuse chan_xtrigs_out trigin_attach
+ chan_free chan_pulse chan_xtrigs_reset trigin_detach
+ chan_gate_disable chan_set chan_xtrigs_sel trigout_attach
+ chan_gate_enable chan_xtrigs_in trig_filter_enable trigout_detach
+ trigout_filtered
+
+Most accesses to these elements take the form::
+
+ echo <chan> [<trigger>] > /<device_path>/<operation>
+
+where the optional <trigger> is only needed for trigXX_attach | detach
+operations.
+
+e.g.::
+
+ >$ echo 0 1 > ./cti_sys0/channels/trigout_attach
+ >$ echo 0 > ./cti_sys0/channels/chan_set
+
+Attaches trigout(1) to channel(0), then activates channel(0) generating a
+set state on cti_sys0.trigout(1)
+
+
+*API operations*
+
+ * ``trigin_attach, trigout_attach``: Attach a channel to a trigger signal.
+ * ``trigin_detach, trigout_detach``: Detach a channel from a trigger signal.
+ * ``chan_set``: Set the channel - the set state will be propagated around
+ the CTM to other connected devices.
+ * ``chan_clear``: Clear the channel.
+ * ``chan_pulse``: Set the channel for a single CoreSight clock cycle.
+ * ``chan_gate_enable``: Write operation sets the CTI gate to propagate
+ (enable) the channel to other devices. This operation takes a channel
+ number. CTI gate is enabled for all channels by default at power up. Read
+ to list the currently enabled channels on the gate.
+ * ``chan_gate_disable``: Write channel number to disable gate for that
+ channel.
+ * ``chan_inuse``: Show the current channels attached to any signal
+ * ``chan_free``: Show channels with no attached signals.
+ * ``chan_xtrigs_sel``: write a channel number to select a channel to view,
+ read to show the selected channel number.
+ * ``chan_xtrigs_in``: Read to show the input triggers attached to
+ the selected view channel.
+ * ``chan_xtrigs_out``: Read to show the output triggers attached to
+ the selected view channel.
+ * ``trig_filter_enable``: Defaults to enabled, disable to allow potentially
+ dangerous output signals to be set.
+ * ``trigout_filtered``: Trigger out signals that are prevented from being
+ set if filtering ``trig_filter_enable`` is enabled. One use is to prevent
+ accidental ``EDBGREQ`` signals stopping a core.
+ * ``chan_xtrigs_reset``: Write 1 to clear all channel / trigger programming.
+ Resets device hardware to default state.
+
+
+The example below attaches input trigger index 1 to channel 2, and output
+trigger index 6 to the same channel. It then examines the state of the
+channel / trigger connections using the appropriate sysfs attributes.
+
+The settings mean that if either input trigger 1, or channel 2 go active then
+trigger out 6 will go active. We then enable the CTI, and use the software
+channel control to activate channel 2. We see the active channel on the
+``choutstatus`` register and the active signal on the ``trigoutstatus``
+register. Finally clearing the channel removes this.
+
+e.g.::
+
+ .../cti_sys0/channels# echo 2 1 > trigin_attach
+ .../cti_sys0/channels# echo 2 6 > trigout_attach
+ .../cti_sys0/channels# cat chan_free
+ 0-1,3
+ .../cti_sys0/channels# cat chan_inuse
+ 2
+ .../cti_sys0/channels# echo 2 > chan_xtrigs_sel
+ .../cti_sys0/channels# cat chan_xtrigs_in
+ 1
+ .../cti_sys0/channels# cat chan_xtrigs_out
+ 6
+ .../cti_sys0/# echo 1 > enable
+ .../cti_sys0/channels# echo 2 > chan_set
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x4
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x40
+ .../cti_sys0/channels# echo 2 > chan_clear
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x0
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x0
diff --git a/Documentation/trace/coresight/coresight-etm4x-reference.rst b/Documentation/trace/coresight/coresight-etm4x-reference.rst
index b64d9a9c79df..70e34b8c81c1 100644
--- a/Documentation/trace/coresight/coresight-etm4x-reference.rst
+++ b/Documentation/trace/coresight/coresight-etm4x-reference.rst
@@ -71,6 +71,20 @@ the ‘TRC’ prefix.
----
+:File: ``ts_source`` (ro)
+:Trace Registers: None.
+:Notes:
+ When FEAT_TRF is implemented, value of TRFCR_ELx.TS used for trace session. Otherwise -1
+ indicates an unknown time source. Check trcidr0.tssize to see if a global timestamp is
+ available.
+
+:Example:
+ ``$> cat ts_source``
+
+ ``$> 1``
+
+----
+
:File: ``addr_idx`` (rw)
:Trace Registers: None.
:Notes:
@@ -427,7 +441,7 @@ the ‘TRC’ prefix.
:Syntax:
``echo idx > vmid_idx``
- Where idx <  numvmidc
+ Where idx < numvmidc
----
@@ -650,13 +664,26 @@ Bit assignments shown below:-
parameter is set this value is applied to the currently indexed
address range.
+.. _coresight-branch-broadcast:
**bit (4):**
ETM_MODE_BB
**description:**
- Set to enable branch broadcast if supported in hardware [IDR0].
+ Set to enable branch broadcast if supported in hardware [IDR0]. The primary use for this feature
+ is when code is patched dynamically at run time and the full program flow may not be able to be
+ reconstructed using only conditional branches.
+
+ There is currently no support in Perf for supplying modified binaries to the decoder, so this
+ feature is only intended to be used for debugging purposes or with a 3rd party tool.
+
+ Choosing this option will result in a significant increase in the amount of trace generated -
+ with a possible danger of overflows, or fewer instructions covered. Note that this option also
+ overrides any setting of :ref:`ETM_MODE_RETURNSTACK <coresight-return-stack>`, so where a branch
+ broadcast range overlaps a return stack range, return stacks will not be available for that
+ range.
+.. _coresight-cycle-accurate:
**bit (5):**
ETMv4_MODE_CYCACC
@@ -678,6 +705,7 @@ Bit assignments shown below:-
**description:**
Set to enable virtual machine ID tracing if supported [IDR2].
+.. _coresight-timestamp:
**bit (11):**
ETMv4_MODE_TIMESTAMP
@@ -685,6 +713,7 @@ Bit assignments shown below:-
**description:**
Set to enable timestamp generation if supported [IDR0].
+.. _coresight-return-stack:
**bit (12):**
ETM_MODE_RETURNSTACK
diff --git a/Documentation/trace/coresight/coresight-perf.rst b/Documentation/trace/coresight/coresight-perf.rst
new file mode 100644
index 000000000000..d087aae7d492
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-perf.rst
@@ -0,0 +1,158 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+CoreSight - Perf
+================
+
+ :Author: Carsten Haitzler <carsten.haitzler@arm.com>
+ :Date: June 29th, 2022
+
+Perf is able to locally access CoreSight trace data and store it to the
+output perf data files. This data can then be later decoded to give the
+instructions that were traced for debugging or profiling purposes. You
+can log such data with a perf record command like::
+
+ perf record -e cs_etm//u testbinary
+
+This would run some test binary (testbinary) until it exits and record
+a perf.data trace file. That file would have AUX sections if CoreSight
+is working correctly. You can dump the content of this file as
+readable text with a command like::
+
+ perf report --stdio --dump -i perf.data
+
+You should find some sections of this file have AUX data blocks like::
+
+ 0x1e78 [0x30]: PERF_RECORD_AUXTRACE size: 0x11dd0 offset: 0 ref: 0x1b614fc1061b0ad1 idx: 0 tid: 531230 cpu: -1
+
+ . ... CoreSight ETM Trace data: size 73168 bytes
+ Idx:0; ID:10; I_ASYNC : Alignment Synchronisation.
+ Idx:12; ID:10; I_TRACE_INFO : Trace Info.; INFO=0x0 { CC.0 }
+ Idx:17; ID:10; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0x0000000000000000;
+ Idx:26; ID:10; I_TRACE_ON : Trace On.
+ Idx:27; ID:10; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000FFFFB6069140; Ctxt: AArch64,EL0, NS;
+ Idx:38; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:39; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:40; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:41; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEN
+ ...
+
+If you see these above, then your system is tracing CoreSight data
+correctly.
+
+To compile perf with CoreSight support in the tools/perf directory do::
+
+ make CORESIGHT=1
+
+This requires OpenCSD to build. You may install distribution packages
+for this support, such as libopencsd and libopencsd-dev, or download it
+and build it yourself. Upstream OpenCSD is located at:
+
+ https://github.com/Linaro/OpenCSD
+
+For complete information on building perf with CoreSight support and
+more extensive usage look at:
+
+ https://github.com/Linaro/OpenCSD/blob/master/HOWTO.md
+
+
+Kernel CoreSight Support
+------------------------
+
+You will also want CoreSight support enabled in your kernel config.
+Ensure it is enabled with::
+
+ CONFIG_CORESIGHT=y
+
+There are various other CoreSight options you probably also want
+enabled like::
+
+ CONFIG_CORESIGHT_LINKS_AND_SINKS=y
+ CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
+ CONFIG_CORESIGHT_CATU=y
+ CONFIG_CORESIGHT_SINK_TPIU=y
+ CONFIG_CORESIGHT_SINK_ETBV10=y
+ CONFIG_CORESIGHT_SOURCE_ETM4X=y
+ CONFIG_CORESIGHT_CTI=y
+ CONFIG_CORESIGHT_CTI_INTEGRATION_REGS=y
+
+Please refer to the kernel configuration help for more information.
+
+Perf test - Verify kernel and userspace perf CoreSight work
+-----------------------------------------------------------
+
+When you run perf test, it will do a lot of self tests. Some of those
+tests will cover CoreSight (only if enabled and on ARM64). You
+generally would run perf test from the tools/perf directory in the
+kernel tree. Some tests will check some internal perf support like:
+
+ Check Arm CoreSight trace data recording and synthesized samples
+ Check Arm SPE trace data recording and synthesized samples
+
+Some others will actually use perf record and some test binaries that
+are in tests/shell/coresight and will collect traces to ensure a
+minimum level of functionality is met. The scripts that launch these
+tests are in the same directory. These will all look like:
+
+ CoreSight / ASM Pure Loop
+ CoreSight / Memcpy 16k 10 Threads
+ CoreSight / Thread Loop 10 Threads - Check TID
+ etc.
+
+These perf record tests will not run if the tool binaries do not exist
+in tests/shell/coresight/\*/ and will be skipped. If you do not have
+CoreSight support in hardware then either do not build perf with
+CoreSight support or remove these binaries so that these tests are
+skipped instead of failing.
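+
+To run only the CoreSight tests rather than the whole suite, perf test accepts
+a name filter; a sketch (exact test names can differ between kernel versions)::
+
+    $ cd tools/perf
+    $ ./perf test -v 'CoreSight'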
+
+These tests will log historical results in the current working
+directory (e.g. tools/perf) and will be named stats-\*.csv like:
+
+ stats-asm_pure_loop-out.csv
+ stats-memcpy_thread-16k_10.csv
+ ...
+
+These statistic files log some aspects of the AUX data sections in
+the perf data output counting some numbers of certain encodings (a
+good way to know that it's working in a very simple way). One problem
+with CoreSight is that given a large enough amount of data needing to
+be logged, some of it can be lost due to the processor not waking up
+in time to read out all the data from buffers, etc. You will notice
+that the amount of data collected can vary a lot per run of perf test.
+If you wish to see how this changes over time, simply run perf test
+multiple times and all these csv files will have more and more data
+appended to them that you can later examine, graph and otherwise use to
+figure out if things have become worse or better.
+
+This means sometimes these tests fail as they don't capture all the
+data needed. This is about tracking quality and amount of data
+produced over time and to see when changes to the Linux kernel improve
+quality of traces.
+
+Be aware that some of these tests take quite a while to run, specifically
+in processing the perf data file and dumping contents to then examine what
+is inside.
+
+You can change where these csv logs are stored by setting the
+PERF_TEST_CORESIGHT_STATDIR environment variable before running perf
+test like::
+
+ export PERF_TEST_CORESIGHT_STATDIR=/var/tmp
+ perf test
+
+They will also store resulting perf output data in the current
+directory for later inspection like::
+
+ perf-asm_pure_loop-out.data
+ perf-memcpy_thread-16k_10.data
+ ...
+
+You can alter where the perf data files are stored by setting the
+PERF_TEST_CORESIGHT_DATADIR environment variable such as::
+
+ PERF_TEST_CORESIGHT_DATADIR=/var/tmp
+ perf test
+
+You may wish to set these environment variables if you want to
+keep the output of tests outside of the current working directory for
+longer term storage and examination.
diff --git a/Documentation/trace/coresight/coresight-trbe.rst b/Documentation/trace/coresight/coresight-trbe.rst
new file mode 100644
index 000000000000..b9928ef148da
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-trbe.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Trace Buffer Extension (TRBE).
+==============================
+
+ :Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ :Date: November 2020
+
+Hardware Description
+--------------------
+
+Trace Buffer Extension (TRBE) is a percpu hardware block which captures, in system
+memory, CPU traces generated from a corresponding percpu tracing unit. This
+gets plugged in as a coresight sink device because the corresponding trace
+generators (ETE) are plugged in as source devices.
+
+The TRBE is not compliant with the CoreSight architecture specifications, but is
+driven via the CoreSight driver framework to support the ETE (which is
+CoreSight compliant) integration.
+
+Sysfs files and directories
+---------------------------
+
+The TRBE devices appear on the existing coresight bus alongside the other
+coresight devices::
+
+ >$ ls /sys/bus/coresight/devices
+ trbe0 trbe1 trbe2 trbe3
+
+The ``trbe<N>`` named TRBEs are each associated with a CPU::
+
+ >$ ls /sys/bus/coresight/devices/trbe0/
+ align flag
+
+*Key file items are:-*
+ * ``align``: TRBE write pointer alignment
+ * ``flag``: TRBE updates memory with access and dirty flags
diff --git a/Documentation/trace/coresight/coresight.rst b/Documentation/trace/coresight/coresight.rst
index a566719f8e7e..4a71ea6cb390 100644
--- a/Documentation/trace/coresight/coresight.rst
+++ b/Documentation/trace/coresight/coresight.rst
@@ -130,7 +130,7 @@ Misc:
Device Tree Bindings
--------------------
-See Documentation/devicetree/bindings/arm/coresight.txt for details.
+See Documentation/devicetree/bindings/arm/arm,coresight-\*.yaml for details.
As of this writing drivers for ITM, STMs and CTIs are not provided but are
expected to be added as the solution matures.
@@ -241,6 +241,92 @@ to the newer scheme, to give a confirmation that what you see on your
system is not unexpected. One must use the "names" as they appear on
the system under specified locations.
+Topology Representation
+-----------------------
+
+Each CoreSight component has a ``connections`` directory which will contain
+links to other CoreSight components. This allows the user to explore the trace
+topology and for larger systems, determine the most appropriate sink for a
+given source. The connection information can also be used to establish
+which CTI devices are connected to a given component. This directory contains a
+``nr_links`` attribute detailing the number of links in the directory.
+
+For an ETM source, in this case ``etm0`` on a Juno platform, a typical
+arrangement will be::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/etm0/connections
+ <file details> cti_cpu0 -> ../../../23020000.cti/cti_cpu0
+ <file details> nr_links
+ <file details> out:0 -> ../../../230c0000.funnel/funnel2
+
+Following the out port to ``funnel2``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel2/connections
+ <file details> in:0 -> ../../../23040000.etm/etm0
+ <file details> in:1 -> ../../../23140000.etm/etm3
+ <file details> in:2 -> ../../../23240000.etm/etm4
+ <file details> in:3 -> ../../../23340000.etm/etm5
+ <file details> nr_links
+ <file details> out:0 -> ../../../20040000.funnel/funnel0
+
+And again to ``funnel0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel0/connections
+ <file details> in:0 -> ../../../220c0000.funnel/funnel1
+ <file details> in:1 -> ../../../230c0000.funnel/funnel2
+ <file details> nr_links
+ <file details> out:0 -> ../../../20010000.etf/tmc_etf0
+
+Finding the first sink ``tmc_etf0``. This can be used to collect data
+as a sink, or as a link to propagate further along the chain::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etf0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20040000.funnel/funnel0
+ <file details> nr_links
+ <file details> out:0 -> ../../../20150000.funnel/funnel4
+
+via ``funnel4``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel4/connections
+ <file details> in:0 -> ../../../20010000.etf/tmc_etf0
+ <file details> in:1 -> ../../../20140000.etf/tmc_etf1
+ <file details> nr_links
+ <file details> out:0 -> ../../../20120000.replicator/replicator0
+
+and a ``replicator0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/replicator0/connections
+ <file details> in:0 -> ../../../20150000.funnel/funnel4
+ <file details> nr_links
+ <file details> out:0 -> ../../../20030000.tpiu/tpiu0
+ <file details> out:1 -> ../../../20070000.etr/tmc_etr0
+
+Arriving at the final sink in the chain, ``tmc_etr0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etr0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20120000.replicator/replicator0
+ <file details> nr_links
+
+As described below, when using sysfs it is sufficient to enable a sink and
+a source for successful trace. The framework will correctly enable all
+intermediate links as required.
+
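+For example (device names as in the listings above), enabling the ETR sink and
+one ETM source from the shell is enough to set up the whole path::
+
+    # echo 1 > /sys/bus/coresight/devices/tmc_etr0/enable_sink
+    # echo 1 > /sys/bus/coresight/devices/etm0/enable_source
+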
+Note: ``cti_sys0`` appears in two of the connections lists above.
+CTIs can connect to multiple devices and are arranged in a star topology
+via the CTM. See (Documentation/trace/coresight/coresight-ect.rst)
+[#fourth]_ for further details.
+Looking at this device we see 4 connections::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/cti_sys0/connections
+ <file details> nr_links
+ <file details> stm0 -> ../../../20100000.stm/stm0
+ <file details> tmc_etf0 -> ../../../20010000.etf/tmc_etf0
+ <file details> tmc_etr0 -> ../../../20070000.etr/tmc_etr0
+ <file details> tpiu0 -> ../../../20030000.tpiu/tpiu0
+
+
How to use the tracer modules
-----------------------------
@@ -253,7 +339,8 @@ Preference is given to the former as using the sysFS interface
requires a deep understanding of the Coresight HW. The following sections
provide details on using both methods.
-1) Using the sysFS interface:
+Using the sysFS interface
+~~~~~~~~~~~~~~~~~~~~~~~~~
Before trace collection can start, a coresight sink needs to be identified.
There is no limit on the amount of sinks (nor sources) that can be enabled at
@@ -360,7 +447,8 @@ wealth of possibilities that coresight provides.
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
Timestamp Timestamp: 17107041535
-2) Using perf framework:
+Using perf framework
+~~~~~~~~~~~~~~~~~~~~
Coresight tracers are represented using the Perf framework's Performance
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
@@ -409,7 +497,11 @@ More information on the above and other example on how to use Coresight with
the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub
repository [#third]_.
-2.1) AutoFDO analysis using the perf tools:
+Advanced perf framework usage
+-----------------------------
+
+AutoFDO analysis using the perf tools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
perf can be used to record and analyze trace of programs.
@@ -427,9 +519,42 @@ The --itrace option controls the type and frequency of synthesized events
Note that only 64-bit programs are currently supported - further work is
required to support instruction decode of 32-bit Arm programs.
+Tracing PID
+~~~~~~~~~~~
+
+The kernel can be built to write the PID value into the PE ContextID registers.
+For a kernel running at EL1, the PID is stored in CONTEXTIDR_EL1. A PE may
+implement the Arm Virtualization Host Extensions (VHE), with which the kernel can
+run at EL2 as a virtualisation host; in this case, the PID value is stored in
+CONTEXTIDR_EL2.
+
+perf provides PMU formats that program the ETM to insert these values into the
+trace data; the PMU formats are defined as below:
+
+ "contextid1": Available on both EL1 kernel and EL2 kernel. When the
+ kernel is running at EL1, "contextid1" enables the PID
+ tracing; when the kernel is running at EL2, this enables
+ tracing the PID of guest applications.
+
+ "contextid2": Only usable when the kernel is running at EL2. When
+ selected, enables PID tracing on EL2 kernel.
+
+ "contextid": Will be an alias for the option that enables PID
+ tracing. I.e,
+ contextid == contextid1, on EL1 kernel.
+ contextid == contextid2, on EL2 kernel.
+
+perf will always enable PID tracing at the relevant EL; this is accomplished by
+automatically enabling the "contextid" config - but for EL2 it is possible to make
+specific adjustments using the configs "contextid1" and "contextid2". E.g. if a user
+wants to trace PIDs for both host and guest, the two configs "contextid1" and
+"contextid2" can be set at the same time:
+
+ perf record -e cs_etm/contextid1,contextid2/u -- vm
+
Generating coverage files for Feedback Directed Optimization: AutoFDO
----------------------------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'perf inject' accepts the --itrace option in which case tracing data is
removed and replaced with the synthesized events. e.g.
@@ -460,6 +585,45 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
Bubble sorting array of 30000 elements
5806 ms
+Config option formats
+~~~~~~~~~~~~~~~~~~~~~
+
+The following strings can be provided between // on the perf command line to enable various
+options; an example combining several of them follows the table below. They are also listed
+in the directory /sys/bus/event_source/devices/cs_etm/format/
+
+.. list-table::
+ :header-rows: 1
+
+ * - Option
+ - Description
+ * - branch_broadcast
+ - Session local version of the system wide setting:
+ :ref:`ETM_MODE_BB <coresight-branch-broadcast>`
+ * - contextid
+ - See `Tracing PID`_
+ * - contextid1
+ - See `Tracing PID`_
+ * - contextid2
+ - See `Tracing PID`_
+ * - configid
+ - Selection for a custom configuration. This is an implementation detail and not used directly,
+ see :ref:`trace/coresight/coresight-config:Using Configurations in perf`
+ * - preset
+ - Override for parameters in a custom configuration, see
+ :ref:`trace/coresight/coresight-config:Using Configurations in perf`
+ * - sinkid
+ - Hashed version of the string to select a sink, automatically set when using the @ notation.
+ This is an internal implementation detail and is not used directly, see `Using perf
+ framework`_.
+ * - cycacc
+ - Session local version of the system wide setting: :ref:`ETMv4_MODE_CYCACC
+ <coresight-cycle-accurate>`
+ * - retstack
+ - Session local version of the system wide setting: :ref:`ETM_MODE_RETURNSTACK
+ <coresight-return-stack>`
+ * - timestamp
+ - Session local version of the system wide setting: :ref:`ETMv4_MODE_TIMESTAMP
+ <coresight-timestamp>`
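+
+For example (the application name is a placeholder), several of these options
+can be combined in a single session::
+
+    perf record -e cs_etm/timestamp,cycacc/u --per-thread <application>
+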
How to use the STM module
-------------------------
@@ -489,10 +653,39 @@ interface provided for that purpose by the generic STM API::
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/stm0
root@genericarmv8:~#
-Details on how to use the generic STM API can be found here:- :doc:`../stm` [#second]_.
+Details on how to use the generic STM API can be found here:
+- Documentation/trace/stm.rst [#second]_.
+
+The CTI & CTM Modules
+---------------------
+
+The CTI (Cross Trigger Interface) provides a set of trigger signals between
+individual CTIs and components, and can propagate these between all CTIs via
+channels on the CTM (Cross Trigger Matrix).
+
+A separate documentation file is provided to explain the use of these devices.
+(Documentation/trace/coresight/coresight-ect.rst) [#fourth]_.
+
+CoreSight System Configuration
+------------------------------
+
+CoreSight components can be complex devices with many programming options.
+Furthermore, components can be programmed to interact with each other across the
+complete system.
+
+A CoreSight System Configuration manager is provided to allow these complex programming
+configurations to be selected and used easily from perf and sysfs.
+
+See the separate document for further information.
+(Documentation/trace/coresight/coresight-config.rst) [#fifth]_.
+
.. [#first] Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
.. [#second] Documentation/trace/stm.rst
.. [#third] https://github.com/Linaro/perf-opencsd
+
+.. [#fourth] Documentation/trace/coresight/coresight-ect.rst
+
+.. [#fifth] Documentation/trace/coresight/coresight-config.rst
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/trace/events-msr.rst b/Documentation/trace/events-msr.rst
index e938aa0b6f4f..810481e530b6 100644
--- a/Documentation/trace/events-msr.rst
+++ b/Documentation/trace/events-msr.rst
@@ -4,7 +4,7 @@ MSR Trace Events
The x86 kernel supports tracing most MSR (Model Specific Register) accesses.
To see the definition of the MSRs on Intel systems please see the SDM
-at http://www.intel.com/sdm (Volume 3)
+at https://www.intel.com/sdm (Volume 3)
Available trace points:
diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst
index 2ef318962e29..f45bf11fa88d 100644
--- a/Documentation/trace/events-power.rst
+++ b/Documentation/trace/events-power.rst
@@ -75,16 +75,6 @@ The PM QoS events are used for QoS add/update/remove request and for
target/flags update.
::
- pm_qos_add_request "pm_qos_class=%s value=%d"
- pm_qos_update_request "pm_qos_class=%s value=%d"
- pm_qos_remove_request "pm_qos_class=%s value=%d"
- pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld"
-
-The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY").
-The second parameter is value to be added/updated/removed.
-The third parameter is timeout value in usec.
-::
-
pm_qos_update_target "action=%s prev_value=%d curr_value=%d"
pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x"
@@ -92,7 +82,7 @@ The first parameter gives the QoS action name (e.g. "ADD_REQ").
The second parameter is the previous QoS value.
The third parameter is the current QoS value to update.
-And, there are also events used for device PM QoS add/update/remove request.
+There are also events used for device PM QoS add/update/remove request.
::
dev_pm_qos_add_request "device=%s type=%s new_value=%d"
@@ -103,3 +93,12 @@ The first parameter gives the device name which tries to add/update/remove
QoS requests.
The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY").
The third parameter is value to be added/updated/removed.
+
+And, there are events used for CPU latency QoS add/update/remove request.
+::
+
+ pm_qos_add_request "value=%d"
+ pm_qos_update_request "value=%d"
+ pm_qos_remove_request "value=%d"
+
+The parameter is the value to be added/updated/removed.
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index ed79b220bd07..c47f381d0c00 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -198,6 +198,15 @@ The glob (~) accepts a wild card character (\*,?) and character classes
prev_comm ~ "*sh*"
prev_comm ~ "ba*sh"
+If the field is a pointer that points into user space (for example
+"filename" from sys_enter_openat), then you have to append ".ustring" to the
+field name::
+
+ filename.ustring ~ "password"
+
+This is needed because the kernel has to know how to retrieve from user space
+the memory that the pointer points to.
+
5.2 Setting filters
-------------------
@@ -230,6 +239,16 @@ Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.
+5.2.1 Filter limitations
+------------------------
+
+If a filter is placed on a string pointer ``(char *)`` that does not point
+to a string on the ring buffer, but instead points to kernel or user space
+memory, then, for safety reasons, at most 1024 bytes of the content is
+copied onto a temporary buffer to do the compare. If the copy of the memory
+faults (the pointer points to memory that should not be accessed), then the
+string compare will be treated as not matching.
+
5.3 Clearing filters
--------------------
@@ -342,7 +361,8 @@ section of Documentation/trace/ftrace.rst), but there are major
differences and the implementation isn't currently tied to it in any
way, so beware about making generalizations between the two.
-Note: Writing into trace_marker (See Documentation/trace/ftrace.rst)
+.. Note::
+ Writing into trace_marker (See Documentation/trace/ftrace.rst)
can also enable triggers that are written into
/sys/kernel/tracing/events/ftrace/print/trigger
@@ -526,8 +546,8 @@ The following commands are supported:
See Documentation/trace/histogram.rst for details and examples.
-6.3 In-kernel trace event API
------------------------------
+7. In-kernel trace event API
+============================
In most cases, the command-line interface to trace events is more than
sufficient. Sometimes, however, applications might find the need for
@@ -559,8 +579,8 @@ following:
- tracing synthetic events from in-kernel code
- the low-level "dynevent_cmd" API
-6.3.1 Dyamically creating synthetic event definitions
------------------------------------------------------
+7.1 Dynamically creating synthetic event definitions
+----------------------------------------------------
There are a couple ways to create a new synthetic event from a kernel
module or other kernel code.
@@ -569,14 +589,14 @@ The first creates the event in one step, using synth_event_create().
In this method, the name of the event to create and an array defining
the fields is supplied to synth_event_create(). If successful, a
synthetic event with that name and fields will exist following that
-call. For example, to create a new "schedtest" synthetic event:
+call. For example, to create a new "schedtest" synthetic event::
ret = synth_event_create("schedtest", sched_fields,
ARRAY_SIZE(sched_fields), THIS_MODULE);
The sched_fields param in this example points to an array of struct
synth_field_desc, each of which describes an event field by type and
-name:
+name::
static struct synth_field_desc sched_fields[] = {
{ .type = "pid_t", .name = "next_pid_field" },
@@ -588,8 +608,19 @@ name:
{ .type = "int", .name = "my_int_field" },
};
-See synth_field_size() for available types. If field_name contains [n]
-the field is considered to be an array.
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_name contains [] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+Because space for an event is reserved before assigning field values
+to the event, using dynamic arrays implies that the piecewise
+in-kernel API described below can't be used with dynamic arrays. The
+other non-piecewise in-kernel APIs can, however, be used with dynamic
+arrays.
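+
+For example, a hypothetical field array mixing a statically sized string
+and a dynamic string (the names here are made up for illustration) could
+be defined as::
+
+  static struct synth_field_desc my_array_fields[] = {
+          { .type = "char[16]",  .name = "static_str_field" },
+          { .type = "char[]",    .name = "dynamic_str_field" },
+  };
+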
If the event is created from within a module, a pointer to the module
must be passed to synth_event_create(). This will ensure that the
@@ -615,7 +646,7 @@ synth_event_gen_cmd_array_start(), the user should create and
initialize a dynevent_cmd object using synth_event_cmd_init().
For example, to create a new "schedtest" synthetic event with two
-fields:
+fields::
struct dynevent_cmd cmd;
char *buf;
@@ -631,7 +662,7 @@ fields:
"u64", "ts_ns");
Alternatively, using an array of struct synth_field_desc fields
-containing the same information:
+containing the same information::
ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
fields, n_fields);
@@ -640,7 +671,7 @@ Once the synthetic event object has been created, it can then be
populated with more fields. Fields are added one by one using
synth_event_add_field(), supplying the dynevent_cmd object, a field
type, and a field name. For example, to add a new int field named
-"intfield", the following call should be made:
+"intfield", the following call should be made::
ret = synth_event_add_field(&cmd, "int", "intfield");
@@ -649,7 +680,7 @@ the field is considered to be an array.
A group of fields can also be added all at once using an array of
synth_field_desc with add_synth_fields(). For example, this would add
-just the first four sched_fields:
+just the first four sched_fields::
ret = synth_event_add_fields(&cmd, sched_fields, 4);
@@ -658,15 +689,15 @@ synth_event_add_field_str() can be used to add it as-is; it will
also automatically append a ';' to the string.
Once all the fields have been added, the event should be finalized and
-registered by calling the synth_event_gen_cmd_end() function:
+registered by calling the synth_event_gen_cmd_end() function::
ret = synth_event_gen_cmd_end(&cmd);
At this point, the event object is ready to be used for tracing new
events.
-6.3.3 Tracing synthetic events from in-kernel code
---------------------------------------------------
+7.2 Tracing synthetic events from in-kernel code
+------------------------------------------------
To trace a synthetic event, there are several options. The first
option is to trace the event in one call, using synth_event_trace()
@@ -677,8 +708,8 @@ synth_event_trace_start() and synth_event_trace_end() along with
synth_event_add_next_val() or synth_event_add_val() to add the values
piecewise.
-6.3.3.1 Tracing a synthetic event all at once
----------------------------------------------
+7.2.1 Tracing a synthetic event all at once
+-------------------------------------------
To trace a synthetic event all at once, the synth_event_trace() or
synth_event_trace_array() functions can be used.
@@ -691,7 +722,7 @@ trace array)), along with an variable number of u64 args, one for each
synthetic event field, and the number of values being passed.
So, to trace an event corresponding to the synthetic event definition
-above, code like the following could be used:
+above, code like the following could be used::
ret = synth_event_trace(create_synth_test, 7, /* number of values */
444, /* next_pid_field */
@@ -715,7 +746,7 @@ trace array)), along with an array of u64, one for each synthetic
event field.
To trace an event corresponding to the synthetic event definition
-above, code like the following could be used:
+above, code like the following could be used::
u64 vals[7];
@@ -739,7 +770,7 @@ In order to trace a synthetic event, a pointer to the trace event file
is needed. The trace_get_event_file() function can be used to get
it - it will find the file in the given trace instance (in this case
NULL since the top trace array is being used) while at the same time
-preventing the instance containing it from going away:
+preventing the instance containing it from going away::
schedtest_event_file = trace_get_event_file(NULL, "synthetic",
"schedtest");
@@ -751,48 +782,48 @@ To enable a synthetic event from the kernel, trace_array_set_clr_event()
can be used (which is not specific to synthetic events, so does need
the "synthetic" system name to be specified explicitly).
-To enable the event, pass 'true' to it:
+To enable the event, pass 'true' to it::
trace_array_set_clr_event(schedtest_event_file->tr,
"synthetic", "schedtest", true);
-To disable it pass false:
+To disable it pass false::
trace_array_set_clr_event(schedtest_event_file->tr,
"synthetic", "schedtest", false);
Finally, synth_event_trace_array() can be used to actually trace the
-event, which should be visible in the trace buffer afterwards:
+event, which should be visible in the trace buffer afterwards::
ret = synth_event_trace_array(schedtest_event_file, vals,
ARRAY_SIZE(vals));
To remove the synthetic event, the event should be disabled, and the
-trace instance should be 'put' back using trace_put_event_file():
+trace instance should be 'put' back using trace_put_event_file()::
trace_array_set_clr_event(schedtest_event_file->tr,
"synthetic", "schedtest", false);
trace_put_event_file(schedtest_event_file);
If those have been successful, synth_event_delete() can be called to
-remove the event:
+remove the event::
ret = synth_event_delete("schedtest");
-6.3.3.1 Tracing a synthetic event piecewise
--------------------------------------------
+7.2.2 Tracing a synthetic event piecewise
+-----------------------------------------
To trace a synthetic using the piecewise method described above, the
synth_event_trace_start() function is used to 'open' the synthetic
-event trace:
+event trace::
- struct synth_trace_state trace_state;
+ struct synth_event_trace_state trace_state;
ret = synth_event_trace_start(schedtest_event_file, &trace_state);
It's passed the trace_event_file representing the synthetic event
using the same methods as described above, along with a pointer to a
-struct synth_trace_state object, which will be zeroed before use and
+struct synth_event_trace_state object, which will be zeroed before use and
used to maintain state between this and following calls.
Once the event has been opened, which means space for it has been
@@ -804,12 +835,12 @@ lookup per field.
To assign the values one after the other without lookups,
synth_event_add_next_val() should be used. Each call is passed the
-same synth_trace_state object used in the synth_event_trace_start(),
+same synth_event_trace_state object used in the synth_event_trace_start(),
along with the value to set the next field in the event. After each
field is set, the 'cursor' points to the next field, which will be set
by the subsequent call, continuing until all the fields have been set
in order. The same sequence of calls as in the above examples using
-this method would be (without error-handling code):
+this method would be (without error-handling code)::
/* next_pid_field */
ret = synth_event_add_next_val(777, &trace_state);
@@ -833,11 +864,11 @@ this method would be (without error-handling code):
ret = synth_event_add_next_val(395, &trace_state);
To assign the values in any order, synth_event_add_val() should be
-used. Each call is passed the same synth_trace_state object used in
+used. Each call is passed the same synth_event_trace_state object used in
the synth_event_trace_start(), along with the field name of the field
to set and the value to set it to. The same sequence of calls as in
the above examples using this method would be (without error-handling
-code):
+code)::
ret = synth_event_add_val("next_pid_field", 777, &trace_state);
ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
@@ -855,7 +886,7 @@ can be used but not both at the same time.
Finally, the event won't be actually traced until it's 'closed',
which is done using synth_event_trace_end(), which takes only the
-struct synth_trace_state object used in the previous calls:
+struct synth_event_trace_state object used in the previous calls::
ret = synth_event_trace_end(&trace_state);
@@ -863,8 +894,8 @@ Note that synth_event_trace_end() must be called at the end regardless
of whether any of the add calls failed (say due to a bad field name
being passed in).
-6.3.4 Dyamically creating kprobe and kretprobe event definitions
-----------------------------------------------------------------
+7.3 Dyamically creating kprobe and kretprobe event definitions
+--------------------------------------------------------------
To create a kprobe or kretprobe trace event from kernel code, the
kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
@@ -878,7 +909,7 @@ function. Before calling kprobe_event_gen_cmd_start(), the user
should create and initialize a dynevent_cmd object using
kprobe_event_cmd_init().
-For example, to create a new "schedtest" kprobe event with two fields:
+For example, to create a new "schedtest" kprobe event with two fields::
struct dynevent_cmd cmd;
char *buf;
@@ -900,18 +931,18 @@ Once the kprobe event object has been created, it can then be
populated with more fields. Fields can be added using
kprobe_event_add_fields(), supplying the dynevent_cmd object along
with a variable arg list of probe fields. For example, to add a
-couple additional fields, the following call could be made:
+couple additional fields, the following call could be made::
ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
Once all the fields have been added, the event should be finalized and
registered by calling the kprobe_event_gen_cmd_end() or
kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
-or kretprobe command was started:
+or kretprobe command was started::
ret = kprobe_event_gen_cmd_end(&cmd);
-or
+or::
ret = kretprobe_event_gen_cmd_end(&cmd);
@@ -920,13 +951,13 @@ events.
Similarly, a kretprobe event can be created using
kretprobe_event_gen_cmd_start() with a probe name and location and
-additional params such as $retval:
+additional params such as $retval::
ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
"do_sys_open", "$retval");
Similar to the synthetic event case, code like the following can be
-used to enable the newly created kprobe event:
+used to enable the newly created kprobe event::
gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");
@@ -934,14 +965,14 @@ used to enable the newly created kprobe event:
"kprobes", "gen_kprobe_test", true);
Finally, also similar to synthetic events, the following code can be
-used to give the kprobe event file back and delete the event:
+used to give the kprobe event file back and delete the event::
trace_put_event_file(gen_kprobe_test);
ret = kprobe_event_delete("gen_kprobe_test");
-6.3.4 The "dynevent_cmd" low-level API
---------------------------------------
+7.4 The "dynevent_cmd" low-level API
+------------------------------------
Both the in-kernel synthetic event and kprobe interfaces are built on
top of a lower-level "dynevent_cmd" interface. This interface is
@@ -963,7 +994,7 @@ are described below.
The first step in building a new command string is to create and
initialize an instance of a dynevent_cmd. Here, for instance, we
-create a dynevent_cmd on the stack and initialize it:
+create a dynevent_cmd on the stack and initialize it::
struct dynevent_cmd cmd;
char *buf;
@@ -989,7 +1020,7 @@ calls to argument-adding functions.
To add a single argument, define and initialize a struct dynevent_arg
or struct dynevent_arg_pair object. Here's an example of the simplest
possible arg addition, which is simply to append the given string as
-a whitespace-separated argument to the command:
+a whitespace-separated argument to the command::
struct dynevent_arg arg;
@@ -1007,7 +1038,7 @@ the arg.
Here's another more complicated example using an 'arg pair', which is
used to create an argument that consists of a couple components added
together as a unit, for example, a 'type field_name;' arg or a simple
-expression arg e.g. 'flags=%cx':
+expression arg e.g. 'flags=%cx'::
struct dynevent_arg_pair arg_pair;
@@ -1031,7 +1062,7 @@ Any number of dynevent_*_add() calls can be made to build up the string
(until its length surpasses cmd->maxlen). When all the arguments have
been added and the command string is complete, the only thing left to
do is run the command, which happens by simply calling
-dynevent_create():
+dynevent_create()::
ret = dynevent_create(&cmd);
diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
new file mode 100644
index 000000000000..b64bec1ce144
--- /dev/null
+++ b/Documentation/trace/fprobe.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Fprobe - Function entry/exit probe
+==================================
+
+.. Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+Introduction
+============
+
+Fprobe is a function entry/exit probe mechanism based on ftrace.
+Instead of using ftrace full feature, if you only want to attach callbacks
+on function entry and exit, similar to the kprobes and kretprobes, you can
+use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
+instrumentation for multiple functions with single handler. This document
+describes how to use fprobe.
+
+The usage of fprobe
+===================
+
+The fprobe is a wrapper of ftrace (+ kretprobe-like return callback) to
+attach callbacks to multiple function entry and exit. User needs to set up
+the `struct fprobe` and pass it to `register_fprobe()`.
+
+Typically, `fprobe` data structure is initialized with the `entry_handler`
+and/or `exit_handler` as below.
+
+.. code-block:: c
+
+ struct fprobe fp = {
+ .entry_handler = my_entry_callback,
+ .exit_handler = my_exit_callback,
+ };
+
+To enable the fprobe, call one of register_fprobe(), register_fprobe_ips(), and
+register_fprobe_syms(). These functions register the fprobe with different types
+of parameters.
+
+The register_fprobe() enables a fprobe using function-name filters.
+E.g. this enables @fp on the "func*()" functions except "func2()"::
+
+ register_fprobe(&fp, "func*", "func2");
+
+The register_fprobe_ips() enables a fprobe by ftrace-location addresses.
+E.g.
+
+.. code-block:: c
+
+ unsigned long ips[] = { 0x.... };
+
+ register_fprobe_ips(&fp, ips, ARRAY_SIZE(ips));
+
+And the register_fprobe_syms() enables a fprobe by symbol names.
+E.g.
+
+.. code-block:: c
+
+ char *syms[] = {"func1", "func2", "func3"};
+
+ register_fprobe_syms(&fp, syms, ARRAY_SIZE(syms));
+
+To disable (remove from functions) this fprobe, call::
+
+ unregister_fprobe(&fp);
+
+You can temporarily (soft) disable the fprobe by::
+
+ disable_fprobe(&fp);
+
+and resume by::
+
+ enable_fprobe(&fp);
+
+The above is defined by including the header::
+
+ #include <linux/fprobe.h>
+
+As with ftrace, the registered callbacks will start being called some time
+after register_fprobe() is called and before it returns. See
+:file:`Documentation/trace/ftrace.rst`.
+
+Also, unregister_fprobe() guarantees that both the entry and exit handlers
+are no longer called by any function after unregister_fprobe() returns,
+just like unregister_ftrace_function().
+
+The fprobe entry/exit handler
+=============================
+
+The prototype of the entry/exit callback function is as follows:
+
+.. code-block:: c
+
+ void callback_func(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs);
+
+Note that both entry and exit callbacks have the same prototype. The @entry_ip
+is saved at function entry and passed to the exit handler.
+
+@fp
+ This is the address of `fprobe` data structure related to this handler.
+ You can embed the `fprobe` to your data structure and get it by
+ container_of() macro from @fp. The @fp must not be NULL.
+
+@entry_ip
+ This is the ftrace address of the traced function (both entry and exit).
+ Note that this may not be the actual entry address of the function but
+ the address where the ftrace is instrumented.
+
+@regs
+ This is the `pt_regs` data structure at the entry and exit. Note that
+ the instruction pointer of @regs may be different from the @entry_ip
+ in the entry_handler. If you need the traced instruction pointer, use
+ @entry_ip. On the other hand, in the exit_handler, the instruction
+ pointer of @regs is set to the return address of the traced function.
+
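+A minimal sketch of the my_entry_callback/my_exit_callback handlers used in
+the initialization example above could look like the following (the
+pr_info() output is only for illustration):
+
+.. code-block:: c
+
+ static void my_entry_callback(struct fprobe *fp, unsigned long entry_ip,
+                               struct pt_regs *regs)
+ {
+         /* @entry_ip is the ftrace address of the traced function */
+         pr_info("entering %pS\n", (void *)entry_ip);
+ }
+
+ static void my_exit_callback(struct fprobe *fp, unsigned long entry_ip,
+                              struct pt_regs *regs)
+ {
+         /* here instruction_pointer(regs) is the return address */
+         pr_info("leaving %pS, returning to %pS\n",
+                 (void *)entry_ip, (void *)instruction_pointer(regs));
+ }
+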
+Share the callbacks with kprobes
+================================
+
+Since the recursion safeness of the fprobe (and ftrace) is a bit different
+from that of kprobes, this may cause an issue if a user wants to run the
+same code from the fprobe and the kprobes.
+
+Kprobes has a per-cpu 'current_kprobe' variable which protects the kprobe
+handler from recursion in all cases. On the other hand, fprobe uses
+only ftrace_test_recursion_trylock(). This allows interrupt context to
+call another (or the same) fprobe while the fprobe user handler is running.
+
+This is not a problem if the common callback code has its own recursion
+detection, or if it can handle recursion in the different contexts
+(normal/interrupt/NMI).
+But if it relies on the 'current_kprobe' recursion lock, it has to check
+kprobe_running() and use the kprobe_busy_*() APIs.
+
+Fprobe has the FPROBE_FL_KPROBE_SHARED flag to do this. If your common callback
+code will be shared with kprobes, please set FPROBE_FL_KPROBE_SHARED
+*before* registering the fprobe, like:
+
+.. code-block:: c
+
+ fprobe.flags = FPROBE_FL_KPROBE_SHARED;
+
+ register_fprobe(&fprobe, "func*", NULL);
+
+This will protect your common callback from nested calls.
+
+The missed counter
+==================
+
+The `fprobe` data structure has a `fprobe::nmissed` counter field, just like
+kprobes.
+This counter counts up when:
+
+ - fprobe fails to take the ftrace_recursion lock. This usually means that a function
+ which is traced by other ftrace users is called from the entry_handler.
+
+ - fprobe fails to set up the function exit because of a shortage of rethook
+ (the shadow stack for hooking the function return.)
+
+The `fprobe::nmissed` field counts up in both cases. The former case skips
+both the entry and exit callbacks and the latter skips only the exit
+callback, but in both cases the counter increases by 1.
+
+Note that if you set FTRACE_OPS_FL_RECURSION and/or FTRACE_OPS_FL_RCU in
+`fprobe::ops::flags` (ftrace_ops::flags) when registering the fprobe, this
+counter may not work correctly, because ftrace skips the fprobe function
+which increases the counter.
+
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/fprobe.h
+.. kernel-doc:: kernel/trace/fprobe.c
+
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index a8e22e0db63c..6893399157f0 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and
pass the return address pointer as the 'retp' argument to
ftrace_push_return_trace().
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
HAVE_SYSCALL_TRACEPOINTS
------------------------
diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst
index 2a05e770618a..f7d98ae5b885 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -30,8 +30,8 @@ The ftrace context
This requires extra care to what can be done inside a callback. A callback
can be called outside the protective scope of RCU.
-The ftrace infrastructure has some protections against recursions and RCU
-but one must still be very careful how they use the callbacks.
+There are helper functions to protect against recursion and to make sure
+RCU is watching. These are explained below.
The ftrace_ops structure
@@ -55,17 +55,17 @@ an ftrace_ops with ftrace:
Both .flags and .private are optional. Only .func is required.
-To enable tracing call:
+To enable tracing call::
-.. c:function:: register_ftrace_function(&ops);
+ register_ftrace_function(&ops);
-To disable tracing call:
+To disable tracing call::
-.. c:function:: unregister_ftrace_function(&ops);
+ unregister_ftrace_function(&ops);
-The above is defined by including the header:
+The above is defined by including the header::
-.. c:function:: #include <linux/ftrace.h>
+ #include <linux/ftrace.h>
The registered callback will start being called some time after the
register_ftrace_function() is called and before it returns. The exact time
@@ -108,6 +108,58 @@ The prototype of the callback function is as follows (as of v4.14):
at the start of the function where ftrace was tracing. Otherwise it
either contains garbage, or NULL.
+Protect your callback
+=====================
+
+As functions can be called from anywhere, it is possible that a function
+called by a callback may also be traced and call that same callback, so
+recursion protection must be used. There are two helper functions that
+can help in this regard. If you start your code with:
+
+.. code-block:: c
+
+ int bit;
+
+ bit = ftrace_test_recursion_trylock(ip, parent_ip);
+ if (bit < 0)
+ return;
+
+and end it with:
+
+.. code-block:: c
+
+ ftrace_test_recursion_unlock(bit);
+
+The code in between will be safe to use, even if it ends up calling a
+function that the callback is tracing. Note, on success,
+ftrace_test_recursion_trylock() will disable preemption, and the
+ftrace_test_recursion_unlock() will enable it again (if it was previously
+enabled). The instruction pointer (ip) and its parent (parent_ip) are passed to
+ftrace_test_recursion_trylock() to record where the recursion happened
+(if CONFIG_FTRACE_RECORD_RECURSION is set).
+
+Alternatively, if the FTRACE_OPS_FL_RECURSION flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for recursion for the callback and no recursion test needs to be done.
+But this comes at the expense of slightly more overhead from an extra
+function call.
+
+If your callback accesses any data or critical section that requires RCU
+protection, it is best to make sure that RCU is "watching", otherwise
+that data or critical section will not be protected as expected. In this
+case add:
+
+.. code-block:: c
+
+ if (!rcu_is_watching())
+ return;
+
+Alternatively, if the FTRACE_OPS_FL_RCU flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for rcu_is_watching() for the callback and no other test needs to be done.
+But this comes at the expense of slightly more overhead from an extra
+function call.
+
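+Putting the helpers together, a callback that does its own recursion and
+RCU checks could be structured as below. This is only a sketch: the
+callback name and body are hypothetical, and the signature should match
+the callback prototype documented above for your kernel version.
+
+.. code-block:: c
+
+ static void my_callback_func(unsigned long ip, unsigned long parent_ip,
+                              struct ftrace_ops *op, struct pt_regs *regs)
+ {
+         int bit;
+
+         /* bail out if this context is already inside the callback */
+         bit = ftrace_test_recursion_trylock(ip, parent_ip);
+         if (bit < 0)
+                 return;
+
+         /* the trylock succeeded, so the lock must be released below */
+         if (!rcu_is_watching())
+                 goto out;
+
+         /* ... do the real work of the callback here ... */
+
+ out:
+         ftrace_test_recursion_unlock(bit);
+ }
+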
The ftrace FLAGS
================
@@ -128,26 +180,20 @@ FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED
will not fail with this flag set. But the callback must check if
regs is NULL or not to determine if the architecture supports it.
-FTRACE_OPS_FL_RECURSION_SAFE
- By default, a wrapper is added around the callback to
- make sure that recursion of the function does not occur. That is,
- if a function that is called as a result of the callback's execution
- is also traced, ftrace will prevent the callback from being called
- again. But this wrapper adds some overhead, and if the callback is
- safe from recursion, it can set this flag to disable the ftrace
- protection.
-
- Note, if this flag is set, and recursion does occur, it could cause
- the system to crash, and possibly reboot via a triple fault.
-
- It is OK if another callback traces a function that is called by a
- callback that is marked recursion safe. Recursion safe callbacks
- must never trace any function that are called by the callback
- itself or any nested functions that those functions call.
-
- If this flag is set, it is possible that the callback will also
- be called with preemption enabled (when CONFIG_PREEMPTION is set),
- but this is not guaranteed.
+FTRACE_OPS_FL_RECURSION
+ By default, it is expected that the callback can handle recursion.
+ But if the callback is not that worried about overhead, then
+ setting this bit will add the recursion protection around the
+ callback by calling a helper function that will do the recursion
+ protection and only call the callback if it did not recurse.
+
+ Note, if this flag is not set, and recursion does occur, it could
+ cause the system to crash, and possibly reboot via a triple fault.
+
+ Note, if this flag is set, then the callback will always be called
+ with preemption disabled. If it is not set, then it is possible
+ (but not guaranteed) that the callback will be called in
+ preemptible context.
FTRACE_OPS_FL_IPMODIFY
Requires FTRACE_OPS_FL_SAVE_REGS set. If the callback is to "hijack"
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index ff658e27d25b..60bceb018d6a 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -34,13 +34,13 @@ Throughout the kernel is hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
-See events.txt for more information.
+See events.rst for more information.
Implementation Details
----------------------
-See :doc:`ftrace-design` for details for arch porters and such.
+See Documentation/trace/ftrace-design.rst for details for arch porters and such.
The File System
@@ -125,10 +125,13 @@ of ftrace. Here is a list of some of the key files:
trace:
This file holds the output of the trace in a human
- readable format (described below). Note, tracing is temporarily
- disabled when the file is open for reading. Once all readers
- are closed, tracing is re-enabled. Opening this file for
+ readable format (described below). Opening this file for
writing with the O_TRUNC flag clears the ring buffer content.
+ Note, this file is not a consumer. If tracing is off
+ (no tracer running, or tracing_on is zero), it will produce
+ the same output each time it is read. When tracing is on,
+ it may produce inconsistent results as it tries to read
+ the entire buffer without consuming it.
trace_pipe:
@@ -142,9 +145,7 @@ of ftrace. Here is a list of some of the key files:
will not be read again with a sequential read. The
"trace" file is static, and if the tracer is not
adding more data, it will display the same
- information every time it is read. Unlike the
- "trace" file, opening this file for reading will not
- temporarily disable tracing.
+ information every time it is read.
trace_options:
@@ -262,6 +263,20 @@ of ftrace. Here is a list of some of the key files:
traced by the function tracer as well. This option will also
cause PIDs of tasks that exit to be removed from the file.
+ set_ftrace_notrace_pid:
+
+ Have the function tracer ignore threads whose PIDs are listed in
+ this file.
+
+ If the "function-fork" option is set, then when a task whose
+ PID is listed in this file forks, the child's PID will
+ automatically be added to this file, and the child will not be
+ traced by the function tracer as well. This option will also
+ cause PIDs of tasks that exit to be removed from the file.
+
+ If a PID is in both this file and "set_ftrace_pid", then this
+ file takes precedence, and the thread will not be traced.
+
set_event_pid:
Have the events only trace a task with a PID listed in this file.
@@ -273,6 +288,19 @@ of ftrace. Here is a list of some of the key files:
cause the PIDs of tasks to be removed from this file when the task
exits.
+ set_event_notrace_pid:
+
+ Have the events not trace a task with a PID listed in this file.
+ Note, the sched_switch and sched_wakeup events will still be traced
+ even if they involve a thread whose PID is in this file, as long as
+ the event also traces a thread that should be traced.
+
+ To have the PIDs of children of tasks with their PID in this file
+ added on fork, enable the "event-fork" option. That option will also
+ cause the PIDs of tasks to be removed from this file when the task
+ exits.
+
set_graph_function:
Functions listed in this file will cause the function graph
@@ -326,8 +354,8 @@ of ftrace. Here is a list of some of the key files:
is being directly called by the function. If the count is greater
than 1 it most likely will be ftrace_ops_list_func().
- If the callback of the function jumps to a trampoline that is
- specific to a the callback and not the standard trampoline,
+ If the callback of a function jumps to a trampoline that is
+ specific to the callback and which is not the standard trampoline,
its address will be printed as well as the function that the
trampoline calls.
@@ -348,11 +376,11 @@ of ftrace. Here is a list of some of the key files:
kprobe_events:
- Enable dynamic trace points. See kprobetrace.txt.
+ Enable dynamic trace points. See kprobetrace.rst.
kprobe_profile:
- Dynamic trace points stats. See kprobetrace.txt.
+ Dynamic trace points stats. See kprobetrace.rst.
max_graph_depth:
@@ -489,6 +517,18 @@ of ftrace. Here is a list of some of the key files:
processing should be able to handle them. See comments in the
ktime_get_boot_fast_ns() function for more information.
+ tai:
+ This is the tai clock (CLOCK_TAI) and is derived from the wall-
+ clock time. However, this clock does not experience
+ discontinuities and backwards jumps caused by NTP inserting leap
+ seconds. Since the clock access is designed for use in tracing,
+ side effects are possible. The clock access may yield wrong
+ readouts in case the internal TAI offset is updated e.g., caused
+ by setting the system time or using adjtimex() with an offset.
+ These effects are rare and post processing should be able to
+ handle them. See comments in the ktime_get_tai_fast_ns()
+ function for more information.
+
To set a clock, simply echo the clock name into this file::
# echo global > trace_clock
@@ -524,7 +564,7 @@ of ftrace. Here is a list of some of the key files:
start::
- trace_fd = open("trace_marker", WR_ONLY);
+ trace_fd = open("trace_marker", O_WRONLY);
Note: Writing into the trace_marker file can also initiate triggers
that are written into /sys/kernel/tracing/events/ftrace/print/trigger
@@ -533,14 +573,14 @@ of ftrace. Here is a list of some of the key files:
trace_marker_raw:
- This is similar to trace_marker above, but is meant for for binary data
+ This is similar to trace_marker above, but is meant for binary data
to be written to it, where a tool can be used to parse the data
from trace_pipe_raw.
uprobe_events:
Add dynamic tracepoints in programs.
- See uprobetracer.txt
+ See uprobetracer.rst
uprobe_profile:
@@ -561,19 +601,19 @@ of ftrace. Here is a list of some of the key files:
files at various levels that can enable the tracepoints
when a "1" is written to them.
- See events.txt for more information.
+ See events.rst for more information.
set_event:
By echoing in the event into this file, will enable that event.
- See events.txt for more information.
+ See events.rst for more information.
available_events:
A list of events that can be enabled in tracing.
- See events.txt for more information.
+ See events.rst for more information.
timestamp_mode:
@@ -1125,6 +1165,18 @@ Here are the available options:
the trace displays additional information about the
latency, as described in "Latency trace format".
+ pause-on-trace
+ When set, opening the trace file for read will pause
+ writing to the ring buffer (as if tracing_on was set to zero).
+ This simulates the original behavior of the trace file.
+ When the file is closed, tracing will be enabled again.
+
+ hash-ptr
+ When set, "%p" in the event printk format displays the
+ hashed pointer value instead of the real address.
+ This is useful if you want to find out which hashed
+ value corresponds to the real value in the trace log.
+
record-cmd
When any event or tracer is enabled, a hook is enabled
in the sched_switch trace point to fill comm cache
@@ -1176,6 +1228,8 @@ Here are the available options:
tasks fork. Also, when tasks with PIDs in set_event_pid exit,
their PIDs will be removed from the file.
+ This affects PIDs listed in set_event_notrace_pid as well.
+
function-trace
The latency tracers will enable function tracing
if this option is enabled (default it is). When
@@ -1190,6 +1244,8 @@ Here are the available options:
set_ftrace_pid exit, their PIDs will be removed from the
file.
+ This affects PIDs in set_ftrace_notrace_pid as well.
+
display-graph
When set, the latency tracers (irqsoff, wakeup, etc) will
use function graph tracing instead of function tracing.
@@ -1356,7 +1412,7 @@ an example::
=> x86_64_start_reservations
=> x86_64_start_kernel
-Here we see that that we had a latency of 16 microseconds (which is
+Here we see that we had a latency of 16 microseconds (which is
very good). The _raw_spin_lock_irq in run_timer_softirq disabled
interrupts. The difference between the 16 and the displayed
timestamp 25us occurred because the clock was incremented
@@ -1415,7 +1471,7 @@ function-trace, we get a much larger output::
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> __ext3_get_inode_loc
@@ -1486,7 +1542,7 @@ display-graph option::
=> remove_vma
=> exit_mmap
=> mmput
- => flush_old_exec
+ => begin_new_exec
=> load_elf_binary
=> search_binary_handler
=> __do_execve_file.isra.32
@@ -1700,7 +1756,7 @@ tracers.
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> ext3_bread
@@ -2126,6 +2182,8 @@ periodically make a CPU constantly busy with interrupts disabled.
# cat trace
# tracer: hwlat
#
+ # entries-in-buffer/entries-written: 13/13 #P:8
+ #
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
@@ -2133,12 +2191,18 @@ periodically make a CPU constantly busy with interrupts disabled.
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
- <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
- <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
- <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
- <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
- <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
- <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
+ <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6
+ <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1
+ <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5
+ <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1
+ <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1
+ <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2
+ <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1
+ <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1
+ <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1
+ <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
+ <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
+ <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1
The above output is somewhat the same in the header. All events will have
@@ -2148,7 +2212,7 @@ interrupts disabled 'd'. Under the FUNCTION title there is:
This is the count of events recorded that were greater than the
tracing_threshold (See below).
- inner/outer(us): 12/14
+ inner/outer(us): 11/11
This shows two numbers as "inner latency" and "outer latency". The test
runs in a loop checking a timestamp twice. The latency detected within
@@ -2156,11 +2220,15 @@ interrupts disabled 'd'. Under the FUNCTION title there is:
after the previous timestamp and the next timestamp in the loop is
the "outer latency".
- ts:1499801089.066141940
+ ts:1581527483.343962693
+
+ The absolute timestamp that the first latency was recorded in the window.
+
+ count:6
- The absolute timestamp that the event happened.
+ The number of times a latency was detected during the window.
- nmi-total:4 nmi-count:1
+ nmi-total:7 nmi-count:1
On architectures that support it, if an NMI comes in during the
test, the time spent in NMI is reported in "nmi-total" (in
@@ -2386,11 +2454,10 @@ Or this simple script!
#!/bin/bash
tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
- echo nop > $tracefs/tracing/current_tracer
- echo 0 > $tracefs/tracing/tracing_on
- echo $$ > $tracefs/tracing/set_ftrace_pid
- echo function > $tracefs/tracing/current_tracer
- echo 1 > $tracefs/tracing/tracing_on
+ echo 0 > $tracefs/tracing_on
+ echo $$ > $tracefs/set_ftrace_pid
+ echo function > $tracefs/current_tracer
+ echo 1 > $tracefs/tracing_on
exec "$@"
@@ -2706,7 +2773,7 @@ listed in:
put_prev_task_idle
kmem_cache_create
pick_next_task_rt
- get_online_cpus
+ cpus_read_lock
pick_next_task_fair
mutex_lock
[...]
@@ -3315,7 +3382,7 @@ one of the latency tracers, you will get the following results.
Instances
---------
-In the tracefs tracing directory is a directory called "instances".
+In the tracefs tracing directory, there is a directory called "instances".
This directory can have new directories created inside of it using
mkdir, and removing directories with rmdir. The directory created
with mkdir in this directory will already contain files and other
diff --git a/Documentation/trace/hisi-ptt.rst b/Documentation/trace/hisi-ptt.rst
new file mode 100644
index 000000000000..4f87d8e21065
--- /dev/null
+++ b/Documentation/trace/hisi-ptt.rst
@@ -0,0 +1,298 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+HiSilicon PCIe Tune and Trace device
+======================================
+
+Introduction
+============
+
+HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex
+integrated Endpoint (RCiEP) device, providing the capability
+to dynamically monitor and tune the PCIe link's events (tune),
+and trace the TLP headers (trace). The two functions are independent,
+but it is recommended to use them together to analyze and enhance the
+PCIe link's performance.
+
+On Kunpeng 930 SoC, the PCIe Root Complex is composed of several
+PCIe cores. Each PCIe core includes several Root Ports and a PTT
+RCiEP, like below. The PTT device is capable of tuning and
+tracing the links of the PCIe core.
+::
+
+ +--------------Core 0-------+
+ | | [ PTT ] |
+ | | [Root Port]---[Endpoint]
+ | | [Root Port]---[Endpoint]
+ | | [Root Port]---[Endpoint]
+ Root Complex |------Core 1-------+
+ | | [ PTT ] |
+ | | [Root Port]---[ Switch ]---[Endpoint]
+ | | [Root Port]---[Endpoint] `-[Endpoint]
+ | | [Root Port]---[Endpoint]
+ +---------------------------+
+
+The PTT device driver registers one PMU device for each PTT device.
+The name of each PTT device is composed of the 'hisi_ptt' prefix with
+the id of the SICL and the Core where it is located. The Kunpeng 930
+SoC encapsulates multiple CPU dies (SCCL, Super CPU Cluster) and
+IO dies (SICL, Super I/O Cluster), where there's one PCIe Root
+Complex for each SICL.
+::
+
+ /sys/devices/hisi_ptt<sicl_id>_<core_id>
+
+Tune
+====
+
+PTT tune is designed for monitoring and adjusting PCIe link parameters (events).
+Currently we support events in 2 classes. The scope of the events
+covers the PCIe core to which the PTT device belongs.
+
+Each event is presented as a file under $(PTT PMU dir)/tune, and
+a simple open/read/write/close cycle will be used to tune the event.
+::
+
+ $ cd /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune
+ $ ls
+ qos_tx_cpl qos_tx_np qos_tx_p
+ tx_path_rx_req_alloc_buf_level
+ tx_path_tx_req_alloc_buf_level
+ $ cat qos_tx_p
+ 1
+ $ echo 2 > qos_tx_p
+ $ cat qos_tx_p
+ 2
+
+The current (numerical) value of the event can simply be read
+from the file, and the desired value written to the file to tune it.
+
+1. Tx Path QoS Control
+------------------------
+
+The following files are provided to tune the QoS of the tx path of
+the PCIe core.
+
+- qos_tx_cpl: weight of Tx completion TLPs
+- qos_tx_np: weight of Tx non-posted TLPs
+- qos_tx_p: weight of Tx posted TLPs
+
+The weight influences the proportion of certain packets on the PCIe link.
+For example, in a storage scenario, increasing the proportion
+of completion packets on the link can enhance the performance, as
+more completions are consumed.
+
+The available tune data of these events is [0, 1, 2].
+Writing a negative value will return an error, and out of range
+values will be converted to 2. Note that the event value just
+indicates a probable level, but is not precise.
+
+2. Tx Path Buffer Control
+-------------------------
+
+The following files are provided to tune the buffers of the tx path of the
+PCIe core.
+
+- rx_alloc_buf_level: watermark of Rx requested
+- tx_alloc_buf_level: watermark of Tx requested
+
+These events influence the watermark of the buffer allocated for each
+type. Rx means inbound while Tx means outbound. The packets will
+be stored in the buffer first and then transmitted either when the
+watermark is reached or when the timer expires. For a busy direction,
+you should increase the related buffer watermark to avoid frequent
+posting and thus enhance performance. In most cases just keep the
+default value.
+
+The available tune data of above events is [0, 1, 2].
+Writing a negative value will return an error, and out of range
+values will be converted to 2. Note that the event value just
+indicates a probable level, but is not precise.
+
+Trace
+=====
+
+PTT trace is designed for dumping the TLP headers to memory, which
+can be used to analyze the transactions and usage conditions of the PCIe
+Link. You can choose to filter the traced headers by either Requester ID,
+or those downstream of a set of Root Ports on the same core as the PTT
+device. Tracing only headers of a certain type and of a certain direction
+is also supported.
+
+You can use the perf command `perf record` to set the parameters, start
+the trace and get the data. Decoding the trace data with `perf report`
+is also supported. The control parameters for the trace are passed as
+event codes for each event, which will be further illustrated later.
+An example usage is like
+::
+
+ $ perf record -e hisi_ptt0_2/filter=0x80001,type=1,direction=1,
+ format=1/ -- sleep 5
+
+This will trace the TLP headers downstream of root port 0000:00:10.1 (the
+event code for event 'filter' is 0x80001) with a type of posted TLP
+requests, an inbound direction and a traced data format of 8DW.
+
+1. Filter
+---------
+
+The TLP headers to trace can be filtered by the Root Ports or the Requester ID
+of the Endpoint, which are located on the same core of the PTT device. You can
+set the filter by specifying the `filter` parameter which is required to start
+the trace. The parameter value is 20 bits. Bit 19 indicates the filter type.
+1 for Root Port filter and 0 for Requester filter. Bit[15:0] indicates the
+filter value. The value for a Root Port is a mask of the core port id which is
+calculated from its PCI Slot ID as (slotid & 7) * 2. The value for a Requester
+is the Requester ID (Device ID of the PCIe function). Bit[18:16] is currently
+reserved for extension.
+
+For example, if the desired filter is Endpoint function 0000:01:00.1, the
+filter value will be 0x00101. If the desired filter is Root Port
+0000:00:10.0, then the filter value is calculated as 0x80001: slot ID 0x10
+gives a core port id of (0x10 & 7) * 2 = 0, the corresponding mask is bit 0
+(0x0001), and setting the Root Port filter type bit 19 gives
+0x80000 | 0x0001 = 0x80001.
+
+Note that multiple Root Ports can be specified at one time, but only one
+Endpoint function can be specified in one trace. Specifying both Root Port
+and function at the same time is not supported. The driver maintains a list
+of available filters and will reject invalid inputs.
+
+Currently the available filters are detected at driver probe time. If the
+supported devices are removed/added after probe, you may need to reload the
+driver to update the filters.
+
+2. Type
+-------
+
+You can trace the TLP headers of certain types by specifying the `type`
+parameter, which is required to start the trace. The parameter value is
+8 bits. Currently supported types and related values are shown below:
+
+- 8'b00000001: posted requests (P)
+- 8'b00000010: non-posted requests (NP)
+- 8'b00000100: completions (CPL)
+
+You can specify multiple types when tracing inbound TLP headers, but can only
+specify one when tracing outbound TLP headers.
+
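+Assuming the type values above can be OR-ed together for an inbound trace
+(a hypothetical combination, reusing the filter from the earlier example),
+tracing both posted and non-posted inbound requests could look like
+::
+
+ $ perf record -e hisi_ptt0_2/filter=0x80001,type=3/ -- sleep 5
+
+Here `direction` and `format` are left at their defaults (inbound, 4DW).
+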
+3. Direction
+------------
+
+You can trace the TLP headers from a certain direction, which is relative
+to the Root Port or the PCIe core, by specifying the `direction` parameter.
+This is optional and the default is inbound. The parameter value
+is 4 bits. When the desired format is 4DW, directions and related values
+supported are shown below:
+
+- 4'b0000: inbound TLPs (P, NP, CPL)
+- 4'b0001: outbound TLPs (P, NP, CPL)
+- 4'b0010: outbound TLPs (P, NP, CPL) and inbound TLPs (P, NP, CPL B)
+- 4'b0011: outbound TLPs (P, NP, CPL) and inbound TLPs (CPL A)
+
+When the desired format is 8DW, directions and related values supported are
+shown below:
+
+- 4'b0000: reserved
+- 4'b0001: outbound TLPs (P, NP, CPL)
+- 4'b0010: inbound TLPs (P, NP, CPL B)
+- 4'b0011: inbound TLPs (CPL A)
+
+Inbound completions are classified into two types:
+
+- completion A (CPL A): completion of CHI/DMA/Native non-posted requests, except for CPL B
+- completion B (CPL B): completion of DMA remote2local and P2P non-posted requests
+
+4. Format
+--------------
+
+You can change the format of the traced TLP headers by specifying the
+`format` parameter. The default format is 4DW. The parameter value is 4 bits.
+Currently supported formats and related values are shown below:
+
+- 4'b0000: 4DW length per TLP header
+- 4'b0001: 8DW length per TLP header
+
+The traced TLP header format is different from the PCIe standard.
+
+When using the 8DW data format, the entire TLP header is logged
+(Header DW0-3 shown below). For example, the TLP header for Memory
+Reads with 64-bit addresses is shown in PCIe r5.0, Figure 2-17;
+the header for Configuration Requests is shown in Figure 2.20, etc.
+
+In addition, 8DW trace buffer entries contain a timestamp and
+possibly a PASID TLP prefix (see Figure 6-20, PCIe r5.0); if there is
+no prefix, this field will be all 0.
+
+Bits[31:11] of DW0 are always 0x1fffff, which can be
+used to distinguish the data format. The 8DW format is like
+::
+
+ bits [ 31:11 ][ 10:0 ]
+ |---------------------------------------|-------------------|
+ DW0 [ 0x1fffff ][ Reserved (0x7ff) ]
+ DW1 [ Prefix ]
+ DW2 [ Header DW0 ]
+ DW3 [ Header DW1 ]
+ DW4 [ Header DW2 ]
+ DW5 [ Header DW3 ]
+ DW6 [ Reserved (0x0) ]
+ DW7 [ Time ]
+
+When using the 4DW data format, DW0 of the trace buffer entry
+contains selected fields of DW0 of the TLP, together with a
+timestamp. DW1-DW3 of the trace buffer entry contain DW1-DW3
+directly from the TLP header.
+
+4DW format is like
+::
+
+ bits [31:30] [ 29:25 ][24][23][22][21][ 20:11 ][ 10:0 ]
+ |-----|---------|---|---|---|---|-------------|-------------|
+ DW0 [ Fmt ][ Type ][T9][T8][TH][SO][ Length ][ Time ]
+ DW1 [ Header DW1 ]
+ DW2 [ Header DW2 ]
+ DW3 [ Header DW3 ]
+
+5. Memory Management
+--------------------
+
+The traced TLP headers will be written to the memory allocated
+by the driver. The hardware accepts 4 DMA addresses with the same size,
+and writes the buffers sequentially like below. If DMA addr 3 is
+finished and the trace is still on, it will return to addr 0.
+::
+
+ +->[DMA addr 0]->[DMA addr 1]->[DMA addr 2]->[DMA addr 3]-+
+ +---------------------------------------------------------+
+
+The driver allocates each DMA buffer with a size of 4MiB. A finished buffer
+will be copied to the perf AUX buffer allocated by the perf core.
+If the AUX buffer is full while the trace is still on, the driver
+will commit the AUX buffer first and then apply for a new one with
+the same size. The size of the AUX buffer defaults to 16MiB. The user can
+adjust the size by specifying the `-m` parameter of the perf command.
+
+6. Decoding
+-----------
+
+You can decode the traced data with the `perf report -D` command (currently
+only dumping the raw trace data is supported). The traced data will be decoded
+according to the format described previously (taking 8DW as an example):
+::
+
+ [...perf headers and other information]
+ . ... HISI PTT data: size 4194304 bytes
+ . 00000000: 00 00 00 00 Prefix
+ . 00000004: 01 00 00 60 Header DW0
+ . 00000008: 0f 1e 00 01 Header DW1
+ . 0000000c: 04 00 00 00 Header DW2
+ . 00000010: 40 00 81 02 Header DW3
+ . 00000014: 33 c0 04 00 Time
+ . 00000020: 00 00 00 00 Prefix
+ . 00000024: 01 00 00 60 Header DW0
+ . 00000028: 0f 1e 00 01 Header DW1
+ . 0000002c: 04 00 00 00 Header DW2
+ . 00000030: 40 00 81 02 Header DW3
+ . 00000034: 02 00 00 00 Time
+ . 00000040: 00 00 00 00 Prefix
+ . 00000044: 01 00 00 60 Header DW0
+ . 00000048: 0f 1e 00 01 Header DW1
+ . 0000004c: 04 00 00 00 Header DW2
+ . 00000050: 40 00 81 02 Header DW3
+ [...]
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
new file mode 100644
index 000000000000..088c8cce738b
--- /dev/null
+++ b/Documentation/trace/histogram-design.rst
@@ -0,0 +1,2115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Histogram Design Notes
+======================
+
+:Author: Tom Zanussi <zanussi@kernel.org>
+
+This document attempts to provide a description of how the ftrace
+histograms work and how the individual pieces map to the data
+structures used to implement them in trace_events_hist.c and
+tracing_map.c.
+
+Note: All the ftrace histogram command examples assume the working
+ directory is the ftrace /tracing directory. For example::
+
+ # cd /sys/kernel/debug/tracing
+
+Also, the histogram output displayed for those commands will generally
+be truncated - only enough to make the point is displayed.
+
+'hist_debug' trace event files
+==============================
+
+If the kernel is compiled with CONFIG_HIST_TRIGGERS_DEBUG set, an
+event file named 'hist_debug' will appear in each event's
+subdirectory. This file can be read at any time and will display some
+of the hist trigger internals described in this document. Specific
+examples and output will be described in test cases below.
+
+Basic histograms
+================
+
+First, basic histograms. Below is pretty much the simplest thing you
+can do with histograms - create one with a single key on a single
+event and cat the output::
+
+ # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
+
+ # cat events/sched/sched_waking/hist
+
+ { pid: 18249 } hitcount: 1
+ { pid: 13399 } hitcount: 1
+ { pid: 17973 } hitcount: 1
+ { pid: 12572 } hitcount: 1
+ ...
+ { pid: 10 } hitcount: 921
+ { pid: 18255 } hitcount: 1444
+ { pid: 25526 } hitcount: 2055
+ { pid: 5257 } hitcount: 2055
+ { pid: 27367 } hitcount: 2055
+ { pid: 1728 } hitcount: 2161
+
+ Totals:
+ Hits: 21305
+ Entries: 183
+ Dropped: 0
+
+What this does is create a histogram on the sched_waking event using
+pid as a key and with a single value, hitcount, which even if not
+explicitly specified, exists for every histogram regardless.
+
+The hitcount value is a per-bucket value that's automatically
+incremented on every hit for the given key, which in this case is the
+pid.
+
+So in this histogram, there's a separate bucket for each pid, and each
+bucket contains a value for that bucket, counting the number of times
+sched_waking was called for that pid.
+
+Each histogram is represented by a hist_data struct.
+
+To keep track of each key and value field in the histogram, hist_data
+keeps an array of these fields named fields[]. The fields[] array is
+an array containing struct hist_field representations of each
+histogram val and key in the histogram (variables are also included
+here, but are discussed later). So for the above histogram we have one
+key and one value; in this case the one value is the hitcount value,
+which all histograms have, regardless of whether they define that
+value or not, which the above histogram does not.
+
+Each struct hist_field contains a pointer to the ftrace_event_field
+from the event's trace_event_file along with various bits related to
+that such as the size, offset, type, and a hist_field_fn_t function,
+which is used to grab the field's data from the ftrace event buffer
+(in most cases - some hist_fields such as hitcount don't directly map
+to an event field in the trace buffer - in these cases the function
+implementation gets its value from somewhere else). The flags field
+indicates which type of field it is - key, value, variable, variable
+reference, etc., with value being the default.
+
+The other important hist_data data structure in addition to the
+fields[] array is the tracing_map instance created for the histogram,
+which is held in the .map member. The tracing_map implements the
+lock-free hash table used to implement histograms (see
+kernel/trace/tracing_map.h for much more discussion about the
+low-level data structures implementing the tracing_map). For the
+purposes of this discussion, the tracing_map contains a number of
+buckets, each bucket corresponding to a particular tracing_map_elt
+object hashed by a given histogram key.
+
+Below is a diagram, the first part of which describes the hist_data and
+associated key and value fields for the histogram described above. As
+you can see, there are two fields in the fields array, one val field
+for the hitcount and one key field for the pid key.
+
+Below that is a diagram of a run-time snapshot of what the tracing_map
+might look like for a given run. It attempts to show the
+relationships between the hist_data fields and the tracing_map
+elements for a couple hypothetical keys and values.::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +----------------+
+ | .fields[] |---->| val = hitcount |----------------------------+
+ +----------------+ +----------------+ |
+ | .map | | .size | |
+ +----------------+ +--------------+ |
+ | .offset | |
+ +--------------+ |
+ | .fn() | |
+ +--------------+ |
+ . |
+ . |
+ . |
+ +----------------+ <--- n_vals |
+ | key = pid |----------------------------|--+
+ +----------------+ | |
+ | .size | | |
+ +--------------+ | |
+ | .offset | | |
+ +--------------+ | |
+ | .fn() | | |
+ +----------------+ <--- n_fields | |
+ | unused | | |
+ +----------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ n_keys = n_fields - n_vals | |
+
+The hist_data n_vals and n_fields delineate the extent of the fields[] | |
+array and separate keys from values for the rest of the code. | |
+
+Below is a run-time representation of the tracing_map part of the | |
+histogram, with pointers from various parts of the fields[] array | |
+to corresponding parts of the tracing_map. | |
+
+The tracing_map consists of an array of tracing_map_entrys and a set | |
+of preallocated tracing_map_elts (abbreviated below as map_entry and | |
+map_elt). The total number of map_entrys in the hist_data.map array = | |
+map->max_elts (actually map->map_size but only max_elts of those are | |
+used. This is a property required by the map_insert() algorithm). | |
+
+If a map_entry is unused, meaning no key has yet hashed into it, its | |
+.key value is 0 and its .val pointer is NULL. Once a map_entry has | |
+been claimed, the .key value contains the key's hash value and the | |
+.val member points to a map_elt containing the full key and an entry | |
+for each key or value in the map_elt.fields[] array. There is an | |
+entry in the map_elt.fields[] array corresponding to each hist_field | |
+in the histogram, and this is where the continually aggregated sums | |
+corresponding to each histogram value are kept. | |
+
+The diagram attempts to show the relationship between the | |
+hist_data.fields[] and the map_elt.fields[] with the links drawn | |
+between diagrams::
+
+ +-----------+ | |
+ | hist_data | | |
+ +-----------+ | |
+ | .fields | | |
+ +---------+ +-----------+ | |
+ | .map |---->| map_entry | | |
+ +---------+ +-----------+ | |
+ | .key |---> 0 | |
+ +---------+ | |
+ | .val |---> NULL | |
+ +-----------+ | |
+ | map_entry | | |
+ +-----------+ | |
+ | .key |---> pid = 999 | |
+ +---------+ +-----------+ | |
+ | .val |--->| map_elt | | |
+ +---------+ +-----------+ | |
+ . | .key |---> full key * | |
+ . +---------+ +---------------+ | |
+ . | .fields |--->| .sum (val) |<-+ |
+ +-----------+ +---------+ | 2345 | | |
+ | map_entry | +---------------+ | |
+ +-----------+ | .offset (key) |<----+
+ | .key |---> 0 | 0 | | |
+ +---------+ +---------------+ | |
+ | .val |---> NULL . | |
+ +-----------+ . | |
+ | map_entry | . | |
+ +-----------+ +---------------+ | |
+ | .key | | .sum (val) or | | |
+ +---------+ +---------+ | .offset (key) | | |
+ | .val |--->| map_elt | +---------------+ | |
+ +-----------+ +---------+ | .sum (val) or | | |
+ | map_entry | | .offset (key) | | |
+ +-----------+ +---------------+ | |
+ | .key |---> pid = 4444 | |
+ +---------+ +-----------+ | |
+ | .val | | map_elt | | |
+ +---------+ +-----------+ | |
+ | .key |---> full key * | |
+ +---------+ +---------------+ | |
+ | .fields |--->| .sum (val) |<-+ |
+ +---------+ | 65523 | |
+ +---------------+ |
+ | .offset (key) |<----+
+ | 0 |
+ +---------------+
+ .
+ .
+ .
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ map_entry = struct tracing_map_entry
+ map_elt = struct tracing_map_elt
+ map_elt.fields = struct tracing_map_field
+
+Whenever a new event occurs and it has a hist trigger associated with
+it, event_hist_trigger() is called. event_hist_trigger() first deals
+with the key: for each subkey in the key (in the above example, there
+is just one subkey corresponding to pid), the hist_field that
+represents that subkey is retrieved from hist_data.fields[] and the
+hist_field_fn_t fn() associated with that field, along with the
+field's size and offset, is used to grab that subkey's data from the
+current trace record.
+
+Once the complete key has been retrieved, it's used to look that key
+up in the tracing_map. If there's no tracing_map_elt associated with
+that key, an empty one is claimed and inserted in the map for the new
+key. In either case, the tracing_map_elt associated with that key is
+returned.
+
+Once a tracing_map_elt is available, hist_trigger_elt_update() is called.
+As the name implies, this updates the element, which basically means
+updating the element's fields. There's a tracing_map_field associated
+with each key and value in the histogram, and each of these corresponds
+to the key and value hist_fields created when the histogram was
+created. hist_trigger_elt_update() goes through each value hist_field
+and, as for the keys, uses the hist_field's fn() and size and offset
+to grab the field's value from the current trace record. Once it has
+that value, it simply adds that value to that field's
+continually-updated tracing_map_field.sum member. Some hist_field
+fn()s, such as for the hitcount, don't actually grab anything from the
+trace record (the hitcount fn() just increments the counter sum by 1),
+but the idea is the same.
+
+Once all the values have been updated, hist_trigger_elt_update() is
+done and returns. Note that there are also tracing_map_fields for
+each subkey in the key, but hist_trigger_elt_update() doesn't look at
+them or update anything - those exist only for sorting, which can
+happen later.
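+
+In rough C-like terms, the per-event flow described above can be
+sketched as below. This is a simplified model, not the actual kernel
+code: the fn() signature is condensed, only fixed-size subkeys are
+handled, and the tracing_map helpers are shown in an abbreviated form::
+
+    /*
+     * Simplified sketch of handling one event for a hist trigger.
+     * Illustrative only - the real code lives in
+     * kernel/trace/trace_events_hist.c.
+     */
+    void handle_hist_event(struct hist_trigger_data *hist_data, void *rec)
+    {
+            char compound_key[HIST_KEY_SIZE_MAX] = { 0 };
+            struct tracing_map_elt *elt;
+            unsigned int i;
+            u64 val;
+
+            /* 1) assemble the key from each key hist_field */
+            for (i = hist_data->n_vals; i < hist_data->n_fields; i++) {
+                    struct hist_field *key_field = hist_data->fields[i];
+
+                    /* grab this subkey's data from the trace record */
+                    val = key_field->fn(key_field, rec);
+                    memcpy(compound_key + key_field->offset, &val,
+                           key_field->size);
+            }
+
+            /* 2) find or claim the tracing_map_elt for that key */
+            elt = tracing_map_insert(hist_data->map, compound_key);
+            if (!elt)
+                    return;
+
+            /* 3) add each value field to that element's running sums */
+            for (i = 0; i < hist_data->n_vals; i++) {
+                    struct hist_field *val_field = hist_data->fields[i];
+
+                    /* the hitcount fn() just returns 1 */
+                    val = val_field->fn(val_field, rec);
+                    tracing_map_update_sum(elt, i, val);
+            }
+    }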
+
+Basic histogram test
+--------------------
+
+This is a good example to try. It produces 3 value fields and 2 key
+fields in the output::
+
+ # echo 'hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+To see the debug data, cat the kmem/kmalloc event's 'hist_debug' file.
+It will show the trigger info of the histogram it corresponds to, along
+with the address of the hist_data associated with the histogram, which
+will become useful in later examples. It then displays the total
+number of hist_fields associated with the histogram, along with a count
+of how many of those correspond to keys and how many to values.
+
+It then goes on to display details for each field, including the
+field's flags and the position of each field in the hist_data's
+fields[] array, which is useful for verifying whether things look
+correct internally, and which again will become even
+more useful in further examples::
+
+ # cat events/kmem/kmalloc/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=common_pid,call_site.sym:vals=hitcount,bytes_req,bytes_alloc:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000005e48c9a5
+
+ n_vals: 3
+ n_keys: 2
+ n_fields: 5
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_req
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[2]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_alloc
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: common_pid
+ type: int
+ size: 8
+ is_signed: 1
+
+ hist_data->fields[4]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: call_site
+ type: unsigned long
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+Variables
+=========
+
+Variables allow data to be saved by one hist trigger and retrieved
+by another hist trigger. For example, a trigger on the sched_waking
+event can capture a timestamp for a particular pid, and a later
+sched_switch event that switches to that pid can grab the timestamp
+and use it to calculate a time delta between
+the two events::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >>
+ events/sched/sched_switch/trigger
+
+In terms of the histogram data structures, variables are implemented
+as another type of hist_field and for a given hist trigger are added
+to the hist_data.fields[] array just after all the val fields. To
+distinguish them from the existing key and val fields, they're given a
+new flag type, HIST_FIELD_FL_VAR (abbreviated FL_VAR) and they also
+make use of a new .var.idx field member in struct hist_field, which
+maps them to an index in a new map_elt.vars[] array added to the
+map_elt specifically designed to store and retrieve variable values.
+The diagram below shows those new elements and adds a new variable
+entry, ts0, corresponding to the ts0 variable in the sched_waking
+trigger above.
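+
+In code terms, once the tracing_map_elt for the current key is in
+hand, setting or reading a variable reduces to indexing that element's
+vars[] array with the hist_field's var.idx. A minimal sketch follows;
+tracing_map_read_var() is mentioned later in this document, and
+tracing_map_set_var() is assumed here as its storing counterpart::
+
+    /* var_field is a hist_field with HIST_FIELD_FL_VAR set */
+
+    /* store, e.g. ts0 = common_timestamp.usecs for the current key */
+    tracing_map_set_var(elt, var_field->var.idx, timestamp);
+
+    /* read it back later, e.g. when another trigger references it */
+    u64 ts0 = tracing_map_read_var(elt, var_field->var.idx);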
+
+sched_waking histogram
+----------------------::
+
+ +------------------+
+ | hist_data |<-------------------------------------------------------+
+ +------------------+ +-------------------+ |
+ | .fields[] |-->| val = hitcount | |
+ +----------------+ +-------------------+ |
+ | .map | | .size | |
+ +----------------+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags | |
+ +-----------------+ |
+ | .var.idx | |
+ +-------------------+ |
+ | var = ts0 | |
+ +-------------------+ |
+ | .size | |
+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags & FL_VAR | |
+ +-----------------+ |
+ | .var.idx |----------------------------+-+ |
+ +-----------------+ | | |
+ . | | |
+ . | | |
+ . | | |
+ +-------------------+ <--- n_vals | | |
+ | key = pid | | | |
+ +-------------------+ | | |
+ | .size | | | |
+ +-----------------+ | | |
+ | .offset | | | |
+ +-----------------+ | | |
+ | .fn() | | | |
+ +-----------------+ | | |
+ | .flags & FL_KEY | | | |
+ +-----------------+ | | |
+ | .var.idx | | | |
+ +-------------------+ <--- n_fields | | |
+ | unused | | | |
+ +-------------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ n_keys = n_fields - n_vals | | |
+ | | |
+
+This is very similar to the basic case. In the above diagram, we can | | |
+see a new .flags member has been added to the struct hist_field | | |
+struct, and a new entry added to hist_data.fields representing the ts0 | | |
+variable. For a normal val hist_field, .flags is just 0 (modulo | | |
+modifier flags), but if the value is defined as a variable, the .flags | | |
+contains a set FL_VAR bit. | | |
+
+As you can see, the ts0 entry's .var.idx member contains the index | | |
+into the tracing_map_elts' .vars[] array containing variable values. | | |
+This idx is used whenever the value of the variable is set or read. | | |
+The map_elt.vars idx assigned to the given variable is assigned and | | |
+saved in .var.idx by create_tracing_map_fields() after it calls | | |
+tracing_map_add_var(). | | |
+
+Below is a representation of the histogram at run-time, which | | |
+populates the map, along with correspondence to the above hist_data and | | |
+hist_field data structures. | | |
+
+The diagram attempts to show the relationship between the | | |
+hist_data.fields[] and the map_elt.fields[] and map_elt.vars[] with | | |
+the links drawn between diagrams. For each of the map_elts, you can | | |
+see that the .fields[] members point to the .sum or .offset of a key | | |
+or val and the .vars[] members point to the value of a variable. The | | |
+arrows between the two diagrams show the linkages between those | | |
+tracing_map members and the field definitions in the corresponding | | |
+hist_data fields[] members.::
+
+ +-----------+ | | |
+ | hist_data | | | |
+ +-----------+ | | |
+ | .fields | | | |
+ +---------+ +-----------+ | | |
+ | .map |---->| map_entry | | | |
+ +---------+ +-----------+ | | |
+ | .key |---> 0 | | |
+ +---------+ | | |
+ | .val |---> NULL | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 999 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ . +---------+ | 2345 | | | |
+ . +--| .vars | +---------------+ | | |
+ . | +---------+ | .offset (key) | | | |
+ . | | 0 | | | |
+ . | +---------------+ | | |
+ . | . | | |
+ . | . | | |
+ . | . | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | | |
+ . +---------------->+---------------+ | | |
+ . | ts0 |<--+ | |
+ . | 113345679876 | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . . | | |
+ . . | | |
+ . . | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 4444 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ +---------+ | 2345 | | | |
+ +--| .vars | +---------------+ | | |
+ | +---------+ | .offset (key) | | | |
+ | | 0 | | | |
+ | +---------------+ | | |
+ | . | | |
+ | . | | |
+ | . | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | | |
+ | +---------------+ | | |
+ +---------------->| ts0 |<--+ | |
+ | 213499240729 | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ . | |
+ . | |
+ . | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+
+For each used map entry, there's a map_elt pointing to an array of | |
+.vars containing the current value of the variables associated with | |
+that histogram entry. So in the above, the timestamp associated with | |
+pid 999 is 113345679876, and the timestamp variable in the same | |
+.var.idx for pid 4444 is 213499240729. | |
+
+sched_switch histogram | |
+---------------------- | |
+
+The sched_switch histogram paired with the above sched_waking | |
+histogram is shown below. The most important aspect of the | |
+sched_switch histogram is that it references a variable on the | |
+sched_waking histogram above. | |
+
+The histogram diagram is very similar to the others so far displayed, | |
+but it adds variable references. You can see the normal hitcount and | |
+key fields along with a new wakeup_lat variable implemented in the | |
+same way as the sched_waking ts0 variable, but in addition there's an | |
+entry with the new FL_VAR_REF (short for HIST_FIELD_FL_VAR_REF) flag. | |
+
+Associated with the new var ref field are a couple of new hist_field | |
+members, var.hist_data and var_ref_idx. For a variable reference, the | |
+var.hist_data goes with the var.idx, which together uniquely identify | |
+a particular variable on a particular histogram. The var_ref_idx is | |
+just the index into the var_ref_vals[] array that caches the values of | |
+each variable whenever a hist trigger is updated. Those resulting | |
+values are then finally accessed by other code such as trace action | |
+code that uses the var_ref_idx values to assign param values. | |
+
+The diagram below describes the situation for the sched_switch | |
+histogram referred to before::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> | |
+ events/sched/sched_switch/trigger | |
+ | |
+ +------------------+ | |
+ | hist_data | | |
+ +------------------+ +-----------------------+ | |
+ | .fields[] |-->| val = hitcount | | |
+ +----------------+ +-----------------------+ | |
+ | .map | | .size | | |
+ +----------------+ +---------------------+ | |
+ +--| .var_refs[] | | .offset | | |
+ | +----------------+ +---------------------+ | |
+ | | .fn() | | |
+ | var_ref_vals[] +---------------------+ | |
+ | +-------------+ | .flags | | |
+ | | $ts0 |<---+ +---------------------+ | |
+ | +-------------+ | | .var.idx | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var.hist_data | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var_ref_idx | | |
+ | | | | +-----------------------+ | |
+ | +-------------+ | | var = wakeup_lat | | |
+ | . | +-----------------------+ | |
+ | . | | .size | | |
+ | . | +---------------------+ | |
+ | +-------------+ | | .offset | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .fn() | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .flags & FL_VAR | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +---------------------+ | |
+ | | | .var.hist_data | | |
+ | | +---------------------+ | |
+ | | | .var_ref_idx | | |
+ | | +---------------------+ | |
+ | | . | |
+ | | . | |
+ | | . | |
+ | | +-----------------------+ <--- n_vals | |
+ | | | key = pid | | |
+ | | +-----------------------+ | |
+ | | | .size | | |
+ | | +---------------------+ | |
+ | | | .offset | | |
+ | | +---------------------+ | |
+ | | | .fn() | | |
+ | | +---------------------+ | |
+ | | | .flags | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +-----------------------+ <--- n_fields | |
+ | | | unused | | |
+ | | +-----------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | n_keys = n_fields - n_vals | |
+ | | | |
+ | | | |
+ | | +-----------------------+ | |
+ +---------------------->| var_ref = $ts0 | | |
+ | +-----------------------+ | |
+ | | .size | | |
+ | +---------------------+ | |
+ | | .offset | | |
+ | +---------------------+ | |
+ | | .fn() | | |
+ | +---------------------+ | |
+ | | .flags & FL_VAR_REF | | |
+ | +---------------------+ | |
+ | | .var.idx |--------------------------+ |
+ | +---------------------+ |
+ | | .var.hist_data |----------------------------+
+ | +---------------------+
+ +---| .var_ref_idx |
+ +---------------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+When a hist trigger makes use of a variable, a new hist_field is
+created with flag HIST_FIELD_FL_VAR_REF. For a VAR_REF field, the
+var.idx and var.hist_data take the same values as the referenced
+variable, as well as the referenced variable's size, type, and
+is_signed values. The VAR_REF field's .name is set to the name of the
+variable it references. If a variable reference was created using the
+explicit system.event.$var_ref notation, the hist_field's system and
+event_name variables are also set.
+
+So, in order to handle an event for the sched_switch histogram,
+because we have a reference to a variable on another histogram, we
+need to resolve all variable references first. This is done via the
+resolve_var_refs() calls made from event_hist_trigger(). What this
+does is grab the var_refs[] array from the hist_data representing the
+sched_switch histogram. For each one of those, the referenced
+variable's var.hist_data along with the current key is used to look up
+the corresponding tracing_map_elt in that histogram. Once found, the
+referenced variable's var.idx is used to look up the variable's value
+using tracing_map_read_var(elt, var.idx), which yields the value of
+the variable for that element, ts0 in the case above. Note that
+the hist_fields representing both the variable and the variable
+reference have the same var.idx, so this is straightforward.
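+
+A condensed sketch of that resolution step (simplified names and
+signatures - not the exact kernel code)::
+
+    /*
+     * Fill var_ref_vals[] with the current value of every variable
+     * this histogram references.  Illustrative only.
+     */
+    bool resolve_var_refs(struct hist_trigger_data *hist_data, void *key,
+                          u64 *var_ref_vals)
+    {
+            unsigned int i;
+
+            for (i = 0; i < hist_data->n_var_refs; i++) {
+                    struct hist_field *ref = hist_data->var_refs[i];
+                    /* the histogram the referenced variable lives on */
+                    struct hist_trigger_data *var_hist = ref->var.hist_data;
+                    struct tracing_map_elt *elt;
+
+                    /* look up the element for the current key there */
+                    elt = tracing_map_lookup(var_hist->map, key);
+                    if (!elt)
+                            return false;   /* no matching entry yet */
+
+                    /* the variable and its reference share the same var.idx */
+                    var_ref_vals[ref->var_ref_idx] =
+                            tracing_map_read_var(elt, ref->var.idx);
+            }
+
+            return true;
+    }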
+
+Variable and variable reference test
+------------------------------------
+
+This example creates a variable on the sched_waking event, ts0, and
+uses it in the sched_switch trigger. The sched_switch trigger also
+creates its own variable, wakeup_lat, but nothing yet uses it::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking 'hist_debug' output, in addition to the
+normal key and value hist_fields, in the val fields section we see a
+field with the HIST_FIELD_FL_VAR flag, which indicates that that field
+represents a variable. Note that in addition to the variable name,
+contained in the var.name field, it includes the var.idx, which is the
+index into the tracing_map_elt.vars[] array of the actual variable
+location. Note also that the output shows that variables live in the
+same part of the hist_data->fields[] array as normal values::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 000000009536f554
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Moving on to the sched_switch trigger hist_debug output, in addition
+to the unused wakeup_lat variable, we see a new section displaying
+variable references. Variable references are displayed in a separate
+section because in addition to being logically separate from
+variables and values, they actually live in a separate hist_data
+array, var_refs[].
+
+In this example, the sched_switch trigger has a reference to a
+variable on the sched_waking trigger, $ts0. Looking at the details,
+we can see that the var.hist_data value of the referenced variable
+matches the previously displayed sched_waking trigger, and the var.idx
+value matches the previously displayed var.idx value for that
+variable. Also displayed is the var_ref_idx value for that variable
+reference, which is where the value for that variable is cached for
+use when the trigger is invoked::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000f4ee8006
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000009536f554
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Actions and Handlers
+====================
+
+Adding onto the previous example, we will now do something with that
+wakeup_lat variable, namely send it and another field as a synthetic
+event.
+
+The onmatch() handler below basically says that whenever we have a
+sched_switch event, if we have a matching sched_waking event, in this
+case if we have a pid in the sched_waking histogram that matches the
+next_pid field on this sched_switch event, we retrieve the
+variables specified in the wakeup_latency() trace action, and use
+them to generate a new wakeup_latency event into the trace stream.
+
+Note that the way the trace handlers such as wakeup_latency() (which
+could equivalently be written trace(wakeup_latency,$wakeup_lat,next_pid))
+are implemented, the parameters specified to the trace handler must be
+variables. In this case, $wakeup_lat is obviously a variable, but
+next_pid isn't, since it's just naming a field in the sched_switch
+trace event. Since this is something that almost every trace() and
+save() action does, a special shortcut is implemented to allow field
+names to be used directly in those cases. How it works is that under
+the covers, a temporary variable is created for the named field, and
+this variable is what is actually passed to the trace handler. In the
+code and documentation, this type of variable is called a 'field
+variable'.
+
+Fields on other trace events' histograms can be used as well. In that
+case we have to generate a new histogram and an unfortunately named
+'synthetic_field' (the use of synthetic here has nothing to do with
+synthetic events) and use that special histogram field as a variable.
+
+The diagram below illustrates the new elements described above in the
+context of the sched_switch histogram using the onmatch() handler and
+the trace() action.
+
+First, we define the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events
+
+Next, the sched_waking hist trigger as before::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+Finally, we create a hist trigger on the sched_switch event that
+generates a wakeup_latency() trace event. In this case we pass
+next_pid into the wakeup_latency synthetic event invocation, which
+means it will be automatically converted into a field variable::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid)' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+The diagram for the sched_switch event is similar to previous examples
+but shows the additional field_vars[] array for hist_data and shows
+the linkages between the field_vars and the variables and references
+created to implement the field variables. The details are discussed
+below::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +-----------------------+
+ | .fields[] |-->| val = hitcount |
+ +----------------+ +-----------------------+
+ | .map | | .size |
+ +----------------+ +---------------------+
+ +---| .field_vars[] | | .offset |
+ | +----------------+ +---------------------+
+ |+--| .var_refs[] | | .offset |
+ || +----------------+ +---------------------+
+ || | .fn() |
+ || var_ref_vals[] +---------------------+
+ || +-------------+ | .flags |
+ || | $ts0 |<---+ +---------------------+
+ || +-------------+ | | .var.idx |
+ || | $next_pid |<-+ | +---------------------+
+ || +-------------+ | | | .var.hist_data |
+ ||+>| $wakeup_lat | | | +---------------------+
+ ||| +-------------+ | | | .var_ref_idx |
+ ||| | | | | +-----------------------+
+ ||| +-------------+ | | | var = wakeup_lat |
+ ||| . | | +-----------------------+
+ ||| . | | | .size |
+ ||| . | | +---------------------+
+ ||| +-------------+ | | | .offset |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .fn() |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .flags & FL_VAR |
+ ||| | | +---------------------+
+ ||| | | | .var.idx |
+ ||| | | +---------------------+
+ ||| | | | .var.hist_data |
+ ||| | | +---------------------+
+ ||| | | | .var_ref_idx |
+ ||| | | +---------------------+
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| +--------------+ | | .
+ +-->| field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +--------------+ | | .
+ || | field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +------------+ | | .
+ || . | | .
+ || . | | .
+ || . | | +-----------------------+ <--- n_vals
+ || +--------------+ | | | key = pid |
+ || | field_var | | | +-----------------------+
+ || +--------------+ | | | .size |
+ || | var |--+| +---------------------+
+ || +------------+ ||| | .offset |
+ || | val |-+|| +---------------------+
+ || +------------+ ||| | .fn() |
+ || ||| +---------------------+
+ || ||| | .flags |
+ || ||| +---------------------+
+ || ||| | .var.idx |
+ || ||| +---------------------+ <--- n_fields
+ || |||
+ || ||| n_keys = n_fields - n_vals
+ || ||| +-----------------------+
+ || |+->| var = next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .flags & FL_VAR |
+ || | | +---------------------+
+ || | | | .var.idx |
+ || | | +---------------------+
+ || | | | .var.hist_data |
+ || | | +-----------------------+
+ || +-->| val for next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .fn() |
+ || | | +---------------------+
+ || | | | .flags |
+ || | | +---------------------+
+ || | | | |
+ || | | +---------------------+
+ || | |
+ || | |
+ || | | +-----------------------+
+ +|------------------|-|>| var_ref = $ts0 |
+ | | | +-----------------------+
+ | | | | .size |
+ | | | +---------------------+
+ | | | | .offset |
+ | | | +---------------------+
+ | | | | .fn() |
+ | | | +---------------------+
+ | | | | .flags & FL_VAR_REF |
+ | | | +---------------------+
+ | | +---| .var_ref_idx |
+ | | +-----------------------+
+ | | | var_ref = $next_pid |
+ | | +-----------------------+
+ | | | .size |
+ | | +---------------------+
+ | | | .offset |
+ | | +---------------------+
+ | | | .fn() |
+ | | +---------------------+
+ | | | .flags & FL_VAR_REF |
+ | | +---------------------+
+ | +-----| .var_ref_idx |
+ | +-----------------------+
+ | | var_ref = $wakeup_lat |
+ | +-----------------------+
+ | | .size |
+ | +---------------------+
+ | | .offset |
+ | +---------------------+
+ | | .fn() |
+ | +---------------------+
+ | | .flags & FL_VAR_REF |
+ | +---------------------+
+ +------------------------| .var_ref_idx |
+ +---------------------+
+
+As you can see, for a field variable, two hist_fields are created: one
+representing the variable, in this case next_pid, and one to actually
+get the value of the field from the trace stream, like a normal val
+field does. These are created separately from normal variable
+creation and are saved in the hist_data->field_vars[] array. See
+below for how these are used. In addition, a reference hist_field is
+also created, which is needed to reference field variables such as the
+$next_pid variable in the trace() action.
+
+Note that $wakeup_lat is also a variable reference, referencing the
+value of the expression common_timestamp-$ts0, and so also needs to
+have a hist field entry representing that reference created.
+
+When hist_trigger_elt_update() is called to update the normal key and
+value fields, it also calls update_field_vars(), which goes through
+each field_var created for the histogram (available from
+hist_data->field_vars), calls val->fn() to get the data from the
+current trace record, and then uses the var's var.idx to store that
+value in the appropriate tracing_map_elt at elt->vars[var.idx].
+
+Once all the variables have been updated, resolve_var_refs() can be
+called from event_hist_trigger(), and not only can our $ts0 and
+$next_pid references be resolved but the $wakeup_lat reference as
+well. At this point, the trace() action can simply access the values
+assembled in the var_ref_vals[] array and generate the trace event.
+
+The same process occurs for the field variables associated with the
+save() action.
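+
+A short sketch of the field-variable update step described above
+(illustrative only; string fields are handled specially in the real
+code, and the val->fn() signature is condensed here)::
+
+    /*
+     * For each field variable, grab the field's value from the current
+     * record and store it in the element's vars[] slot for that variable.
+     */
+    static void update_field_vars(struct hist_trigger_data *hist_data,
+                                  struct tracing_map_elt *elt, void *rec)
+    {
+            unsigned int i;
+
+            for (i = 0; i < hist_data->n_field_vars; i++) {
+                    struct field_var *field_var = hist_data->field_vars[i];
+                    u64 val = field_var->val->fn(field_var->val, rec);
+
+                    tracing_map_set_var(elt, field_var->var->var.idx, val);
+            }
+    }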
+
+Abbreviations used in the diagram::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ field_var = struct field_var
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+trace() action field variable test
+----------------------------------
+
+This example adds to the previous test example by finally making use
+of the wakeup_lat variable, but in addition also creates a couple of
+field variables that then are all passed to the wakeup_latency() trace
+action via the onmatch() handler.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, the sched_waking trigger from previous examples::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Finally, as in the previous test example, we calculate and assign the
+wakeup latency using the $ts0 reference from the sched_waking trigger
+to the wakeup_lat variable, and finally use it along with a couple
+sched_switch event fields, next_pid and next_comm, to generate a
+wakeup_latency trace event. The next_pid and next_comm event fields
+are automatically converted into field variables for this purpose::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test example::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000d60ff61f
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows the same key and value fields
+as in the previous test example - note that wakeup_lat is still in the
+val fields section, but that the new field variables are not there -
+although the field variables are variables, they're held separately in
+the hist_data's field_vars[] array. Although the field variables and
+the normal variables are located in separate places, you can see that
+the actual variable locations for those variables in the
+tracing_map_elt.vars[] do have increasing indices as expected:
+wakeup_lat takes the var.idx = 0 slot, while the field variables for
+next_pid and next_comm have values var.idx = 1, and var.idx = 2. Note
+also that those are the same values displayed for the variable
+references corresponding to those variables in the variable reference
+fields section. Since there are two triggers and thus two hist_data
+addresses, those addresses also need to be accounted for when doing
+the matching - you can see that the first variable reference refers
+to var.idx 0 on the previous (sched_waking) hist trigger (see the
+hist_data address associated with that trigger), while the second
+refers to var.idx 0 on the sched_switch hist trigger itself; the
+remaining variable references also refer to variables on the
+sched_switch hist trigger.
+
+Finally, the action tracking variables section just shows the system
+and event name for the onmatch() handler::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000008f551b7
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000d60ff61f
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->field_vars[1]:
+
+ field_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[1].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+action_data and the trace() action
+----------------------------------
+
+As mentioned above, when the trace() action generates a synthetic
+event, all the parameters to the synthetic event either already are
+variables or are converted into variables (via field variables), and
+finally all those variable values are collected via references to them
+into a var_ref_vals[] array.
+
+The values in the var_ref_vals[] array, however, don't necessarily
+follow the same ordering as the synthetic event params. To address
+that, struct action_data contains another array, var_ref_idx[], which
+maps the trace action params to the var_ref_vals[] values. Below is a
+diagram illustrating that for the wakeup_latency() synthetic event::
+
+ +------------------+ wakeup_latency()
+ | action_data | event params var_ref_vals[]
+ +------------------+ +-----------------+ +-----------------+
+ | .var_ref_idx[] |--->| $wakeup_lat idx |---+ | |
+ +----------------+ +-----------------+ | +-----------------+
+ | .synth_event | | $next_pid idx |---|-+ | $wakeup_lat val |
+ +----------------+ +-----------------+ | | +-----------------+
+ . | +->| $next_pid val |
+ . | +-----------------+
+ . | .
+ +-----------------+ | .
+ | | | .
+ +-----------------+ | +-----------------+
+ +--->| $wakeup_lat val |
+ +-----------------+
+
+Basically, how this ends up getting used in the synthetic event probe
+function, trace_event_raw_event_synth(), is as follows::
+
+ for each field i in .synth_event
+ val_idx = .var_ref_idx[i]
+ val = var_ref_vals[val_idx]
+
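+The same mapping, expressed as a C-like fragment (illustrative only;
+write_synth_field() is a made-up helper standing in for the code that
+fills in the generated event)::
+
+    for (i = 0; i < event->n_fields; i++) {
+            unsigned int val_idx = data->var_ref_idx[i];
+            u64 val = var_ref_vals[val_idx];
+
+            /* the i-th synthetic event field gets the value cached for it */
+            write_synth_field(entry, i, val);
+    }
+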
+action_data and the onXXX() handlers
+------------------------------------
+
+The hist trigger onXXX() handlers other than onmatch(), such as onmax()
+and onchange(), also make use of and internally create hidden
+variables. This information is contained in the
+action_data.track_data struct, and is also visible in the hist_debug
+output as will be described in the example below.
+
+Typically, the onmax() or onchange() handlers are used in conjunction
+with the save() and snapshot() actions. For example::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+or::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).snapshot()' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+save() action field variable test
+---------------------------------
+
+For this example, instead of generating a synthetic event, the save()
+action is used to save field values whenever an onmax() handler
+detects that a new max latency has been hit. As in the previous
+example, the values being saved are also field values, but in this
+case, are kept in a separate hist_data array named save_vars[].
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+In this case, however, we set up the sched_switch trigger to save some
+sched_switch field values whenever we hit a new maximum latency. For
+both the onmax() handler and save() action, variables will be created,
+which we can use the hist_debug files to examine::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test examples::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000e6290f48
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The output of the sched_switch trigger shows the same val and key
+values as before, but also shows a couple of new sections.
+
+First, the action tracking variables section now shows the
+actions[].track_data information describing the special tracking
+variables and references used to track, in this case, the running
+maximum value. The actions[].track_data.var_ref member contains the
+reference to the variable being tracked, in this case the $wakeup_lat
+variable. In order to perform the onmax() handler function, there
+also needs to be a variable that tracks the current maximum by getting
+updated whenever a new maximum is hit. In this case, we can see that
+an auto-generated variable named '__max' has been created and is
+visible in the actions[].track_data.track_var variable.
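+
+Conceptually, the per-event onmax() check then looks something like
+the sketch below (illustrative only; update_save_vars() is a made-up
+name standing in for the code that performs the save() action)::
+
+    /* the tracked value, e.g. $wakeup_lat, has already been resolved
+     * into var_ref_vals[] via its variable reference
+     */
+    u64 track_val = var_ref_vals[track_data->var_ref->var_ref_idx];
+    u64 cur_max = tracing_map_read_var(elt, track_data->track_var->var.idx);
+
+    if (track_val > cur_max) {
+            /* new maximum: remember it in the hidden '__max' variable ... */
+            tracing_map_set_var(elt, track_data->track_var->var.idx, track_val);
+
+            /* ... and run the save() action to capture the named fields */
+            update_save_vars(hist_data, elt, rec);
+    }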
+
+Finally, in the new 'save action variables' section, we can see that
+the 4 params to the save() function have resulted in 4 field variables
+being created for the purposes of saving the values of the named
+fields when the max is hit. These variables are kept in a separate
+save_vars[] array off of hist_data, so are displayed in a separate
+section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) [active]
+ #
+
+ hist_data: 0000000057bcd28d
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000e6290f48
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].track_data.var_ref:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->actions[0].track_data.track_var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: __max
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ save action variables (save() params):
+
+ hist_data->save_vars[0]:
+
+ save_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ save_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ hist_data->save_vars[1]:
+
+ save_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_pid
+ var.idx (into tracing_map_elt.vars[]): 3
+
+ save_vars[1].val:
+ ftrace_event_field name: prev_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[2]:
+
+ save_vars[2].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_prio
+ var.idx (into tracing_map_elt.vars[]): 4
+
+ save_vars[2].val:
+ ftrace_event_field name: prev_prio
+ type: int
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[3]:
+
+ save_vars[3].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_comm
+ var.idx (into tracing_map_elt.vars[]): 5
+
+ save_vars[3].val:
+ ftrace_event_field name: prev_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+A couple special cases
+======================
+
+While the above covers the basics of the histogram internals, there
+are a couple of special cases that should be discussed, since they
+tend to create even more confusion. Those are field variables on other
+histograms, and aliases, both described below through example tests
+using the hist_debug files.
+
+Test of field variables on other histograms
+-------------------------------------------
+
+This example is similar to the previous examples, but in this case,
+the sched_switch trigger references a hist trigger field on another
+event, namely the sched_waking event. In order to accomplish this, a
+field variable is created for the other event. Since an existing
+histogram can't be used (existing histograms are immutable), a new
+histogram with a matching variable is created and used instead, and
+we'll see that reflected in the hist_debug output shown below.
+
+First, we create the wakeup_latency synthetic event. Note the
+addition of the prio field::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Here we set up a hist trigger on sched_switch to send a wakeup_latency
+event using an onmatch handler naming the sched_waking event. Note
+that the third param being passed to the wakeup_latency() is prio,
+which is a field name that needs to have a field variable created for
+it. There isn't, however, any prio field on the sched_switch event, so
+it would seem that it wouldn't be possible to create a field variable
+for it. The matching sched_waking event does have a prio field, so it
+should be possible to make use of it for this purpose. The problem
+with that is that it's not currently possible to define a new variable
+on an existing histogram, so it's not possible to add a new prio field
+variable to the existing sched_waking histogram. It is, however,
+possible to create an additional new 'matching' sched_waking histogram
+for the same event, meaning that it uses the same key and filters, and
+define the new prio field variable on that.
+
+Here's the sched_switch trigger::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+And here's the output of the hist_debug information for the
+sched_waking hist trigger. Note that there are two histograms
+displayed in the output: the first is the normal sched_waking
+histogram we've seen in the previous examples, and the second is the
+special histogram we created to provide the prio field variable.
+
+Looking at the second histogram below, we see a variable with the name
+synthetic_prio. This is the field variable created for the prio field
+on that sched_waking histogram::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000349570e4
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:synthetic_prio=prio:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000006920cf38
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: prio
+ var.name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: int
+ size: 4
+ is_signed: 1
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Looking at the sched_switch histogram below, we can see a reference to
+the synthetic_prio variable on sched_waking, and looking at the
+associated hist_data address we see that it is indeed associated with
+the new histogram. Note also that the other references are to a
+normal variable, wakeup_lat, and to a normal field variable, next_pid,
+the details of which are in the field variables section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio) [active]
+ #
+
+ hist_data: 00000000a73b67df
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000349570e4
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000006920cf38
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: int
+ size: 4
+ is_signed: 1
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+Alias test
+----------
+
+This example is very similar to previous examples, but demonstrates
+the alias flag.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, we create a sched_waking trigger similar to previous examples,
+but in this case we save the pid in the waking_pid variable::
+
+ # echo 'hist:keys=pid:waking_pid=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+For the sched_switch trigger, instead of using $waking_pid directly in
+the wakeup_latency synthetic event invocation, we create an alias of
+$waking_pid named $woken_pid, and use that in the synthetic event
+invocation instead::
+
+ # echo 'hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking hist_debug output, in addition to the
+normal fields, we can see the waking_pid variable::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:waking_pid=pid,ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000a250528c
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: pid
+ var.name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows that a variable named
+woken_pid has been created but that it also has the
+HIST_FIELD_FL_ALIAS flag set. It also has the HIST_FIELD_FL_VAR flag
+set, which is why it appears in the val field section.
+
+Despite that implementation detail, an alias variable is actually more
+like a variable reference; in fact it can be thought of as a reference
+to a reference. The implementation copies the var_ref->fn() from the
+variable reference being referenced, in this case, the waking_pid
+fn(), which is hist_field_var_ref(), and makes that the fn() of the
+alias. The hist_field_var_ref() fn() requires the var_ref_idx of the
+variable reference it's using, so waking_pid's var_ref_idx is also
+copied to the alias. The end result is that when the value of alias
+is retrieved, in the end it just does the same thing the original
+reference would have done and retrieves the same value from the
+var_ref_vals[] array. You can verify this in the output by noting
+that the var_ref_idx of the alias, in this case woken_pid, is the same
+as the var_ref_idx of the reference, waking_pid, in the variable
+reference fields section.
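+
+In code terms, creating the alias amounts to copying a couple of
+members from the reference it wraps (a rough sketch, not the exact
+kernel code)::
+
+    /* alias and var_ref are both struct hist_field pointers */
+    alias->fn          = var_ref->fn;           /* hist_field_var_ref() */
+    alias->var_ref_idx = var_ref->var_ref_idx;  /* same var_ref_vals[] slot */
+    alias->flags      |= HIST_FIELD_FL_VAR | HIST_FIELD_FL_ALIAS;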
+
+Additionally, once it gets that value, since it is also a variable, it
+then saves that value into the vars[] slot given by its var.idx. So
+the var.idx of the woken_pid alias is 0, which it fills with the value
+from var_ref_idx 0
+when its fn() is called to update itself. You'll also notice that
+there's a woken_pid var_ref in the variable refs section. That is the
+reference to the woken_pid alias variable, and you can see that it
+retrieves the value from the same var.idx as the woken_pid alias, 0,
+and then in turn saves that value in its own var_ref_idx slot, 3, and
+the value at this position is finally what gets assigned to the
+$woken_pid slot in the trace event invocation::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:woken_pid=$waking_pid,wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000055d65ed0
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ HIST_FIELD_FL_ALIAS
+ var.name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[4]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 4
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 8408670d0328..87bd772836c0 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -39,7 +39,7 @@ Documentation written by Tom Zanussi
will use the event's kernel stacktrace as the key. The keywords
'keys' or 'key' can be used to specify keys, and the keywords
'values', 'vals', or 'val' can be used to specify values. Compound
- keys consisting of up to two fields can be specified by the 'keys'
+ keys consisting of up to three fields can be specified by the 'keys'
keyword. Hashing a compound key produces a unique entry in the
table for each unique combination of component keys, and can be
useful for providing more fine-grained summaries of event data.
@@ -70,15 +70,16 @@ Documentation written by Tom Zanussi
modified by appending any of the following modifiers to the field
name:
- =========== ==========================================
- .hex display a number as a hex value
- .sym display an address as a symbol
- .sym-offset display an address as a symbol and offset
- .syscall display a syscall id as a system call name
- .execname display a common_pid as a program name
- .log2 display log2 value rather than raw number
- .usecs display a common_timestamp in microseconds
- =========== ==========================================
+ ============= =================================================
+ .hex display a number as a hex value
+ .sym display an address as a symbol
+ .sym-offset display an address as a symbol and offset
+ .syscall display a syscall id as a system call name
+ .execname display a common_pid as a program name
+ .log2 display log2 value rather than raw number
+ .buckets=size display grouping of values rather than raw number
+ .usecs display a common_timestamp in microseconds
+ ============= =================================================
Note that in general the semantics of a given field aren't
interpreted when applying a modifier to it, but there are some
@@ -191,7 +192,7 @@ Documentation written by Tom Zanussi
with the event, in nanoseconds. May be
modified by .usecs to have timestamps
interpreted as microseconds.
- cpu int the cpu on which the event occurred.
+ common_cpu int the cpu on which the event occurred.
====================== ==== =======================================
Extended error information
@@ -228,7 +229,7 @@ Extended error information
that lists the total number of bytes requested for each function in
the kernel that made one or more calls to kmalloc::
- # echo 'hist:key=call_site:val=bytes_req' > \
+ # echo 'hist:key=call_site:val=bytes_req.buckets=32' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
This tells the tracing system to create a 'hist' trigger using the
@@ -411,7 +412,7 @@ Extended error information
Because the default sort key above is 'hitcount', the above shows a
the list of call_sites by increasing hitcount, so that at the bottom
we see the functions that made the most kmalloc calls during the
- run. If instead we we wanted to see the top kmalloc callers in
+ run. If instead we wanted to see the top kmalloc callers in
terms of the number of bytes requested rather than the number of
calls, and we wanted the top caller to appear at the top, we can use
the 'sort' parameter, along with the 'descending' modifier::
@@ -1495,7 +1496,7 @@ Extended error information
#
{ stacktrace:
- _do_fork+0x18e/0x330
+ kernel_clone+0x18e/0x330
kernel_thread+0x29/0x30
kthreadd+0x154/0x1b0
ret_from_fork+0x3f/0x70
@@ -1588,7 +1589,7 @@ Extended error information
SYSC_sendto+0xef/0x170
} hitcount: 88
{ stacktrace:
- _do_fork+0x18e/0x330
+ kernel_clone+0x18e/0x330
SyS_clone+0x19/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 244
@@ -1762,6 +1763,21 @@ using the same key and variable from yet another event::
# echo 'hist:key=pid:wakeupswitch_lat=$wakeup_lat+$switchtime_lat ...' >> event3/trigger
+Expressions support the use of addition, subtraction, multiplication and
+division operators (+-\*/).
+
+Note that if division by zero cannot be detected at parse time (i.e. the
+divisor is not a constant), the result will be -1.
+
+Numeric constants can also be used directly in an expression::
+
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/1000000 ...' >> event/trigger
+
+or assigned to a variable and referenced in a subsequent expression::
+
+ # echo 'hist:keys=next_pid:us_per_sec=1000000 ...' >> event/trigger
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/$us_per_sec ...' >> event/trigger
+
2.2.2 Synthetic Events
----------------------
@@ -1776,6 +1792,24 @@ consisting of the name of the new event along with one or more
variables and their types, which can be any valid field type,
separated by semicolons, to the tracing/synthetic_events file.
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_name contains [] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+A string field can be specified using either the static notation::
+
+ char name[32];
+
+Or the dynamic::
+
+ char name[];
+
+The size limit for either is 256.
+
For instance, the following creates a new event named 'wakeup_latency'
with 3 fields: lat, pid, and prio. Each of those fields is simply a
variable reference to a variable on another event::
@@ -1805,20 +1839,99 @@ and variables defined on other events (see Section 2.2.3 below on
how that is done using hist trigger 'onmatch' action). Once that is
done, the 'wakeup_latency' synthetic event instance is created.
-A histogram can now be defined for the new synthetic event::
-
- # echo 'hist:keys=pid,prio,lat.log2:sort=pid,lat' >> \
- /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
-
The new event is created under the tracing/events/synthetic/ directory
and looks and behaves just like any other event::
# ls /sys/kernel/debug/tracing/events/synthetic/wakeup_latency
enable filter format hist id trigger
+A histogram can now be defined for the new synthetic event::
+
+ # echo 'hist:keys=pid,prio,lat.log2:sort=lat' >> \
+ /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
+
+The above shows the latency "lat" in a power of 2 grouping.
+
Like any other event, once a histogram is enabled for the event, the
output can be displayed by reading the event's 'hist' file.
+ # cat /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid,prio,lat.log2:vals=hitcount:sort=lat.log2:size=2048 [active]
+ #
+
+ { pid: 2035, prio: 9, lat: ~ 2^2 } hitcount: 43
+ { pid: 2034, prio: 9, lat: ~ 2^2 } hitcount: 60
+ { pid: 2029, prio: 9, lat: ~ 2^2 } hitcount: 965
+ { pid: 2034, prio: 120, lat: ~ 2^2 } hitcount: 9
+ { pid: 2033, prio: 120, lat: ~ 2^2 } hitcount: 5
+ { pid: 2030, prio: 9, lat: ~ 2^2 } hitcount: 335
+ { pid: 2030, prio: 120, lat: ~ 2^2 } hitcount: 10
+ { pid: 2032, prio: 120, lat: ~ 2^2 } hitcount: 1
+ { pid: 2035, prio: 120, lat: ~ 2^2 } hitcount: 2
+ { pid: 2031, prio: 9, lat: ~ 2^2 } hitcount: 176
+ { pid: 2028, prio: 120, lat: ~ 2^2 } hitcount: 15
+ { pid: 2033, prio: 9, lat: ~ 2^2 } hitcount: 91
+ { pid: 2032, prio: 9, lat: ~ 2^2 } hitcount: 125
+ { pid: 2029, prio: 120, lat: ~ 2^2 } hitcount: 4
+ { pid: 2031, prio: 120, lat: ~ 2^2 } hitcount: 3
+ { pid: 2029, prio: 120, lat: ~ 2^3 } hitcount: 2
+ { pid: 2035, prio: 9, lat: ~ 2^3 } hitcount: 41
+ { pid: 2030, prio: 120, lat: ~ 2^3 } hitcount: 1
+ { pid: 2032, prio: 9, lat: ~ 2^3 } hitcount: 32
+ { pid: 2031, prio: 9, lat: ~ 2^3 } hitcount: 44
+ { pid: 2034, prio: 9, lat: ~ 2^3 } hitcount: 40
+ { pid: 2030, prio: 9, lat: ~ 2^3 } hitcount: 29
+ { pid: 2033, prio: 9, lat: ~ 2^3 } hitcount: 31
+ { pid: 2029, prio: 9, lat: ~ 2^3 } hitcount: 31
+ { pid: 2028, prio: 120, lat: ~ 2^3 } hitcount: 18
+ { pid: 2031, prio: 120, lat: ~ 2^3 } hitcount: 2
+ { pid: 2028, prio: 120, lat: ~ 2^4 } hitcount: 1
+ { pid: 2029, prio: 9, lat: ~ 2^4 } hitcount: 4
+ { pid: 2031, prio: 120, lat: ~ 2^7 } hitcount: 1
+ { pid: 2032, prio: 120, lat: ~ 2^7 } hitcount: 1
+
+ Totals:
+ Hits: 2122
+ Entries: 30
+ Dropped: 0
+
+
+The latency values can also be grouped linearly by a given size with
+the ".buckets" modifier, which takes the group size as its argument (in
+this case, groups of 10)::
+
+ # echo 'hist:keys=pid,prio,lat.buckets=10:sort=lat' >> \
+ /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid,prio,lat.buckets=10:vals=hitcount:sort=lat.buckets=10:size=2048 [active]
+ #
+
+ { pid: 2067, prio: 9, lat: ~ 0-9 } hitcount: 220
+ { pid: 2068, prio: 9, lat: ~ 0-9 } hitcount: 157
+ { pid: 2070, prio: 9, lat: ~ 0-9 } hitcount: 100
+ { pid: 2067, prio: 120, lat: ~ 0-9 } hitcount: 6
+ { pid: 2065, prio: 120, lat: ~ 0-9 } hitcount: 2
+ { pid: 2066, prio: 120, lat: ~ 0-9 } hitcount: 2
+ { pid: 2069, prio: 9, lat: ~ 0-9 } hitcount: 122
+ { pid: 2069, prio: 120, lat: ~ 0-9 } hitcount: 8
+ { pid: 2070, prio: 120, lat: ~ 0-9 } hitcount: 1
+ { pid: 2068, prio: 120, lat: ~ 0-9 } hitcount: 7
+ { pid: 2066, prio: 9, lat: ~ 0-9 } hitcount: 365
+ { pid: 2064, prio: 120, lat: ~ 0-9 } hitcount: 35
+ { pid: 2065, prio: 9, lat: ~ 0-9 } hitcount: 998
+ { pid: 2071, prio: 9, lat: ~ 0-9 } hitcount: 85
+ { pid: 2065, prio: 9, lat: ~ 10-19 } hitcount: 2
+ { pid: 2064, prio: 120, lat: ~ 10-19 } hitcount: 2
+
+ Totals:
+ Hits: 2112
+ Entries: 16
+ Dropped: 0
+
2.2.3 Hist trigger 'handlers' and 'actions'
-------------------------------------------
diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 5739349649c8..de94b499b0bc 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -76,8 +76,13 @@ in /sys/kernel/tracing:
- tracing_cpumask - the CPUs to move the hwlat thread across
- hwlat_detector/width - specified amount of time to spin within window (usecs)
- hwlat_detector/window - amount of time between (width) runs (usecs)
+ - hwlat_detector/mode - the thread mode
-The hwlat detector's kernel thread will migrate across each CPU specified in
-tracing_cpumask between each window. To limit the migration, either modify
-tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
-affinity directly, and the migration will stop.
+By default, a single hwlat detector kernel thread will migrate across each
+CPU specified in tracing_cpumask at the beginning of a new window, in a
+round-robin fashion. This behavior can be changed by changing the thread
+mode; the available options are:
+
+ - none: do not force migration
+ - round-robin: migrate across each CPU specified in cpumask [default]
+ - per-cpu: create one thread for each cpu in tracing_cpumask
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index fa9e1c730f6a..ea25a9220f92 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -9,6 +9,8 @@ Linux Tracing Technologies
tracepoint-analysis
ftrace
ftrace-uses
+ fprobe
+ kprobes
kprobetrace
uprobetracer
tracepoints
@@ -19,9 +21,16 @@ Linux Tracing Technologies
events-msr
mmiotrace
histogram
+ histogram-design
boottime-trace
hwlat_detector
+ osnoise-tracer
+ timerlat-tracer
intel_th
+ ring-buffer-design
stm
sys-t
coresight/index
+ user_events
+ rv/index
+ hisi-ptt
diff --git a/Documentation/trace/intel_th.rst b/Documentation/trace/intel_th.rst
index 70b7126eaeeb..b31818d5f6c5 100644
--- a/Documentation/trace/intel_th.rst
+++ b/Documentation/trace/intel_th.rst
@@ -58,7 +58,7 @@ Bus and Subdevices
For each Intel TH device in the system a bus of its own is
created and assigned an id number that reflects the order in which TH
-devices were emumerated. All TH subdevices (devices on intel_th bus)
+devices were enumerated. All TH subdevices (devices on intel_th bus)
begin with this id: 0-gth, 0-msc0, 0-msc1, 0-pti, 0-sth, which is
followed by device's name and an optional index.
diff --git a/Documentation/trace/kprobes.rst b/Documentation/trace/kprobes.rst
new file mode 100644
index 000000000000..48cf778a2468
--- /dev/null
+++ b/Documentation/trace/kprobes.rst
@@ -0,0 +1,789 @@
+=======================
+Kernel Probes (Kprobes)
+=======================
+
+:Author: Jim Keniston <jkenisto@us.ibm.com>
+:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+:Author: Masami Hiramatsu <mhiramat@redhat.com>
+
+.. CONTENTS
+
+ 1. Concepts: Kprobes, and Return Probes
+ 2. Architectures Supported
+ 3. Configuring Kprobes
+ 4. API Reference
+ 5. Kprobes Features and Limitations
+ 6. Probe Overhead
+ 7. TODO
+ 8. Kprobes Example
+ 9. Kretprobes Example
+ 10. Deprecated Features
+ Appendix A: The kprobes debugfs interface
+ Appendix B: The kprobes sysctl interface
+ Appendix C: References
+
+Concepts: Kprobes and Return Probes
+=========================================
+
+Kprobes enables you to dynamically break into any kernel routine and
+collect debugging and performance information non-disruptively. You
+can trap at almost any kernel code address [1]_, specifying a handler
+routine to be invoked when the breakpoint is hit.
+
+.. [1] some parts of the kernel code can not be trapped, see
+ :ref:`kprobes_blacklist`)
+
+There are currently two types of probes: kprobes, and kretprobes
+(also called return probes). A kprobe can be inserted on virtually
+any instruction in the kernel. A return probe fires when a specified
+function returns.
+
+In the typical case, Kprobes-based instrumentation is packaged as
+a kernel module. The module's init function installs ("registers")
+one or more probes, and the exit function unregisters them. A
+registration function such as register_kprobe() specifies where
+the probe is to be inserted and what handler is to be called when
+the probe is hit.
+
+There are also ``register_/unregister_*probes()`` functions for batch
+registration/unregistration of a group of ``*probes``. These functions
+can speed up the unregistration process when you have to unregister
+a lot of probes at once.
+
+The next four subsections explain how the different types of
+probes work and how jump optimization works. They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe. But if you're in a hurry to start using Kprobes, you
+can skip ahead to :ref:`kprobes_archs_supported`.
+
+How Does a Kprobe Work?
+-----------------------
+
+When a kprobe is registered, Kprobes makes a copy of the probed
+instruction and replaces the first byte(s) of the probed instruction
+with a breakpoint instruction (e.g., int3 on i386 and x86_64).
+
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+registers are saved, and control passes to Kprobes via the
+notifier_call_chain mechanism. Kprobes executes the "pre_handler"
+associated with the kprobe, passing the handler the addresses of the
+kprobe struct and the saved registers.
+
+Next, Kprobes single-steps its copy of the probed instruction.
+(It would be simpler to single-step the actual instruction in place,
+but then Kprobes would have to temporarily remove the breakpoint
+instruction. This would open a small time window when another CPU
+could sail right past the probepoint.)
+
+After the instruction is single-stepped, Kprobes executes the
+"post_handler," if any, that is associated with the kprobe.
+Execution then continues with the instruction following the probepoint.
+
+Changing Execution Path
+-----------------------
+
+Since kprobes can probe running kernel code, it can change the
+register set, including the instruction pointer. This operation
+requires the utmost care, such as keeping the stack frame and
+recovering the execution path. Since it operates on a running kernel
+and needs deep knowledge of computer architecture and concurrent
+computing, you can easily shoot yourself in the foot.
+
+If you change the instruction pointer (and set up other related
+registers) in pre_handler, you must return !0 so that kprobes stops
+single stepping and just returns to the given address.
+This also means post_handler should not be called anymore.
+
+Note that this operation may be harder on some architectures which use
+a TOC (Table of Contents) for function calls, since you have to set up a
+new TOC for your function in your module, and restore the old one after
+returning from it.
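+
+The following is only an illustrative, x86-specific sketch of such a
+pre_handler; the replacement routine (my_fixed_routine) is a made-up
+name, and (as described in the jump optimization section below) the
+probe must not be optimized for the regs->ip change to take effect::
+
+  #include <linux/kprobes.h>
+  #include <linux/ptrace.h>
+
+  /* hypothetical replacement; must match the probed function's ABI */
+  static long my_fixed_routine(void);
+
+  static int divert_pre_handler(struct kprobe *p, struct pt_regs *regs)
+  {
+          /* x86: point the instruction pointer at the replacement */
+          regs->ip = (unsigned long)my_fixed_routine;
+
+          /*
+           * Returning !0 tells Kprobes to skip single-stepping the
+           * original instruction and to skip the post_handler.
+           */
+          return 1;
+  }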
+
+Return Probes
+-------------
+
+How Does a Return Probe Work?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When you call register_kretprobe(), Kprobes establishes a kprobe at
+the entry to the function. When the probed function is called and this
+probe is hit, Kprobes saves a copy of the return address, and replaces
+the return address with the address of a "trampoline." The trampoline
+is an arbitrary piece of code -- typically just a nop instruction.
+At boot time, Kprobes registers a kprobe at the trampoline.
+
+When the probed function executes its return instruction, control
+passes to the trampoline and that probe is hit. Kprobes' trampoline
+handler calls the user-specified return handler associated with the
+kretprobe, then sets the saved instruction pointer to the saved return
+address, and that's where execution resumes upon return from the trap.
+
+While the probed function is executing, its return address is
+stored in an object of type kretprobe_instance. Before calling
+register_kretprobe(), the user sets the maxactive field of the
+kretprobe struct to specify how many instances of the specified
+function can be probed simultaneously. register_kretprobe()
+pre-allocates the indicated number of kretprobe_instance objects.
+
+For example, if the function is non-recursive and is called with a
+spinlock held, maxactive = 1 should be enough. If the function is
+non-recursive and can never relinquish the CPU (e.g., via a semaphore
+or preemption), NR_CPUS should be enough. If maxactive <= 0, it is
+set to a default value. If CONFIG_PREEMPT is enabled, the default
+is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.
+
+It's not a disaster if you set maxactive too low; you'll just miss
+some probes. In the kretprobe struct, the nmissed field is set to
+zero when the return probe is registered, and is incremented every
+time the probed function is entered but there is no kretprobe_instance
+object available for establishing the return probe.
+
+Kretprobe entry-handler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Kretprobes also provides an optional user-specified handler which runs
+on function entry. This handler is specified by setting the entry_handler
+field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the
+function entry is hit, the user-defined entry_handler, if any, is invoked.
+If the entry_handler returns 0 (success) then a corresponding return handler
+is guaranteed to be called upon function return. If the entry_handler
+returns a non-zero error then Kprobes leaves the return address as is, and
+the kretprobe has no further effect for that particular function instance.
+
+Multiple entry and return handler invocations are matched using the unique
+kretprobe_instance object associated with them. Additionally, a user
+may also specify per return-instance private data to be part of each
+kretprobe_instance object. This is especially useful when sharing private
+data between corresponding user entry and return handlers. The size of each
+private data object can be specified at kretprobe registration time by
+setting the data_size field of the kretprobe struct. This data can be
+accessed through the data field of each kretprobe_instance object.
+
+If the probed function is entered but no kretprobe_instance object is
+available, then in addition to incrementing the nmissed count, the user
+entry_handler invocation is also skipped.
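+
+For a complete, working example see samples/kprobes/kretprobe_example.c.
+The sketch below only illustrates how entry_handler, handler, data_size
+and the per-instance data area fit together; the probed symbol
+(kernel_clone) and the maxactive value are assumptions::
+
+  #include <linux/kprobes.h>
+  #include <linux/ktime.h>
+  #include <linux/module.h>
+
+  struct my_data {
+          ktime_t entry_stamp;            /* filled in on function entry */
+  };
+
+  static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
+  {
+          struct my_data *data = (struct my_data *)ri->data;
+
+          data->entry_stamp = ktime_get();
+          return 0;       /* 0: the return handler will be called */
+  }
+
+  static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
+  {
+          struct my_data *data = (struct my_data *)ri->data;
+          s64 delta = ktime_to_ns(ktime_sub(ktime_get(), data->entry_stamp));
+
+          pr_info("returned %lu after %lld ns\n",
+                  regs_return_value(regs), delta);
+          return 0;
+  }
+
+  static struct kretprobe my_kretprobe = {
+          .entry_handler  = entry_handler,
+          .handler        = ret_handler,
+          .data_size      = sizeof(struct my_data),
+          .maxactive      = 20,           /* tune for the probed function */
+          .kp.symbol_name = "kernel_clone",
+  };
+
+Registration and removal are then done with register_kretprobe() and
+unregister_kretprobe(), described in the API reference below.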
+
+.. _kprobes_jump_optimization:
+
+How Does Jump Optimization Work?
+--------------------------------
+
+If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
+is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
+the "debug.kprobes_optimization" kernel parameter is set to 1 (see
+sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
+instruction instead of a breakpoint instruction at each probepoint.
+
+Init a Kprobe
+^^^^^^^^^^^^^
+
+When a probe is registered, before attempting this optimization,
+Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
+address. So, even if it's not possible to optimize this particular
+probepoint, there'll be a probe there.
+
+Safety Check
+^^^^^^^^^^^^
+
+Before optimizing a probe, Kprobes performs the following safety checks:
+
+- Kprobes verifies that the region that will be replaced by the jump
+ instruction (the "optimized region") lies entirely within one function.
+ (A jump instruction is multiple bytes, and so may overlay multiple
+ instructions.)
+
+- Kprobes analyzes the entire function and verifies that there is no
+ jump into the optimized region. Specifically:
+
+ - the function contains no indirect jump;
+ - the function contains no instruction that causes an exception (since
+ the fixup code triggered by the exception could jump back into the
+ optimized region -- Kprobes checks the exception tables to verify this);
+ - there is no near jump to the optimized region (other than to the first
+ byte).
+
+- For each instruction in the optimized region, Kprobes verifies that
+ the instruction can be executed out of line.
+
+Preparing Detour Buffer
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, Kprobes prepares a "detour" buffer, which contains the following
+instruction sequence:
+
+- code to push the CPU's registers (emulating a breakpoint trap)
+- a call to the trampoline code which calls user's probe handlers.
+- code to restore registers
+- the instructions from the optimized region
+- a jump back to the original execution path.
+
+Pre-optimization
+^^^^^^^^^^^^^^^^
+
+After preparing the detour buffer, Kprobes verifies that none of the
+following situations exist:
+
+- The probe has a post_handler.
+- Other instructions in the optimized region are probed.
+- The probe is disabled.
+
+In any of the above cases, Kprobes won't start optimizing the probe.
+Since these are temporary situations, Kprobes tries to start
+optimizing it again if the situation is changed.
+
+If the kprobe can be optimized, Kprobes enqueues the kprobe to an
+optimizing list, and kicks the kprobe-optimizer workqueue to optimize
+it. If the to-be-optimized probepoint is hit before being optimized,
+Kprobes returns control to the original instruction path by setting
+the CPU's instruction pointer to the copied code in the detour buffer
+-- thus at least avoiding the single-step.
+
+Optimization
+^^^^^^^^^^^^
+
+The Kprobe-optimizer doesn't insert the jump instruction immediately;
+rather, it calls synchronize_rcu() for safety first, because it's
+possible for a CPU to be interrupted in the middle of executing the
+optimized region [3]_. As you know, synchronize_rcu() can ensure
+that all interruptions that were active when synchronize_rcu()
+was called are done, but only if CONFIG_PREEMPT=n. So, this version
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
+
+After that, the Kprobe-optimizer calls stop_machine() to replace
+the optimized region with a jump instruction to the detour buffer,
+using text_poke_smp().
+
+Unoptimization
+^^^^^^^^^^^^^^
+
+When an optimized kprobe is unregistered, disabled, or blocked by
+another kprobe, it will be unoptimized. If this happens before
+the optimization is complete, the kprobe is just dequeued from the
+optimized list. If the optimization has been done, the jump is
+replaced with the original code (except for an int3 breakpoint in
+the first byte) by using text_poke_smp().
+
+.. [3] Please imagine that the 2nd instruction is interrupted and then
+ the optimizer replaces the 2nd instruction with the jump *address*
+ while the interrupt handler is running. When the interrupt
+ returns to original address, there is no valid instruction,
+ and it causes an unexpected result.
+
+.. [4] This optimization-safety checking may be replaced with the
+ stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+ kernel.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel's execution
+path by changing regs->ip and returning 1. However, when the probe
+is optimized, that modification is ignored. Thus, if you want to
+tweak the kernel's execution path, you need to suppress optimization,
+using one of the following techniques:
+
+- Specify an empty function for the kprobe's post_handler.
+
+or
+
+- Execute 'sysctl -w debug.kprobes_optimization=n'
+
+.. _kprobes_blacklist:
+
+Blacklist
+---------
+
+Kprobes can probe most of the kernel except itself. This means
+that there are some functions where kprobes cannot probe. Probing
+(trapping) such functions can cause a recursive trap (e.g. double
+fault) or the nested probe handler may never be called.
+Kprobes manages such functions as a blacklist.
+If you want to add a function into the blacklist, you just need
+to (1) include linux/kprobes.h and (2) use the NOKPROBE_SYMBOL() macro
+to specify a blacklisted function.
+Kprobes checks the given probe address against the blacklist and
+rejects registering it if the given address is in the blacklist.
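+
+As a hedged sketch, blacklisting a function of your own (the function
+name here is made up) looks like this::
+
+  #include <linux/kprobes.h>
+
+  /* must never be probed: it runs where taking a trap would be fatal */
+  static int fragile_helper(void *arg)
+  {
+          /* ... */
+          return 0;
+  }
+  NOKPROBE_SYMBOL(fragile_helper);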
+
+.. _kprobes_archs_supported:
+
+Architectures Supported
+=======================
+
+Kprobes and return probes are implemented on the following
+architectures:
+
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
+- ppc64
+- ia64 (Does not support probes on instruction slot1.)
+- sparc64 (Return probes not yet implemented.)
+- arm
+- ppc
+- mips
+- s390
+- parisc
+
+Configuring Kprobes
+===================
+
+When configuring the kernel using make menuconfig/xconfig/oldconfig,
+ensure that CONFIG_KPROBES is set to "y"; look for "Kprobes" under
+"General architecture-dependent options".
+
+So that you can load and unload Kprobes-based instrumentation modules,
+make sure "Loadable module support" (CONFIG_MODULES) and "Module
+unloading" (CONFIG_MODULE_UNLOAD) are set to "y".
+
+Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL
+are set to "y", since kallsyms_lookup_name() is used by the in-kernel
+kprobe address resolution code.
+
+If you need to insert a probe in the middle of a function, you may find
+it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
+so you can use "objdump -d -l vmlinux" to see the source-to-object
+code mapping.
+
+API Reference
+=============
+
+The Kprobes API includes a "register" function and an "unregister"
+function for each type of probe. The API also includes "register_*probes"
+and "unregister_*probes" functions for (un)registering arrays of probes.
+Here are terse, mini-man-page specifications for these functions and
+the associated probe handlers that you'll write. See the files in the
+samples/kprobes/ sub-directory for examples.
+
+register_kprobe
+---------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobe(struct kprobe *kp);
+
+Sets a breakpoint at the address kp->addr. When the breakpoint is hit, Kprobes
+calls kp->pre_handler. After the probed instruction is single-stepped, Kprobes
+calls kp->post_handler. Any or all handlers can be NULL. If kp->flags has
+KPROBE_FLAG_DISABLED set, that kp will be registered but disabled, so its
+handlers aren't called until enable_kprobe(kp) is invoked.
+
+.. note::
+
+ 1. With the introduction of the "symbol_name" field to struct kprobe,
+ the probepoint address resolution will now be taken care of by the kernel.
+ The following will now work::
+
+ kp.symbol_name = "symbol_name";
+
+ (64-bit powerpc intricacies such as function descriptors are handled
+ transparently)
+
+ 2. Use the "offset" field of struct kprobe if the offset into the symbol
+ to install a probepoint is known. This field is used to calculate the
+ probepoint.
+
+ 3. Specify either the kprobe "symbol_name" OR the "addr". If both are
+ specified, kprobe registration will fail with -EINVAL.
+
+ 4. With CISC architectures (such as i386 and x86_64), the kprobes code
+ does not validate if the kprobe.addr is at an instruction boundary.
+ Use "offset" with caution.
+
+register_kprobe() returns 0 on success, or a negative errno otherwise.
+
+User's pre-handler (kp->pre_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+Called with p pointing to the kprobe associated with the breakpoint,
+and regs pointing to the struct containing the registers saved when
+the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
+
+User's post-handler (kp->post_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ void post_handler(struct kprobe *p, struct pt_regs *regs,
+ unsigned long flags);
+
+p and regs are as described for the pre_handler. flags always seems
+to be zero.
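+
+Putting these pieces together, a minimal registration sketch (closely
+modeled on samples/kprobes/kprobe_example.c; the probed symbol
+kernel_clone is just an assumed target) might look like::
+
+  #include <linux/kprobes.h>
+  #include <linux/module.h>
+
+  static int handler_pre(struct kprobe *p, struct pt_regs *regs)
+  {
+          pr_info("pre_handler: hit %s+0x%x\n", p->symbol_name, p->offset);
+          return 0;
+  }
+
+  static void handler_post(struct kprobe *p, struct pt_regs *regs,
+                           unsigned long flags)
+  {
+          pr_info("post_handler: single-step done\n");
+  }
+
+  static struct kprobe kp = {
+          .symbol_name  = "kernel_clone",         /* assumed target */
+          .pre_handler  = handler_pre,
+          .post_handler = handler_post,
+  };
+
+  static int __init kprobe_sketch_init(void)
+  {
+          int ret = register_kprobe(&kp);
+
+          if (ret < 0)
+                  pr_err("register_kprobe failed: %d\n", ret);
+          return ret;
+  }
+
+  static void __exit kprobe_sketch_exit(void)
+  {
+          unregister_kprobe(&kp);
+  }
+
+  module_init(kprobe_sketch_init);
+  module_exit(kprobe_sketch_exit);
+  MODULE_LICENSE("GPL");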
+
+register_kretprobe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kretprobe(struct kretprobe *rp);
+
+Establishes a return probe for the function whose address is
+rp->kp.addr. When that function returns, Kprobes calls rp->handler.
+You must set rp->maxactive appropriately before you call
+register_kretprobe(); see "How Does a Return Probe Work?" for details.
+
+register_kretprobe() returns 0 on success, or a negative errno
+otherwise.
+
+User's return-probe handler (rp->handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int kretprobe_handler(struct kretprobe_instance *ri,
+ struct pt_regs *regs);
+
+regs is as described for kprobe.pre_handler. ri points to the
+kretprobe_instance object, of which the following fields may be
+of interest:
+
+- ret_addr: the return address
+- rp: points to the corresponding kretprobe object
+- task: points to the corresponding task struct
+- data: points to per return-instance private data; see "Kretprobe
+ entry-handler" for details.
+
+The regs_return_value(regs) macro provides a simple abstraction to
+extract the return value from the appropriate register as defined by
+the architecture's ABI.
+
+The handler's return value is currently ignored.
+
+unregister_*probe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobe(struct kprobe *kp);
+ void unregister_kretprobe(struct kretprobe *rp);
+
+Removes the specified probe. The unregister function can be called
+at any time after the probe has been registered.
+
+.. note::
+
+ If the functions find an incorrect probe (ex. an unregistered probe),
+ they clear the addr field of the probe.
+
+register_*probes
+----------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobes(struct kprobe **kps, int num);
+ int register_kretprobes(struct kretprobe **rps, int num);
+
+Registers each of the num probes in the specified array. If any
+error occurs during registration, all probes in the array, up to
+the bad probe, are safely unregistered before the register_*probes
+function returns.
+
+- kps/rps: an array of pointers to ``*probe`` data structures
+- num: the number of the array entries.
+
+.. note::
+
+ You have to allocate(or define) an array of pointers and set all
+ of the array entries before using these functions.
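+
+A short sketch of batch registration; the probed symbols are assumed and
+the handlers are left NULL, which Kprobes allows::
+
+  #include <linux/kernel.h>
+  #include <linux/kprobes.h>
+  #include <linux/module.h>
+
+  static struct kprobe kp_open  = { .symbol_name = "do_sys_open" };
+  static struct kprobe kp_clone = { .symbol_name = "kernel_clone" };
+
+  /* every entry must be set before registering */
+  static struct kprobe *my_probes[] = { &kp_open, &kp_clone };
+
+  /* module_init()/module_exit() wiring as in the earlier sketch */
+  static int __init batch_init(void)
+  {
+          return register_kprobes(my_probes, ARRAY_SIZE(my_probes));
+  }
+
+  static void __exit batch_exit(void)
+  {
+          unregister_kprobes(my_probes, ARRAY_SIZE(my_probes));
+  }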
+
+unregister_*probes
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobes(struct kprobe **kps, int num);
+ void unregister_kretprobes(struct kretprobe **rps, int num);
+
+Removes each of the num probes in the specified array at once.
+
+.. note::
+
+ If the functions find some incorrect probes (ex. unregistered
+ probes) in the specified array, they clear the addr field of those
+ incorrect probes. However, other probes in the array are
+ unregistered correctly.
+
+disable_*probe
+--------------
+
+::
+
+ #include <linux/kprobes.h>
+ int disable_kprobe(struct kprobe *kp);
+ int disable_kretprobe(struct kretprobe *rp);
+
+Temporarily disables the specified ``*probe``. You can enable it again by using
+enable_*probe(). You must specify the probe which has been registered.
+
+enable_*probe
+-------------
+
+::
+
+ #include <linux/kprobes.h>
+ int enable_kprobe(struct kprobe *kp);
+ int enable_kretprobe(struct kretprobe *rp);
+
+Enables ``*probe`` which has been disabled by disable_*probe(). You must specify
+the probe which has been registered.
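+
+A hedged usage sketch, assuming kp is a kprobe that was registered
+earlier in the same module::
+
+  #include <linux/kprobes.h>
+
+  extern struct kprobe kp;              /* registered elsewhere */
+
+  static void quiesce_probe_temporarily(void)
+  {
+          if (disable_kprobe(&kp))      /* returns 0 on success */
+                  return;
+
+          /* ... work that should not be observed by the probe ... */
+
+          enable_kprobe(&kp);           /* re-arm the still-registered probe */
+  }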
+
+Kprobes Features and Limitations
+================================
+
+Kprobes allows multiple probes at the same address. Also,
+a probepoint for which there is a post_handler cannot be optimized.
+So if you install a kprobe with a post_handler at an optimized
+probepoint, the probepoint will be unoptimized automatically.
+
+In general, you can install a probe anywhere in the kernel.
+In particular, you can probe interrupt handlers. Known exceptions
+are discussed in this section.
+
+The register_*probe functions will return -EINVAL if you attempt
+to install a probe in the code that implements Kprobes (mostly
+kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such
+as do_page_fault and notifier_call_chain).
+
+If you install a probe in an inline-able function, Kprobes makes
+no attempt to chase down all inline instances of the function and
+install probes there. gcc may inline a function without being asked,
+so keep this in mind if you're not seeing the probe hits you expect.
+
+A probe handler can modify the environment of the probed function
+-- e.g., by modifying kernel data structures, or by modifying the
+contents of the pt_regs struct (which are restored to the registers
+upon return from the breakpoint). So Kprobes can be used, for example,
+to install a bug fix or to inject faults for testing. Kprobes, of
+course, has no way to distinguish the deliberately injected faults
+from the accidental ones. Don't drink and probe.
+
+Kprobes makes no attempt to prevent probe handlers from stepping on
+each other -- e.g., probing printk() and then calling printk() from a
+probe handler. If a probe handler hits a probe, that second probe's
+handlers won't be run in that instance, and the kprobe.nmissed member
+of the second probe will be incremented.
+
+As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
+the same handler) may run concurrently on different CPUs.
+
+Kprobes does not use mutexes or allocate memory except during
+registration and unregistration.
+
+Probe handlers are run with preemption disabled or interrupts disabled,
+depending on the architecture and optimization state (e.g.,
+kretprobe handlers and optimized kprobe handlers run without interrupts
+disabled on x86/x86-64). In any case, your handler should not yield
+the CPU (e.g., by attempting to acquire a semaphore, or by waiting for I/O).
+
+Since a return probe is implemented by replacing the return
+address with the trampoline's address, stack backtraces and calls
+to __builtin_return_address() will typically yield the trampoline's
+address instead of the real return address for kretprobed functions.
+(As far as we can tell, __builtin_return_address() is used only
+for instrumentation and error reporting.)
+
+If the number of times a function is called does not match the number
+of times it returns, registering a return probe on that function may
+produce undesirable results. In such a case, a line such as the
+following gets printed::
+
+  kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
+
+With this information, one will be able to correlate the exact instance
+of the kretprobe that caused the problem. We have the do_exit() case
+covered. do_execve() and do_fork() are not an issue. We're unaware of
+other specific cases where this could be a problem.
+
+If, upon entry to or exit from a function, the CPU is running on
+a stack other than that of the current task, registering a return
+probe on that function may produce undesirable results. For this
+reason, Kprobes doesn't support return probes (or kprobes)
+on the x86_64 version of __switch_to(); the registration functions
+return -EINVAL.
+
+On x86/x86-64, since the Jump Optimization of Kprobes modifies
+instructions widely, there are some limitations to optimization. To
+explain it, we introduce some terminology. Imagine a 3-instruction
+sequence consisting of two 2-byte instructions and one 3-byte
+instruction.
+
+::
+
+ IA
+ |
+ [-2][-1][0][1][2][3][4][5][6][7]
+ [ins1][ins2][ ins3 ]
+ [<- DCR ->]
+ [<- JTPR ->]
+
+ ins1: 1st Instruction
+ ins2: 2nd Instruction
+ ins3: 3rd Instruction
+ IA: Insertion Address
+ JTPR: Jump Target Prohibition Region
+ DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the kprobe, because the bytes in DCR are replaced by
+a 5-byte jump instruction. So there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include a call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border between functions.
+
+Anyway, these limitations are checked by the in-kernel instruction
+decoder, so you don't need to worry about that.
+
+Probe Overhead
+==============
+
+On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
+microseconds to process. Specifically, a benchmark that hits the same
+probepoint repeatedly, firing a simple handler each time, reports 1-2
+million hits per second, depending on the architecture. A return-probe
+hit typically takes 50-75% longer than a kprobe hit.
+When you have a return probe set on a function, adding a kprobe at
+the entry to that function adds essentially no overhead.
+
+Here are sample overhead figures (in usec) for different architectures::
+
+ k = kprobe; r = return probe; kr = kprobe + return probe
+ on same function
+
+ i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
+ k = 0.57 usec; r = 0.92; kr = 0.99
+
+ x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
+ k = 0.49 usec; r = 0.80; kr = 0.82
+
+ ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
+ k = 0.77 usec; r = 1.26; kr = 1.45
+
+Optimized Probe Overhead
+------------------------
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures::
+
+ k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
+ r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+ i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
+
+ x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
+
+TODO
+====
+
+a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
+ programming interface for probe-based instrumentation. Try it out.
+b. Kernel return probes for sparc64.
+c. Support for other architectures.
+d. User-space probes.
+e. Watchpoint probes (which fire on data references).
+
+Kprobes Example
+===============
+
+See samples/kprobes/kprobe_example.c
+
+Kretprobes Example
+==================
+
+See samples/kprobes/kretprobe_example.c
+
+Deprecated Features
+===================
+
+Jprobes is now a deprecated feature. People who are depending on it should
+migrate to other tracing features or use older kernels. Please consider
+migrating your tool to one of the following options:
+
+- Use trace-event to trace target function with arguments.
+
+ trace-event is a low-overhead (and almost no visible overhead if it
+ is off) statically defined event interface. You can define new events
+ and trace them via ftrace or any other tracing tool.
+
+ See the following urls:
+
+ - https://lwn.net/Articles/379903/
+ - https://lwn.net/Articles/381064/
+ - https://lwn.net/Articles/383362/
+
+- Use ftrace dynamic events (kprobe event) with perf-probe.
+
+ If you build your kernel with debug info (CONFIG_DEBUG_INFO=y), you can
+ find out which register/stack slot is assigned to which local variable or
+ argument by using perf-probe, and set up a new event to trace it.
+
+ See the following documents:
+
+ - Documentation/trace/kprobetrace.rst
+ - Documentation/trace/events.rst
+ - tools/perf/Documentation/perf-probe.txt
+
+
+The kprobes debugfs interface
+=============================
+
+
+With recent kernels (> 2.6.20) the list of registered kprobes is visible
+under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is
+mounted at /sys/kernel/debug).
+
+/sys/kernel/debug/kprobes/list: Lists all registered probes on the system::
+
+ c015d71a k vfs_read+0x0
+ c03dedc5 r tcp_v4_rcv+0x0
+
+The first column provides the kernel address where the probe is inserted.
+The second column identifies the type of probe (k - kprobe and r - kretprobe)
+while the third column specifies the symbol+offset of the probe.
+If the probed function belongs to a module, the module name is also
+specified. Following columns show probe status. If the probe is on
+a virtual address that is no longer valid (module init sections, module
+virtual addresses that correspond to modules that've been unloaded),
+such probes are marked with [GONE]. If the probe is temporarily disabled,
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED]. If the probe is ftrace-based, it is marked with
+[FTRACE].
+
+/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
+
+Provides a knob to globally and forcibly turn registered kprobes ON or OFF.
+By default, all kprobes are enabled. By echoing "0" to this file, all
+registered probes will be disarmed, until a "1" is echoed to this
+file. Note that this knob just disarms and arms all kprobes and doesn't
+change each probe's disabling state. This means that disabled kprobes (marked
+[DISABLED]) will not be enabled if you turn ON all kprobes by this knob.
+
+
+The kprobes sysctl interface
+============================
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
+a knob to globally and forcibly turn jump optimization (see section
+:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization
+is allowed (ON). If you echo "0" to this file or set
+"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
+unoptimized, and any new probes registered after that will not be optimized.
+
+Note that this knob *changes* the optimized state. This means that optimized
+probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
+removed). If the knob is turned on, they will be optimized again.
+
+References
+==========
+
+For additional information on Kprobes, refer to the following URLs:
+
+- https://lwn.net/Articles/132196/
+- https://www.kernel.org/doc/ols/2006/ols2006v2-pages-109-124.pdf
+
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index cc4c5fc313df..4274cc6a2f94 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -28,19 +28,21 @@ Synopsis of kprobe_events
-------------------------
::
- p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
- r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
- -:[GRP/]EVENT : Clear a probe
+ p[:[GRP/][EVENT]] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
+ r[MAXACTIVE][:[GRP/][EVENT]] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
+ p[:[GRP/][EVENT]] [MOD:]SYM[+0]%return [FETCHARGS] : Set a return probe
+ -:[GRP/][EVENT] : Clear a probe
GRP : Group name. If omitted, use "kprobes" for it.
EVENT : Event name. If omitted, the event name is generated
based on SYM+offs or MEMADDR.
MOD : Module name which has given SYM.
SYM[+offs] : Symbol+offset where the probe is inserted.
+ SYM%return : Return address of the symbol
MEMADDR : Address where the probe is inserted.
MAXACTIVE : Maximum number of instances of the specified function that
can be probed simultaneously, or 0 for the default value
- as defined in Documentation/kprobes.txt section 1.3.1.
+ as defined in Documentation/trace/kprobes.rst section 1.3.1.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
diff --git a/Documentation/trace/mmiotrace.rst b/Documentation/trace/mmiotrace.rst
index 5116e8ca27b4..fed13eaead89 100644
--- a/Documentation/trace/mmiotrace.rst
+++ b/Documentation/trace/mmiotrace.rst
@@ -5,7 +5,7 @@ In-kernel memory-mapped I/O tracing
Home page and links to optional user space tools:
- http://nouveau.freedesktop.org/wiki/MmioTrace
+ https://nouveau.freedesktop.org/wiki/MmioTrace
MMIO tracing was originally developed by Intel around 2003 for their Fault
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
new file mode 100644
index 000000000000..963def9f97c6
--- /dev/null
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -0,0 +1,152 @@
+==============
+OSNOISE Tracer
+==============
+
+In the context of high-performance computing (HPC), the Operating System
+Noise (*osnoise*) refers to the interference experienced by an application
+due to activities inside the operating system. In the context of Linux,
+NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
+system. Moreover, hardware-related jobs can also cause noise, for example,
+via SMIs.
+
+hwlat_detector is one of the tools used to identify the most complex
+source of noise: *hardware noise*.
+
+In a nutshell, the hwlat_detector creates a thread that runs
+periodically for a given period. At the beginning of a period, the thread
+disables interrupts and starts sampling. While running, the hwlatd
+thread reads the time in a loop. As interrupts are disabled, threads,
+IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, any
+gap between two consecutive reads of the time must be caused either by
+an NMI or by the hardware itself. At the end of the period, hwlatd enables
+interrupts and reports the max observed gap between the reads. It also
+prints an NMI occurrence counter. If the output does not report NMI
+executions, the user can conclude that the hardware is the culprit for
+the latency. hwlat detects NMI execution by observing
+the entry and exit of an NMI.
+
+The osnoise tracer leverages the hwlat_detector by running a
+similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
+all the sources of *osnoise* to occur during its execution. Using the
+same approach as hwlat, osnoise takes note of the entry and exit point of
+any source of interference, incrementing a per-cpu interference counter.
+The osnoise tracer also saves an interference counter for each source of
+interference. The interference counter for NMIs, IRQs, SoftIRQs, and
+threads is incremented any time the tool observes one of these
+interferences' entry events. When a noise happens without any
+interference from the operating system level, the hardware noise counter
+is incremented, pointing to a hardware-related noise. In this way,
+osnoise can account for any source of interference. At the end of the
+period, the osnoise tracer prints the sum of all noise, the max single
+noise, the percentage of CPU available for the thread, and the counters
+for the noise sources.
+
+Usage
+-----
+
+Write the ASCII text "osnoise" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo osnoise > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: osnoise
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth MAX
+ # || / SINGLE Interference counters:
+ # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
+ # | | | |||| | | | | | | | | | |
+ <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
+ <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
+ <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
+ <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
+ <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
+ <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
+ <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
+ <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19
+
+In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
+tracer prints a message at the end of each period for each CPU that is
+running an osnoise/ thread. The osnoise specific fields report:
+
+ - The RUNTIME IN US reports the amount of time in microseconds that
+ the osnoise thread kept looping reading the time.
+ - The NOISE IN US reports the sum of noise in microseconds observed
+ by the osnoise tracer during the associated runtime.
+ - The % OF CPU AVAILABLE reports the percentage of CPU available for
+ the osnoise thread during the runtime window.
+ - The MAX SINGLE NOISE IN US reports the maximum single noise observed
+ during the runtime window.
+ - The Interference counters display how many times each of the
+ respective interference sources occurred during the runtime window.
+
+Note that the example above shows a high number of HW noise samples.
+The reason is that this sample was taken on a virtual machine, where
+the host's interference is detected as hardware interference.
+
+Tracer options
+---------------------
+
+The tracer has a set of options inside the osnoise directory, they are:
+
+ - osnoise/cpus: CPUs on which an osnoise thread will execute.
+ - osnoise/period_us: the period of the osnoise thread.
+ - osnoise/runtime_us: how long an osnoise thread will look for noise.
+ - osnoise/stop_tracing_us: stop the system tracing if a single noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - osnoise/stop_tracing_total_us: stop the system tracing if total noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - tracing_threshold: the minimum delta between two time() reads to be
+ considered as noise, in us. When set to 0, the default value will
+ be used, which is currently 5 us.
+
+Additional Tracing
+------------------
+
+In addition to the tracer, a set of tracepoints was added to
+facilitate the identification of the osnoise source.
+
+ - osnoise:sample_threshold: printed anytime a noise is higher than
+ the configurable tolerance_ns.
+ - osnoise:nmi_noise: noise from NMI, including the duration.
+ - osnoise:irq_noise: noise from an IRQ, including the duration.
+ - osnoise:softirq_noise: noise from a SoftIRQ, including the
+ duration.
+ - osnoise:thread_noise: noise from a thread, including the duration.
+
+Note that all the values are *net values*. For example, if another
+thread preempts the osnoise thread while osnoise is running, a
+thread_noise duration starts at that point. Then, an IRQ takes place,
+preempting the thread_noise, starting an irq_noise. When the IRQ ends
+its execution, its duration is computed, and this duration is subtracted
+from the thread_noise, in such a way as to avoid the double accounting
+of the IRQ execution. This logic is valid for all sources of noise.
+
+Here is one example of the usage of these tracepoints::
+
+ osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
+ osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
+ migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
+ osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8812 ns interferences 2
+
+In this example, a noise sample of 8 microseconds was reported in the last
+line, pointing to two interferences. Looking backward in the trace, the
+two previous entries were about the migration thread running after a
+timer IRQ execution. The first event is not part of the noise because
+it took place one millisecond before.
+
+It is worth noticing that the sum of the durations reported in the
+tracepoints is smaller than the eight us reported in the sample_threshold.
+The reason lies in the overhead of the entry and exit code that runs
+before and after any interference execution. This justifies the dual
+approach: measuring (via the osnoise thread) and tracing.
diff --git a/Documentation/trace/postprocess/decode_msr.py b/Documentation/trace/postprocess/decode_msr.py
index 0ab40e0db580..aa9cc7abd5c2 100644
--- a/Documentation/trace/postprocess/decode_msr.py
+++ b/Documentation/trace/postprocess/decode_msr.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
# add symbolic names to read_msr / write_msr in trace
# decode_msr msr-index.h < trace
import sys
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
index 0a120aae33ce..b9b7d80c2f9d 100644
--- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# This is a POC (proof of concept or piece of crap, take your pick) for reading the
# text representation of trace output related to page allocation. It makes an attempt
# to extract some high-level information on what is going on. The accuracy of the parser
diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 995da15b16ca..2f4e39875fb3 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -1,4 +1,4 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# This is a POC for reading the text representation of trace output related to
# page reclaim. It makes an attempt to extract some high-level information on
# what is going on. The accuracy of the parser may vary
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.rst
index 2d53c6f25b91..c5d77fcbb5bc 100644
--- a/Documentation/trace/ring-buffer-design.txt
+++ b/Documentation/trace/ring-buffer-design.rst
@@ -1,11 +1,15 @@
- Lockless Ring Buffer Design
- ===========================
+.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
+
+===========================
+Lockless Ring Buffer Design
+===========================
Copyright 2009 Red Hat Inc.
- Author: Steven Rostedt <srostedt@redhat.com>
- License: The GNU Free Documentation License, Version 1.2
- (dual licensed under the GPL v2)
-Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
+
+:Author: Steven Rostedt <srostedt@redhat.com>
+:License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+:Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
and Frederic Weisbecker.
@@ -14,37 +18,50 @@ Written for: 2.6.31
Terminology used in this Document
---------------------------------
-tail - where new writes happen in the ring buffer.
+tail
+ - where new writes happen in the ring buffer.
-head - where new reads happen in the ring buffer.
+head
+ - where new reads happen in the ring buffer.
-producer - the task that writes into the ring buffer (same as writer)
+producer
+ - the task that writes into the ring buffer (same as writer)
-writer - same as producer
+writer
+ - same as producer
-consumer - the task that reads from the buffer (same as reader)
+consumer
+ - the task that reads from the buffer (same as reader)
-reader - same as consumer.
+reader
+ - same as consumer.
-reader_page - A page outside the ring buffer used solely (for the most part)
- by the reader.
+reader_page
+ - A page outside the ring buffer used solely (for the most part)
+ by the reader.
-head_page - a pointer to the page that the reader will use next
+head_page
+ - a pointer to the page that the reader will use next
-tail_page - a pointer to the page that will be written to next
+tail_page
+ - a pointer to the page that will be written to next
-commit_page - a pointer to the page with the last finished non-nested write.
+commit_page
+ - a pointer to the page with the last finished non-nested write.
-cmpxchg - hardware-assisted atomic transaction that performs the following:
+cmpxchg
+ - hardware-assisted atomic transaction that performs the following::
- A = B if previous A == C
+ A = B if previous A == C
- R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
- current A is equal to C, and we put the old (current) A into R
+ R = cmpxchg(A, C, B) is saying that we replace A with B if and only
+ if current A is equal to C, and we put the old (current)
+ A into R
- R gets the previous A regardless if A is updated with B or not.
+ R gets the previous A regardless if A is updated with B or not.
- To see if the update was successful a compare of R == C may be used.
+ To see if the update was successful a compare of ``R == C``
+ may be used.
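+
+A minimal C sketch of the cmpxchg semantics above, for illustration only
+(the real cmpxchg is a single atomic hardware instruction, while this
+open-coded sketch is not atomic)::
+
+  unsigned long my_cmpxchg(unsigned long *A, unsigned long C, unsigned long B)
+  {
+          unsigned long R = *A;          /* R gets the previous A */
+
+          if (R == C)                    /* update A only if it still equals C */
+                  *A = B;
+
+          return R;                      /* R == C means the update happened */
+  }
+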
The Generic Ring Buffer
-----------------------
@@ -64,7 +81,7 @@ No two writers can write at the same time (on the same per-cpu buffer),
but a writer may interrupt another writer, but it must finish writing
before the previous writer may continue. This is very important to the
algorithm. The writers act like a "stack". The way interrupts works
-enforces this behavior.
+enforces this behavior::
writer1 start
@@ -115,6 +132,8 @@ A sample of how the reader page is swapped: Note this does not
show the head page in the buffer, it is for demonstrating a swap
only.
+::
+
+------+
|reader| RING BUFFER
|page |
@@ -172,21 +191,22 @@ only.
It is possible that the page swapped is the commit page and the tail page,
if what is in the ring buffer is less than what is held in a buffer page.
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
This case is still valid for this algorithm.
When the writer leaves the page, it simply goes into the ring buffer
@@ -196,15 +216,19 @@ buffer.
The main pointers:
- reader page - The page used solely by the reader and is not part
- of the ring buffer (may be swapped in)
+ reader page
+ - The page used solely by the reader and is not part
+ of the ring buffer (may be swapped in)
- head page - the next page in the ring buffer that will be swapped
+ head page
+ - the next page in the ring buffer that will be swapped
with the reader page.
- tail page - the page where the next write will take place.
+ tail page
+ - the page where the next write will take place.
- commit page - the page that last finished a write.
+ commit page
+ - the page that last finished a write.
The commit page only is updated by the outermost writer in the
writer stack. A writer that preempts another writer will not move the
@@ -219,7 +243,7 @@ transaction. If another write happens it must finish before continuing
with the previous write.
- Write reserve:
+ Write reserve::
Buffer page
+---------+
@@ -230,7 +254,7 @@ with the previous write.
| empty |
+---------+
- Write commit:
+ Write commit::
Buffer page
+---------+
@@ -242,7 +266,7 @@ with the previous write.
+---------+
- If a write happens after the first reserve:
+ If a write happens after the first reserve::
Buffer page
+---------+
@@ -253,7 +277,7 @@ with the previous write.
|reserved |
+---------+ <--- tail pointer
- After second writer commits:
+ After second writer commits::
Buffer page
@@ -266,7 +290,7 @@ with the previous write.
|commit |
+---------+ <--- tail pointer
- When the first writer commits:
+ When the first writer commits::
Buffer page
+---------+
@@ -292,21 +316,22 @@ be several pages ahead. If the tail page catches up to the commit
page then no more writes may take place (regardless of the mode
of the ring buffer: overwrite and produce/consumer).
-The order of pages is:
+The order of pages is::
head page
commit page
tail page
-Possible scenario:
- tail page
- head page commit page |
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+Possible scenario::
+
+ tail page
+ head page commit page |
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
There is a special case that the head page is after either the commit page
and possibly the tail page. That is when the commit (and tail) page has been
@@ -315,24 +340,25 @@ part of the ring buffer, but the reader page is not. Whenever there
has been less than a full page that has been committed inside the ring buffer,
and a reader swaps out a page, it will be swapping out the commit page.
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
In this case, the head page will not move when the tail and commit
@@ -347,42 +373,42 @@ When the tail meets the head page, if the buffer is in overwrite mode,
the head page will be pushed ahead one. If the buffer is in producer/consumer
mode, the write will fail.
-Overwrite mode:
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
+Overwrite mode::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
Note, the reader page will still point to the previous head page.
But when a swap takes place, it will use the most recent head page.
@@ -397,7 +423,7 @@ State flags are placed inside the pointer to the page. To do this,
each page must be aligned in memory by 4 bytes. This will allow the 2
least significant bits of the address to be used as flags, since
they will always be zero for the address. To get the address,
-simply mask out the flags.
+simply mask out the flags::
MASK = ~3
@@ -405,24 +431,27 @@ simply mask out the flags.
Two flags will be kept by these two bits:
- HEADER - the page being pointed to is a head page
+ HEADER
+ - the page being pointed to is a head page
- UPDATE - the page being pointed to is being updated by a writer
+ UPDATE
+ - the page being pointed to is being updated by a writer
and was or is about to be a head page.
+::
- reader page
- |
- v
- +---+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ reader page
+ |
+ v
+ +---+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above pointer "-H->" would have the HEADER flag set. That is
@@ -430,24 +459,24 @@ the next page is the next page to be swapped out by the reader.
This pointer means the next page is the head page.
When the tail page meets the head pointer, it will use cmpxchg to
-change the pointer to the UPDATE state:
+change the pointer to the UPDATE state::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
"-U->" represents a pointer in the UPDATE state.
@@ -462,7 +491,7 @@ head page does not have the HEADER flag set, the compare will fail
and the reader will need to look for the new head page and try again.
Note, the flags UPDATE and HEADER are never set at the same time.
-The reader swaps the reader page as follows:
+The reader swaps the reader page as follows::
+------+
|reader| RING BUFFER
@@ -477,7 +506,7 @@ The reader swaps the reader page as follows:
+-----H-------------+
The reader sets the reader page next pointer as HEADER to the page after
-the head page.
+the head page::
+------+
@@ -495,7 +524,7 @@ the head page.
It does a cmpxchg with the pointer to the previous head page to make it
point to the reader page. Note that the new pointer does not have the HEADER
-flag set. This action atomically moves the head page forward.
+flag set. This action atomically moves the head page forward::
+------+
|reader| RING BUFFER
@@ -511,7 +540,7 @@ flag set. This action atomically moves the head page forward.
+------------------------------------+
After the new head page is set, the previous pointer of the head page is
-updated to the reader page.
+updated to the reader page::
+------+
|reader| RING BUFFER
@@ -548,7 +577,7 @@ prev pointers may not.
Note, the way to determine a reader page is simply by examining the previous
pointer of the page. If the next pointer of the previous page does not
-point back to the original page, then the original page is a reader page:
+point back to the original page, then the original page is a reader page::
+--------+
@@ -572,54 +601,54 @@ not be able to swap the head page from the buffer, nor will it be able to
move the head page, until the writer is finished with the move.
This eliminates any races that the reader can have on the writer. The reader
-must spin, and this is why the reader cannot preempt the writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The following page will be made into the new head page.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+must spin, and this is why the reader cannot preempt the writer::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The following page will be made into the new head page::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
After the new head page has been set, we can set the old head page
-pointer back to NORMAL.
+pointer back to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-After the head page has been moved, the tail page may now move forward.
+After the head page has been moved, the tail page may now move forward::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above are the trivial updates. Now for the more complex scenarios.
@@ -630,26 +659,26 @@ tail page may make it all the way around the buffer and meet the commit
page. At this time, we must start dropping writes (usually with some kind
of warning to the user). But what happens if the commit was still on the
reader page? The commit page is not part of the ring buffer. The tail page
-must account for this.
-
-
- reader page commit page
- | |
- v |
- +---+ |
- | |<----------+
- | |
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- tail page
+must account for this::
+
+
+ reader page commit page
+ | |
+ v |
+ +---+ |
+ | |<----------+
+ | |
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ tail page
If the tail page were to simply push the head page forward, the commit when
leaving the reader page would not be pointing to the correct page.
@@ -676,7 +705,7 @@ the head page if the head page is the next page. If the head page
is not the next page, the tail page is simply updated with a cmpxchg.
Only writers move the tail page. This must be done atomically to protect
-against nested writers.
+against nested writers::
temp_page = tail_page
next_page = temp_page->next
@@ -684,54 +713,54 @@ against nested writers.
The above will update the tail page if it is still pointing to the expected
page. If this fails, a nested write pushed it forward, the current write
-does not need to push it.
-
-
- temp page
- |
- v
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Nested write comes in and moves the tail page forward:
-
- tail page (moved by nested writer)
- temp page |
- | |
- v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+does not need to push it::
+
+
+ temp page
+ |
+ v
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Nested write comes in and moves the tail page forward::
+
+ tail page (moved by nested writer)
+ temp page |
+ | |
+ v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above would fail the cmpxchg, but since the tail page has already
been moved forward, the writer will just try again to reserve storage
on the new tail page.
-But the moving of the head page is a bit more complex.
+But the moving of the head page is a bit more complex::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The write converts the head page pointer to UPDATE.
+The write converts the head page pointer to UPDATE::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But if a nested writer preempts here, it will see that the next
page is a head page, but it is also nested. It will detect that
@@ -739,217 +768,216 @@ it is nested and will save that information. The detection is the
fact that it sees the UPDATE flag instead of a HEADER or NORMAL
pointer.
-The nested writer will set the new head page pointer.
+The nested writer will set the new head page pointer::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But it will not reset the update back to normal. Only the writer
that converted a pointer from HEAD to UPDATE will convert it back
-to NORMAL.
+to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
After the nested writer finishes, the outermost writer will convert
-the UPDATE pointer to NORMAL.
+the UPDATE pointer to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
It can be even more complex if several nested writes came in and moved
-the tail page ahead several pages:
+the tail page ahead several pages::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The write converts the head page pointer to UPDATE.
+The write converts the head page pointer to UPDATE::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Next writer comes in, and sees the update and sets up the new
-head page.
+head page::
-(second writer)
+ (second writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The nested writer moves the tail page forward. But does not set the old
-update page to NORMAL because it is not the outermost writer.
+update page to NORMAL because it is not the outermost writer::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Another writer preempts and sees the page after the tail page is a head page.
-It changes it from HEAD to UPDATE.
+It changes it from HEAD to UPDATE::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The writer will move the head page forward:
+The writer will move the head page forward::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But now that the third writer did change the HEAD flag to UPDATE it
-will convert it to normal:
+will convert it to normal::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-Then it will move the tail page, and return back to the second writer.
+Then it will move the tail page, and return back to the second writer::
-(second writer)
+ (second writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The second writer will fail to move the tail page because it was already
moved, so it will try again and add its data to the new tail page.
-It will return to the first writer.
+It will return to the first writer::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The first writer cannot know atomically if the tail page moved
while it updates the HEAD page. It will then update the head page to
-what it thinks is the new head page.
+what it thinks is the new head page::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Since the cmpxchg returns the old value of the pointer the first writer
will see it succeeded in updating the pointer from NORMAL to HEAD.
But as we can see, this is not good enough. It must also check to see
-if the tail page is either where it use to be or on the next page:
+if the tail page is either where it used to be or on the next page::
-(first writer)
+ (first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
If tail page != A and tail page != B, then it must reset the pointer
back to NORMAL. The fact that it only needs to worry about nested
-writers means that it only needs to check this after setting the HEAD page.
+writers means that it only needs to check this after setting the HEAD page::
-(first writer)
+ (first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Now the writer can update the head page. This is also why the head page must
remain in UPDATE and only reset by the outermost writer. This prevents
-the reader from seeing the incorrect head page.
-
+the reader from seeing the incorrect head page::
-(first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ (first writer)
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
diff --git a/Documentation/trace/rv/da_monitor_instrumentation.rst b/Documentation/trace/rv/da_monitor_instrumentation.rst
new file mode 100644
index 000000000000..6c67c7b57811
--- /dev/null
+++ b/Documentation/trace/rv/da_monitor_instrumentation.rst
@@ -0,0 +1,171 @@
+Deterministic Automata Instrumentation
+======================================
+
+The RV monitor file created by dot2k, with the name "$MODEL_NAME.c",
+includes a section dedicated to instrumentation.
+
+In the example of the wip.dot monitor created in [1], it will look like::
+
+ /*
+ * This is the instrumentation part of the monitor.
+ *
+ * This is the section where manual work is required. Here the kernel events
+ * are translated into model's event.
+ *
+ */
+ static void handle_preempt_disable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_disable_wip);
+ }
+
+ static void handle_preempt_enable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_enable_wip);
+ }
+
+ static void handle_sched_waking(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(sched_waking_wip);
+ }
+
+ static int enable_wip(void)
+ {
+ int retval;
+
+ retval = da_monitor_init_wip();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_disable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_enable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_sched_waking);
+
+ return 0;
+ }
+
+The comment at the top of the section explains the general idea: the
+instrumentation section translates *kernel events* into the *model's
+events*.
+
+Tracing callback functions
+--------------------------
+
+The first three functions are the starting point of the callback *handler
+functions* for each of the three events from the wip model. The developer
+does not necessarily need to use them: they are just starting points.
+
+Using the example of::
+
+ void handle_preempt_disable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_disable_wip);
+ }
+
+The preempt_disable event from the model connects directly to the
+preemptirq:preempt_disable tracepoint. The preemptirq:preempt_disable event
+has the following signature, from include/trace/events/preemptirq.h::
+
+ TP_PROTO(unsigned long ip, unsigned long parent_ip)
+
+Hence, the handle_preempt_disable() function will look like::
+
+ void handle_preempt_disable(void *data, unsigned long ip, unsigned long parent_ip)
+
+In this case, the kernel event translates one to one with the automata
+event, and indeed, no other change is required for this function.
+
+The next handler function, handle_preempt_enable(), has the same argument
+list as handle_preempt_disable(). The difference is that the
+preempt_enable event will be used to synchronize the system to the model.
+
+Initially, the *model* is placed in the initial state. However, the *system*
+might or might not be in the initial state. The monitor cannot start
+processing events until it knows that the system has reached the initial state.
+Otherwise, the monitor and the system could be out-of-sync.
+
+Looking at the automata definition, it is possible to see that the system
+and the model are expected to return to the initial state after the
+preempt_enable execution. Hence, it can be used to synchronize the
+system and the model at the initialization of the monitoring section.
+
+The monitor is informed of this starting point via a special handle function,
+"da_handle_start_event_$(MONITOR_NAME)(event)", in this case::
+
+ da_handle_start_event_wip(preempt_enable_wip);
+
+So, the callback function will look like::
+
+ void handle_preempt_enable(void *data, unsigned long ip, unsigned long parent_ip)
+ {
+ da_handle_start_event_wip(preempt_enable_wip);
+ }
+
+Finally, the "handle_sched_waking()" will look like::
+
+ void handle_sched_waking(void *data, struct task_struct *task)
+ {
+ da_handle_event_wip(sched_waking_wip);
+ }
+
+And the explanation is left for the reader as an exercise.
+
+enable and disable functions
+----------------------------
+
+dot2k automatically creates two special functions::
+
+ enable_$(MONITOR_NAME)()
+ disable_$(MONITOR_NAME)()
+
+These functions are called when the monitor is enabled and disabled,
+respectively.
+
+They should be used to *attach* and *detach* the instrumentation to and from
+the running system. The developer must add to the respective function
+everything that is needed to *attach* and *detach* the monitor to and from
+the system.
+
+For the wip case, these functions were named::
+
+ enable_wip()
+ disable_wip()
+
+But no change was required because, by default, these functions *attach* and
+*detach* the tracepoints_to_attach, which was enough for this case.
+
+Instrumentation helpers
+-----------------------
+
+To complete the instrumentation, the *handler functions* need to be attached to
+a kernel event during the monitor enable phase.
+
+The RV interface also facilitates this step. For example, the macro "rv_attach_trace_probe()"
+is used to connect the wip model events to the corresponding kernel tracepoints. dot2k
+automatically adds an "rv_attach_trace_probe()" call for each model event in the enable
+phase, as a suggestion.
+
+For example, from the wip sample model::
+
+ static int enable_wip(void)
+ {
+ int retval;
+
+ retval = da_monitor_init_wip();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_enable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_sched_waking);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_disable);
+
+ return 0;
+ }
+
+The probes then need to be detached at the disable phase.
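+
+As a rough sketch of that step (assuming a matching "rv_detach_trace_probe()"
+helper and a "da_monitor_destroy_wip()" function generated by the monitor
+macros; both names are assumptions here, so check the generated code), the
+disable function could look like::
+
+  static void disable_wip(void)
+  {
+          rv_detach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_disable);
+          rv_detach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_enable);
+          rv_detach_trace_probe("wip", /* XXX: tracepoint */, handle_sched_waking);
+
+          da_monitor_destroy_wip();
+  }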
+
+[1] The wip model is presented in::
+
+ Documentation/trace/rv/deterministic_automata.rst
+
+The wip monitor is presented in::
+
+ Documentation/trace/rv/da_monitor_synthesis.rst
diff --git a/Documentation/trace/rv/da_monitor_synthesis.rst b/Documentation/trace/rv/da_monitor_synthesis.rst
new file mode 100644
index 000000000000..0dbdcd1e62b9
--- /dev/null
+++ b/Documentation/trace/rv/da_monitor_synthesis.rst
@@ -0,0 +1,147 @@
+Deterministic Automata Monitor Synthesis
+========================================
+
+The starting point for the application of runtime verification (RV) techniques
+is the *specification* or *modeling* of the desired (or undesired) behavior
+of the system under scrutiny.
+
+The formal representation then needs to be *synthesized* into a *monitor*
+that can be used in the analysis of the trace of the system. The
+*monitor* connects to the system via an *instrumentation* that converts
+the events from the *system* to the events of the *specification*.
+
+
+In Linux terms, the runtime verification monitors are encapsulated inside
+the *RV monitor* abstraction. The RV monitor includes a set of instances
+of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
+functions that glue the monitor to the system reference model, and the
+trace output as a reaction to event parsing and exceptions, as depicted
+below::
+
+ Linux +----- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+ +-------------------+ +----------------+ +-----------------+
+ | Linux kernel | | Monitor | | Reference |
+ | Tracing | -> | Instance(s) | <- | Model |
+ | (instrumentation) | | (verification) | | (specification) |
+ +-------------------+ +----------------+ +-----------------+
+ | | |
+ | V |
+ | +----------+ |
+ | | Reaction | |
+ | +--+--+--+-+ |
+ | | | | |
+ | | | +-> trace output ? |
+ +------------------------|--|----------------------+
+ | +----> panic ?
+ +-------> <user-specified>
+
+DA monitor synthesis
+--------------------
+
+The synthesis of automata-based models into the Linux *RV monitor* abstraction
+is automated by the dot2k tool and the rv/da_monitor.h header file that
+contains a set of macros that automatically generate the monitor's code.
+
+dot2k
+-----
+
+The dot2k utility leverages dot2c by converting an automaton model in
+the DOT format into the C representation [1] and creating the skeleton of
+a kernel monitor in C.
+
+For example, it is possible to transform the wip.dot model present in
+[1] into a per-cpu monitor with the following command::
+
+ $ dot2k -d wip.dot -t per_cpu
+
+This will create a directory named wip/ with the following files:
+
+- wip.h: the wip model in C
+- wip.c: the RV monitor
+
+The wip.c file contains the monitor declaration and the starting point for
+the system instrumentation.
+
+Monitor macros
+--------------
+
+The rv/da_monitor.h header enables automatic code generation for the *Monitor
+Instance(s)* using C macros.
+
+The benefits of the usage of macros for monitor synthesis are threefold, as it:
+
+- Reduces the code duplication;
+- Facilitates the bug fix/improvement;
+- Avoids the case of developers changing the core of the monitor code
+ to manipulate the model in a (let's say) non-standard way.
+
+This initial implementation presents three different types of monitor instances:
+
+- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
+- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
+- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
+
+The first declares the functions for a global deterministic automata monitor,
+the second for monitors with per-cpu instances, and the third with per-task
+instances.
+
+In all cases, the 'name' argument is a string that identifies the monitor, and
+the 'type' argument is the data type used by dot2k for the representation of
+the model in C.
+
+For example, the wip model with two states and three events can be
+stored in an 'unsigned char' type. Considering that the preemption control
+is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
+
+ DECLARE_DA_MON_PER_CPU(wip, unsigned char);
+
+The monitor is executed by sending events to be processed via the functions
+presented below::
+
+ da_handle_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
+
+The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
+the event will be processed if the monitor is processing events.
+
+When a monitor is enabled, it is placed in the initial state of the automaton.
+However, the monitor does not know if the system is in the *initial state*.
+
+The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
+monitor that the system is returning to the initial state, so the monitor can
+start monitoring the next event.
+
+The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
+the monitor that the system is known to be in the initial state, so the
+monitor can start monitoring, processing the current event as well.
+
+Using the wip model as an example, the events "preempt_disable" and
+"sched_waking" should be sent to the monitor, respectively, via [2]::
+
+ da_handle_event_wip(preempt_disable_wip);
+ da_handle_event_wip(sched_waking_wip);
+
+While the event "preempt_enable" will use::
+
+ da_handle_start_event_wip(preempt_enable_wip);
+
+This notifies the monitor that the system is returning to the initial state,
+so the system and the monitor will be in sync.
+
+Final remarks
+-------------
+
+With the monitor synthesis in place using the rv/da_monitor.h and
+dot2k, the developer's work should be limited to the instrumentation
+of the system, increasing the confidence in the overall approach.
+
+[1] For details about deterministic automata format and the translation
+from one representation to another, see::
+
+ Documentation/trace/rv/deterministic_automata.rst
+
+[2] dot2k appends the monitor's name as a suffix to the event enums to
+avoid conflicting variables when exporting the global vmlinux.h
+used by BPF programs.
diff --git a/Documentation/trace/rv/deterministic_automata.rst b/Documentation/trace/rv/deterministic_automata.rst
new file mode 100644
index 000000000000..d0638f95a455
--- /dev/null
+++ b/Documentation/trace/rv/deterministic_automata.rst
@@ -0,0 +1,184 @@
+Deterministic Automata
+======================
+
+Formally, a deterministic automaton, denoted by G, is defined as a quintuple:
+
+ *G* = { *X*, *E*, *f*, x\ :subscript:`0`, X\ :subscript:`m` }
+
+where:
+
+- *X* is the set of states;
+- *E* is the finite set of events;
+- x\ :subscript:`0` is the initial state;
+- X\ :subscript:`m` (subset of *X*) is the set of marked (or final) states.
+- *f* : *X* x *E* -> *X* is the transition function. It defines the state
+ transition in the occurrence of an event from *E* in the state *X*. In the
+ special case of deterministic automata, the occurrence of the event in *E*
+ in a state in *X* has a deterministic next state from *X*.
+
+For example, a given automaton named 'wip' (wakeup in preemptive) can
+be defined as:
+
+- *X* = { ``preemptive``, ``non_preemptive``}
+- *E* = { ``preempt_enable``, ``preempt_disable``, ``sched_waking``}
+- x\ :subscript:`0` = ``preemptive``
+- X\ :subscript:`m` = {``preemptive``}
+- *f* =
+ - *f*\ (``preemptive``, ``preempt_disable``) = ``non_preemptive``
+ - *f*\ (``non_preemptive``, ``sched_waking``) = ``non_preemptive``
+ - *f*\ (``non_preemptive``, ``preempt_enable``) = ``preemptive``
+
+One of the benefits of this formal definition is that it can be presented
+in multiple formats without any loss. For example, it can be shown as a
+*graphical representation*, using vertices (nodes) and edges, which is
+very intuitive for *operating system* practitioners.
+
+The previous 'wip' automaton can also be represented as::
+
+ preempt_enable
+ +---------------------------------+
+ v |
+ #============# preempt_disable +------------------+
+ --> H preemptive H -----------------> | non_preemptive |
+ #============# +------------------+
+ ^ |
+ | sched_waking |
+ +--------------+
+
+Deterministic Automaton in C
+----------------------------
+
+In the paper "Efficient formal verification for the Linux kernel",
+the authors present a simple way to represent an automaton in C that can
+be used as regular code in the Linux kernel.
+
+For example, the 'wip' automaton can be presented as (augmented with comments)::
+
+ /* enum representation of X (set of states) to be used as index */
+ enum states {
+ preemptive = 0,
+ non_preemptive,
+ state_max
+ };
+
+ #define INVALID_STATE state_max
+
+ /* enum representation of E (set of events) to be used as index */
+ enum events {
+ preempt_disable = 0,
+ preempt_enable,
+ sched_waking,
+ event_max
+ };
+
+ struct automaton {
+ char *state_names[state_max]; // X: the set of states
+ char *event_names[event_max]; // E: the finite set of events
+ unsigned char function[state_max][event_max]; // f: transition function
+ unsigned char initial_state; // x_0: the initial state
+ bool final_states[state_max]; // X_m: the set of marked states
+ };
+
+ struct automaton aut = {
+ .state_names = {
+ "preemptive",
+ "non_preemptive"
+ },
+ .event_names = {
+ "preempt_disable",
+ "preempt_enable",
+ "sched_waking"
+ },
+ .function = {
+ { non_preemptive, INVALID_STATE, INVALID_STATE },
+ { INVALID_STATE, preemptive, non_preemptive },
+ },
+ .initial_state = preemptive,
+ .final_states = { 1, 0 },
+ };
+
+The *transition function* is represented as a matrix of states (rows) and
+events (columns), and so the function *f* : *X* x *E* -> *X* can be solved
+in O(1). For example::
+
+ next_state = automaton_wip.function[curr_state][event];
+
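+Building on the structures above (using the ``aut`` instance and the enums from
+the previous code sample), a minimal illustrative sketch of consuming the table
+for one event, which is not the actual in-kernel da_monitor code, could be::
+
+  /* Advance the model by one event; false means the event was not expected. */
+  static bool process_event(enum states *curr_state, enum events event)
+  {
+          unsigned char next_state = aut.function[*curr_state][event];
+
+          if (next_state == INVALID_STATE)
+                  return false;   /* a monitor would react to the violation here */
+
+          *curr_state = next_state;
+          return true;
+  }
+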
+Graphviz .dot format
+--------------------
+
+The Graphviz open-source tool can produce the graphical representation
+of an automaton using the (textual) DOT language as the source code.
+The DOT format is widely used and can be converted to many other formats.
+
+For example, this is the 'wip' model in DOT::
+
+ digraph state_automaton {
+ {node [shape = circle] "non_preemptive"};
+ {node [shape = plaintext, style=invis, label=""] "__init_preemptive"};
+ {node [shape = doublecircle] "preemptive"};
+ {node [shape = circle] "preemptive"};
+ "__init_preemptive" -> "preemptive";
+ "non_preemptive" [label = "non_preemptive"];
+ "non_preemptive" -> "non_preemptive" [ label = "sched_waking" ];
+ "non_preemptive" -> "preemptive" [ label = "preempt_enable" ];
+ "preemptive" [label = "preemptive"];
+ "preemptive" -> "non_preemptive" [ label = "preempt_disable" ];
+ { rank = min ;
+ "__init_preemptive";
+ "preemptive";
+ }
+ }
+
+This DOT format can be transformed into a bitmap or vector image
+using the dot utility, or into ASCII art using graph-easy. For
+instance::
+
+ $ dot -Tsvg -o wip.svg wip.dot
+ $ graph-easy wip.dot > wip.txt
+
+dot2c
+-----
+
+dot2c is a utility that can parse a .dot file containing an automaton as
+in the example above and automatically convert it to the C representation
+presented in [3].
+
+For example, with the previous 'wip' model in a file named 'wip.dot',
+the following command will transform the .dot file into the C
+representation (previously shown) in the 'wip.h' file::
+
+ $ dot2c wip.dot > wip.h
+
+The 'wip.h' content is the code sample in section 'Deterministic Automaton
+in C'.
+
+Remarks
+-------
+
+The automata formalism allows modeling discrete event systems (DES) in
+multiple formats, suitable for different applications/users.
+
+For example, the formal description using set theory is better suited
+for automata operations, while the graphical format is better for human
+interpretation, and computer languages for machine execution.
+
+References
+----------
+
+Many textbooks cover automata formalism. For a brief introduction see::
+
+ O'Regan, Gerard. Concise guide to software engineering. Springer,
+ Cham, 2017.
+
+For a detailed description, including operations, and application on Discrete
+Event Systems (DES), see::
+
+ Cassandras, Christos G., and Stephane Lafortune, eds. Introduction to discrete
+ event systems. Boston, MA: Springer US, 2008.
+
+For the C representation in kernel, see::
+
+ De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
+ Silva. Efficient formal verification for the Linux kernel. In:
+ International Conference on Software Engineering and Formal Methods.
+ Springer, Cham, 2019. p. 315-332.
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
new file mode 100644
index 000000000000..15fa966102c0
--- /dev/null
+++ b/Documentation/trace/rv/index.rst
@@ -0,0 +1,14 @@
+====================
+Runtime Verification
+====================
+
+.. toctree::
+ :maxdepth: 2
+ :glob:
+
+ runtime-verification.rst
+ deterministic_automata.rst
+ da_monitor_synthesis.rst
+ da_monitor_instrumentation.rst
+ monitor_wip.rst
+ monitor_wwnr.rst
diff --git a/Documentation/trace/rv/monitor_wip.rst b/Documentation/trace/rv/monitor_wip.rst
new file mode 100644
index 000000000000..a95763438c48
--- /dev/null
+++ b/Documentation/trace/rv/monitor_wip.rst
@@ -0,0 +1,55 @@
+Monitor wip
+===========
+
+- Name: wip - wakeup in preemptive
+- Type: per-cpu deterministic automaton
+- Author: Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+The wakeup in preemptive (wip) monitor is a sample per-cpu monitor
+that verifies if the wakeup events always take place with
+preemption disabled::
+
+ |
+ |
+ v
+ #==================#
+ H preemptive H <+
+ #==================# |
+ | |
+ | preempt_disable | preempt_enable
+ v |
+ sched_waking +------------------+ |
+ +--------------- | | |
+ | | non_preemptive | |
+ +--------------> | | -+
+ +------------------+
+
+The wakeup event always takes place with preemption disabled because
+of the scheduler synchronization. However, because the preempt_count
+and its trace event are not atomic with regard to interrupts, some
+inconsistencies might happen. For example::
+
+ preempt_disable() {
+ __preempt_count_add(1)
+ -------> smp_apic_timer_interrupt() {
+ preempt_disable()
+ do not trace (preempt count >= 1)
+
+ wake up a thread
+
+ preempt_enable()
+ do not trace (preempt count >= 1)
+ }
+ <------
+ trace_preempt_disable();
+ }
+
+This problem was reported and discussed here:
+ https://lore.kernel.org/r/cover.1559051152.git.bristot@redhat.com/
+
+Specification
+-------------
+Graphviz Dot file in tools/verification/models/wip.dot
diff --git a/Documentation/trace/rv/monitor_wwnr.rst b/Documentation/trace/rv/monitor_wwnr.rst
new file mode 100644
index 000000000000..80f1777b85aa
--- /dev/null
+++ b/Documentation/trace/rv/monitor_wwnr.rst
@@ -0,0 +1,45 @@
+Monitor wwnr
+============
+
+- Name: wwnr - wakeup while not running
+- Type: per-task deterministic automaton
+- Author: Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+This is a per-task sample monitor, with the following
+definition::
+
+ |
+ |
+ v
+ wakeup +-------------+
+ +--------- | |
+ | | not_running |
+ +--------> | | <+
+ +-------------+ |
+ | |
+ | switch_in | switch_out
+ v |
+ +-------------+ |
+ | running | -+
+ +-------------+
+
+This model is broken: the reason is that a task can be running
+on the processor without being set as RUNNABLE. Think about a
+task about to sleep::
+
+ 1: set_current_state(TASK_UNINTERRUPTIBLE);
+ 2: schedule();
+
+And then imagine an IRQ happening between lines one and two,
+waking the task up. BOOM, the wakeup will happen while the task is
+running.
+
+- Why do we need this model, then?
+- To test the reactors.
+
+Specification
+-------------
+Graphviz Dot file in tools/verification/models/wwnr.dot
diff --git a/Documentation/trace/rv/runtime-verification.rst b/Documentation/trace/rv/runtime-verification.rst
new file mode 100644
index 000000000000..c46b6149470e
--- /dev/null
+++ b/Documentation/trace/rv/runtime-verification.rst
@@ -0,0 +1,231 @@
+====================
+Runtime Verification
+====================
+
+Runtime Verification (RV) is a lightweight (yet rigorous) method that
+complements classical exhaustive verification techniques (such as *model
+checking* and *theorem proving*) with a more practical approach for complex
+systems.
+
+Instead of relying on a fine-grained model of a system (e.g., a
+re-implementation at the instruction level), RV works by analyzing the trace of the
+system's actual execution, comparing it against a formal specification of
+the system behavior.
+
+The main advantage is that RV can give precise information on the runtime
+behavior of the monitored system, without the pitfalls of developing models
+that require a re-implementation of the entire system in a modeling language.
+Moreover, given an efficient monitoring method, it is possible to execute an
+*online* verification of a system, enabling the *reaction* to unexpected
+events, avoiding, for example, the propagation of a failure in safety-critical
+systems.
+
+Runtime Monitors and Reactors
+=============================
+
+A monitor is the central part of the runtime verification of a system. The
+monitor stands in between the formal specification of the desired (or
+undesired) behavior, and the trace of the actual system.
+
+In Linux terms, the runtime verification monitors are encapsulated inside the
+*RV monitor* abstraction. An *RV monitor* includes a reference model of the
+system, a set of instances of the monitor (per-cpu monitor, per-task monitor,
+and so on), and the helper functions that glue the monitor to the system via
+trace, as depicted below::
+
+ Linux +---- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+ +-------------------+ +----------------+ +-----------------+
+ | Linux kernel | | Monitor | | Reference |
+ | Tracing | -> | Instance(s) | <- | Model |
+ | (instrumentation) | | (verification) | | (specification) |
+ +-------------------+ +----------------+ +-----------------+
+ | | |
+ | V |
+ | +----------+ |
+ | | Reaction | |
+ | +--+--+--+-+ |
+ | | | | |
+ | | | +-> trace output ? |
+ +------------------------|--|----------------------+
+ | +----> panic ?
+ +-------> <user-specified>
+
+In addition to the verification and monitoring of the system, a monitor can
+react to an unexpected event. The forms of reaction can vary from logging the
+event occurrence to the enforcement of the correct behavior to the extreme
+action of taking a system down to avoid the propagation of a failure.
+
+In Linux terms, a *reactor* is a reaction method available for *RV monitors*.
+By default, all monitors should provide a trace output of their actions,
+which is already a reaction. In addition, other reactions will be available
+so the user can enable them as needed.
+
+For further information about the principles of runtime verification and
+RV applied to Linux:
+
+ Bartocci, Ezio, et al. *Introduction to runtime verification.* In: Lectures on
+ Runtime Verification. Springer, Cham, 2018. p. 1-33.
+
+ Falcone, Ylies, et al. *A taxonomy for classifying runtime verification tools.*
+ In: International Conference on Runtime Verification. Springer, Cham, 2018. p.
+ 241-262.
+
+ De Oliveira, Daniel Bristot. *Automata-based formal analysis and
+ verification of the real-time Linux kernel.* Ph.D. Thesis, 2020.
+
+Online RV monitors
+==================
+
+Monitors can be classified as *offline* and *online* monitors. *Offline*
+monitors process the traces generated by a system after the events, generally by
+reading the trace of the execution from a permanent storage system. *Online* monitors
+process the trace during the execution of the system. Online monitors are said
+to be *synchronous* if the processing of an event is attached to the system
+execution, blocking the system during the event monitoring. On the other hand,
+an *asynchronous* monitor has its execution detached from the system. Each type
+of monitor has a set of advantages. For example, *offline* monitors can be
+executed on different machines but require operations to save the log to a
+file. In contrast, a *synchronous online* method can react at the exact moment
+a violation occurs.
+
+Another important aspect regarding monitors is the overhead associated with the
+event analysis. If the system generates events at a frequency higher than the
+monitor's ability to process them on the same system, only the *offline*
+methods are viable. On the other hand, if the tracing of the events incurs
+higher overhead than the simple handling of an event by a monitor, then the
+*synchronous online* monitors will incur lower overhead.
+
+Indeed, the research presented in:
+
+ De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo Silva.
+ *Efficient formal verification for the Linux kernel.* In: International
+ Conference on Software Engineering and Formal Methods. Springer, Cham, 2019.
+ p. 315-332.
+
+shows that for Deterministic Automata models, the synchronous processing of
+events in-kernel causes lower overhead than saving the same events to the trace
+buffer, not even considering collecting the trace for user-space analysis.
+This motivated the development of an in-kernel interface for online monitors.
+
+For further information about modeling of Linux kernel behavior using automata,
+see:
+
+ De Oliveira, Daniel B.; De Oliveira, Romulo S.; Cucinotta, Tommaso. *A thread
+ synchronization model for the PREEMPT_RT Linux kernel.* Journal of Systems
+ Architecture, 2020, 107: 101729.
+
+The user interface
+==================
+
+The user interface resembles the tracing interface (on purpose). It is
+currently at "/sys/kernel/tracing/rv/".
+
+The following files/folders are currently available:
+
+**available_monitors**
+
+- Reading lists the available monitors, one per line
+
+For example::
+
+ # cat available_monitors
+ wip
+ wwnr
+
+**available_reactors**
+
+- Reading shows the available reactors, one per line.
+
+For example::
+
+ # cat available_reactors
+ nop
+ panic
+ printk
+
+**enabled_monitors**
+
+- Reading lists the enabled monitors, one per line
+- Writing to it enables a given monitor
+- Writing a monitor name with a '!' prefix disables it
+- Truncating the file disables all enabled monitors
+
+For example::
+
+ # cat enabled_monitors
+ # echo wip > enabled_monitors
+ # echo wwnr >> enabled_monitors
+ # cat enabled_monitors
+ wip
+ wwnr
+ # echo '!wip' >> enabled_monitors
+ # cat enabled_monitors
+ wwnr
+ # echo > enabled_monitors
+ # cat enabled_monitors
+ #
+
+Note that it is possible to enable more than one monitor concurrently.
+
+**monitoring_on**
+
+This is an on/off general switcher for monitoring. It resembles the
+"tracing_on" switcher in the trace interface.
+
+- Writing "0" stops the monitoring
+- Writing "1" continues the monitoring
+- Reading returns the current status of the monitoring
+
+Note that it does not disable the enabled monitors, but stops the per-entity
+monitors from processing the events received from the system.
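+
+For example, a minimal sketch of toggling the switch::
+
+  # echo 0 > monitoring_on
+  # cat monitoring_on
+  0
+  # echo 1 > monitoring_on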
+
+**reacting_on**
+
+- Writing "0" prevents reactions for happening
+- Writing "1" enable reactions
+- Reading returns the current status of the reaction
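+
+For example, a minimal sketch::
+
+  # echo 1 > reacting_on
+  # cat reacting_on
+  1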
+
+**monitors/**
+
+Each monitor will have its own directory inside "monitors/". There the
+monitor-specific files will be presented. The "monitors/" directory resembles
+the "events" directory on tracefs.
+
+For example::
+
+ # cd monitors/wip/
+ # ls
+ desc enable
+ # cat desc
+ wakeup in preemptive per-cpu testing monitor.
+ # cat enable
+ 0
+
+**monitors/MONITOR/desc**
+
+- Reading shows a description of the monitor *MONITOR*
+
+**monitors/MONITOR/enable**
+
+- Writing "0" disables the *MONITOR*
+- Writing "1" enables the *MONITOR*
+- Reading returns the current status of the *MONITOR*
+
+**monitors/MONITOR/reactors**
+
+- Reading lists the available reactors, with the selected reactor for the given
+  *MONITOR* inside "[]". The default one is the nop (no operation) reactor.
+- Writing the name of a reactor enables it for the given MONITOR.
+
+For example::
+
+ # cat monitors/wip/reactors
+ [nop]
+ panic
+ printk
+ # echo panic > monitors/wip/reactors
+ # cat monitors/wip/reactors
+ nop
+ [panic]
+ printk
diff --git a/Documentation/trace/stm.rst b/Documentation/trace/stm.rst
index 99f99963e5e7..1ed49dde04fc 100644
--- a/Documentation/trace/stm.rst
+++ b/Documentation/trace/stm.rst
@@ -33,8 +33,8 @@ This policy is a tree structure containing rules (policy_node) that
have a name (string identifier) and a range of masters and channels
associated with it, located in "stp-policy" subsystem directory in
configfs. The topmost directory's name (the policy) is formatted as
-the STM device name to which this policy applies and and arbitrary
-string identifier separated by a stop. From the examle above, a rule
+the STM device name to which this policy applies and an arbitrary
+string identifier separated by a stop. From the example above, a rule
may look like this::
$ ls /config/stp-policy/dummy_stm.my-policy/user
diff --git a/Documentation/trace/timerlat-tracer.rst b/Documentation/trace/timerlat-tracer.rst
new file mode 100644
index 000000000000..db17df312bc8
--- /dev/null
+++ b/Documentation/trace/timerlat-tracer.rst
@@ -0,0 +1,182 @@
+###############
+Timerlat tracer
+###############
+
+The timerlat tracer aims to help preemptive kernel developers find sources
+of wakeup latencies of real-time threads. Like cyclictest, the tracer sets
+a periodic timer that wakes up a thread. The thread then computes a *wakeup
+latency* value as the difference between the *current time* and the
+*absolute time* that the timer was set to expire. The main goal of timerlat
+is to trace in a way that helps kernel developers find those sources.
+
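+For illustration only, the same *wakeup latency* computation can be sketched
+from user space in the spirit of cyclictest. This is not part of the tracer,
+and the 1 ms period below is an arbitrary example::
+
+  #include <stdio.h>
+  #include <time.h>
+
+  #define PERIOD_NS 1000000L /* example: 1 ms period */
+
+  static long long ts_to_ns(const struct timespec *ts)
+  {
+      return ts->tv_sec * 1000000000LL + ts->tv_nsec;
+  }
+
+  int main(void)
+  {
+      struct timespec next, now;
+
+      clock_gettime(CLOCK_MONOTONIC, &next);
+
+      for (;;) {
+          /* program the next absolute expiration time */
+          next.tv_nsec += PERIOD_NS;
+          if (next.tv_nsec >= 1000000000L) {
+              next.tv_nsec -= 1000000000L;
+              next.tv_sec++;
+          }
+
+          /* sleep until the absolute time, then read the current time */
+          clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
+          clock_gettime(CLOCK_MONOTONIC, &now);
+
+          /* wakeup latency = current time - absolute expiration time */
+          printf("wakeup latency: %lld ns\n",
+                 ts_to_ns(&now) - ts_to_ns(&next));
+      }
+  }
+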
+Usage
+-----
+
+Write the ASCII text "timerlat" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: timerlat
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # || /
+ # |||| ACTIVATION
+ # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
+ # | | | |||| | | | |
+ <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
+ <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
+ <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
+ <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
+ <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
+ <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
+ <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
+ <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
+
+
+The tracer creates a per-cpu kernel thread with real-time priority that
+prints two lines at every activation. The first is the *timer latency*
+observed at the *hardirq* context before the activation of the thread.
+The second is the *timer latency* observed by the thread. The ACTIVATION
+ID field serves to relate the *irq* execution to its respective *thread*
+execution.
+
+The *irq*/*thread* split is important to clarify the context in which an
+unexpectedly high value originates. The *irq* context can be delayed by
+hardware-related actions, such as SMIs and NMIs, by other IRQs, or by a
+thread masking interrupts. Once the timer fires, the delay can also be
+influenced by blocking caused by threads, for example by postponing the
+scheduler execution via preempt_disable() or by masking interrupts.
+Threads can also be delayed by interference from other threads and IRQs.
+
+Tracer options
+---------------------
+
+The timerlat tracer is built on top of the osnoise tracer, so its
+configuration is also done in the osnoise/ config directory. The
+timerlat configs are:
+
+  - cpus: CPUs on which a timerlat thread will execute.
+  - timerlat_period_us: the period of the timerlat thread.
+  - stop_tracing_us: stop the system tracing if a timer latency higher
+    than the configured value happens at the *irq* context. Writing 0
+    disables this option.
+  - stop_tracing_total_us: stop the system tracing if a timer latency
+    higher than the configured value happens at the *thread* context.
+    Writing 0 disables this option.
+ - print_stack: save the stack of the IRQ occurrence. The stack is printed
+ after the *thread context* event, or at the IRQ handler if *stop_tracing_us*
+ is hit.
+
+timerlat and osnoise
+----------------------------
+
+The timerlat tracer can also take advantage of the osnoise: trace events.
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# echo 1 > events/osnoise/enable
+ [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
+ [root@f32 tracing]# tail -10 trace
+ cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 13585 ns
+ cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns
+ cc1-87882 [005] dNLh2.. 548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns
+ cc1-87882 [005] d...3.. 548.771102: thread_noise: cc1:87882 start 548.771078243 duration 9909 ns
+ timerlat/5-1035 [005] ....... 548.771104: #402268 context thread timer_latency 39960 ns
+
+In this case, the timer latency does not have a single root cause but
+multiple ones. Firstly, the timer IRQ was delayed for 13 us, which may
+point to a long IRQ-disabled section (see the IRQ stacktrace section).
+Then the timer interrupt that wakes up the timerlat thread took 7597 ns,
+and the qxl:21 device IRQ took 7139 ns. Finally, the cc1 thread noise
+took 9909 ns before the context switch. Such pieces of evidence point
+the developer to other tracing methods that can be used to debug and
+optimize the system.
+
+It is worth mentioning that the *duration* values reported
+by the osnoise: events are *net* values. For example, the
+thread_noise does not include the duration of the overhead caused
+by the IRQ execution (which indeed accounted for 12736 ns). But
+the values reported by the timerlat tracer (timerlat_latency)
+are *gross* values.
+
+The art below illustrates a CPU timeline and how the timerlat tracer
+observes it at the top and the osnoise: events at the bottom. Each "-"
+in the timelines means circa 1 us, and the time moves ==>::
+
+ External timer irq thread
+ clock latency latency
+ event 13585 ns 39960 ns
+ | ^ ^
+ v | |
+ |-------------| |
+ |-------------+-------------------------|
+ ^ ^
+ ========================================================================
+ [tmr irq] [dev irq]
+ [another thread...^ v..^ v.......][timerlat/ thread] <-- CPU timeline
+ =========================================================================
+ |-------| |-------|
+ |--^ v-------|
+ | | |
+ | | + thread_noise: 9909 ns
+ | +-> irq_noise: 6139 ns
+ +-> irq_noise: 7597 ns
+
+IRQ stacktrace
+---------------------------
+
+The osnoise/print_stack option is helpful for the cases in which thread
+noise is the major factor for the timer latency, because of preemption or
+IRQs being disabled. For example::
+
+ [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us
+ [root@f32 tracing]# echo 500 > osnoise/print_stack
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
+ insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
+ insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns
+ insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
+ insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
+ insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns
+ timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns
+ timerlat/7-1001 [007] ....1.. 200.203446: <stack trace>
+ => timerlat_irq
+ => __hrtimer_run_queues
+ => hrtimer_interrupt
+ => __sysvec_apic_timer_interrupt
+ => asm_call_irq_on_stack
+ => sysvec_apic_timer_interrupt
+ => asm_sysvec_apic_timer_interrupt
+ => delay_tsc
+ => dummy_load_1ms_pd_init
+ => do_one_initcall
+ => do_init_module
+ => __do_sys_finit_module
+ => do_syscall_64
+ => entry_SYSCALL_64_after_hwframe
+
+In this case, it is possible to see that the thread added the highest
+contribution to the *timer latency* and the stack trace, saved during
+the timerlat IRQ handler, points to a function named
+dummy_load_1ms_pd_init, which had the following code (on purpose)::
+
+ static int __init dummy_load_1ms_pd_init(void)
+ {
+ preempt_disable();
+ mdelay(1);
+ preempt_enable();
+ return 0;
+
+ }
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index 6e3ce3bf3593..0cb8d9ca3d60 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -146,3 +146,30 @@ with jump labels and avoid conditional branches.
define tracepoints. Check http://lwn.net/Articles/379903,
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
for a series of articles with more details.
+
+If you require calling a tracepoint from a header file, it is not
+recommended to call one directly or to use the trace_<tracepoint>_enabled()
+function call, for two reasons: tracepoints in header files can have side
+effects if a header is included from a file that has CREATE_TRACE_POINTS
+set, and trace_<tracepoint>() is not that small of an inline and can bloat
+the kernel if used by other inlined functions. Instead, include
+tracepoint-defs.h and use tracepoint_enabled().
+
+In a C file::
+
+ void do_trace_foo_bar_wrapper(args)
+ {
+ trace_foo_bar(args);
+ }
+
+In the header file::
+
+ DECLARE_TRACEPOINT(foo_bar);
+
+ static inline void some_inline_function()
+ {
+ [..]
+ if (tracepoint_enabled(foo_bar))
+ do_trace_foo_bar_wrapper(args);
+ [..]
+ }
diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst
index 98cde99939d7..3a1797d707f4 100644
--- a/Documentation/trace/uprobetracer.rst
+++ b/Documentation/trace/uprobetracer.rst
@@ -26,15 +26,17 @@ Synopsis of uprobe_tracer
-------------------------
::
- p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
- r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
- -:[GRP/]EVENT : Clear uprobe or uretprobe event
+ p[:[GRP/][EVENT]] PATH:OFFSET [FETCHARGS] : Set a uprobe
+ r[:[GRP/][EVENT]] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
+ p[:[GRP/][EVENT]] PATH:OFFSET%return [FETCHARGS] : Set a return uprobe (uretprobe)
+ -:[GRP/][EVENT] : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
EVENT : Event name. If omitted, the event name is generated based
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
+ OFFSET%return : Offset where the return probe is inserted.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
new file mode 100644
index 000000000000..9f181f342a70
--- /dev/null
+++ b/Documentation/trace/user_events.rst
@@ -0,0 +1,238 @@
+=========================================
+user_events: User-based Event Tracing
+=========================================
+
+:Author: Beau Belgrave
+
+Overview
+--------
+User based trace events allow user processes to create events and trace data
+that can be viewed via existing tools, such as ftrace and perf.
+To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
+
+Programs can view status of the events via
+/sys/kernel/debug/tracing/user_events_status and can both register and write
+data out via /sys/kernel/debug/tracing/user_events_data.
+
+Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
+delete user based events via the u: prefix. The format of the command to
+dynamic_events is the same as the ioctl with the u: prefix applied.
+
+Typically programs will register a set of events that they wish to expose to
+tools that can read trace_events (such as ftrace and perf). The registration
+process gives back two ints to the program for each event. The first int is
+the status bit. This describes which bit in little-endian format in the
+/sys/kernel/debug/tracing/user_events_status file represents this event. The
+second int is the write index, which identifies the event when data is written
+via write() or writev() to the /sys/kernel/debug/tracing/user_events_data file.
+
+The structures referenced in this document are contained within the
+/include/uapi/linux/user_events.h file in the source tree.
+
+**NOTE:** *Both user_events_status and user_events_data are under the tracefs
+filesystem and may be mounted at different paths than above.*
+
+Registering
+-----------
+Registering within a user process is done via ioctl() out to the
+/sys/kernel/debug/tracing/user_events_data file. The command to issue is
+DIAG_IOCSREG.
+
+This command takes a packed struct user_reg as an argument::
+
+ struct user_reg {
+ u32 size;
+ u64 name_args;
+ u32 status_bit;
+ u32 write_index;
+ };
+
+The struct user_reg requires two inputs: the first is the size of the structure,
+to ensure forward and backward compatibility, and the second is the command
+string to issue for registering. Upon success, two outputs are set: the status
+bit and the write index.
+
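+As a rough sketch (see samples/user_events for the authoritative example), a
+registration using the interface described above could look like the code
+below; the event name and field are made up for the illustration, and it is
+assumed that DIAG_IOCSREG and struct user_reg come from
+include/uapi/linux/user_events.h::
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <sys/ioctl.h>
+  #include <unistd.h>
+
+  #include <linux/user_events.h>
+
+  int main(void)
+  {
+      /* Hypothetical event: "example_event" with a single u32 field */
+      const char *cmd = "example_event u32 count";
+      struct user_reg reg = {0};
+      int data_fd;
+
+      data_fd = open("/sys/kernel/debug/tracing/user_events_data", O_RDWR);
+      if (data_fd < 0)
+          return 1;
+
+      reg.size = sizeof(reg);
+      reg.name_args = (__u64)cmd;
+
+      if (ioctl(data_fd, DIAG_IOCSREG, &reg) < 0)
+          return 1;
+
+      /* status_bit: which bit to test in user_events_status */
+      /* write_index: value placed at the start of write()/writev() data */
+      printf("status_bit=%u write_index=%u\n", reg.status_bit, reg.write_index);
+
+      close(data_fd);
+      return 0;
+  }
+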
+User based events show up under tracefs like any other event under the
+subsystem named "user_events". This means tools that wish to attach to the
+events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
+or perf record -e user_events:[name] when attaching/recording.
+
+**NOTE:** *The write_index returned is only valid for the FD that was used*
+
+Command Format
+^^^^^^^^^^^^^^
+The command string format is as follows::
+
+ name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]
+
+Supported Flags
+^^^^^^^^^^^^^^^
+None yet
+
+Field Format
+^^^^^^^^^^^^
+::
+
+ type name [size]
+
+Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
+User programs are encouraged to use clearly sized types like u32.
+
+**NOTE:** *Long is not supported since size can vary between user and kernel.*
+
+The size is only valid for types that start with a struct prefix.
+This allows user programs to describe custom structs out to tools, if required.
+
+For example, a struct in C that looks like this::
+
+ struct mytype {
+ char data[20];
+ };
+
+Would be represented by the following field::
+
+ struct mytype myname 20
+
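+Putting the name and field formats together, a full registration command for
+a hypothetical event (the names below are only illustrative) could be::
+
+  example_event u32 count; char[20] msg
+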
+Deleting
+-----------
+Deleting an event from within a user process is done via ioctl() out to the
+/sys/kernel/debug/tracing/user_events_data file. The command to issue is
+DIAG_IOCSDEL.
+
+This command only requires a single string specifying the event to delete by
+its name. Delete will only succeed if there are no references left to the
+event (in both user and kernel space). Because of this, user programs should
+use a separate file for delete requests rather than the one used for
+registration.
+
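+A minimal sketch of such a request, assuming the DIAG_IOCSDEL ioctl takes the
+event name string as its argument (the name reuses the illustrative
+registration example above)::
+
+  #include <sys/ioctl.h>
+
+  #include <linux/user_events.h>
+
+  /* delete_fd: a separate open of user_events_data, as recommended above */
+  static int delete_event(int delete_fd, const char *name)
+  {
+      return ioctl(delete_fd, DIAG_IOCSDEL, name);
+  }
+
+  /* e.g. delete_event(delete_fd, "example_event"); */
+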
+Status
+------
+When tools attach to or record user based events, the status of the event is
+updated in real time. This allows user programs to only incur the cost of the
+write() or
+writev() calls when something is actively attached to the event.
+
+User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
+check the status for each event that is registered. The bit to check in the
+file is given back after the register ioctl() via user_reg.status_bit. The bit
+is always in little-endian format. Programs can check if the bit is set either
+using a byte-wise index with a mask or a long-wise index with a little-endian
+mask.
+
+Currently the size of user_events_status is a single page; however, custom
+kernel configurations can change this size to allow more user based events. In
+all cases the size of the file is a multiple of the page size.
+
+For example, if the register ioctl() gives back a status_bit of 3 you would
+check byte 0 (3 / 8) of the returned mmap data and then AND the result with 8
+(1 << (3 % 8)) to see if anything is attached to that event.
+
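+The status_page pointer used in the checks below can be obtained by mapping
+the status file read-only; a minimal sketch, assuming the single-page size
+described above::
+
+  #include <fcntl.h>
+  #include <sys/mman.h>
+  #include <unistd.h>
+
+  char *status_page;
+  int status_fd;
+
+  status_fd = open("/sys/kernel/debug/tracing/user_events_status", O_RDONLY);
+
+  /* the file size is a multiple of the page size; one page is assumed here */
+  status_page = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED,
+                     status_fd, 0);
+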
+A byte-wise index check is performed as follows::
+
+ int index, mask;
+ char *status_page;
+
+ index = status_bit / 8;
+ mask = 1 << (status_bit % 8);
+
+ ...
+
+ if (status_page[index] & mask) {
+ /* Enabled */
+ }
+
+A long-wise index check is performed as follows::
+
+ #include <asm/bitsperlong.h>
+ #include <endian.h>
+
+ #if __BITS_PER_LONG == 64
+ #define endian_swap(x) htole64(x)
+ #else
+ #define endian_swap(x) htole32(x)
+ #endif
+
+ long index, mask, *status_page;
+
+ index = status_bit / __BITS_PER_LONG;
+ mask = 1L << (status_bit % __BITS_PER_LONG);
+ mask = endian_swap(mask);
+
+ ...
+
+ if (status_page[index] & mask) {
+ /* Enabled */
+ }
+
+Administrators can easily check the status of all registered events by reading
+the user_events_status file directly via a terminal. The output is as follows::
+
+ Byte:Name [# Comments]
+ ...
+
+ Active: ActiveCount
+ Busy: BusyCount
+ Max: MaxCount
+
+For example, on a system that has a single event the output looks like this::
+
+ 1:test
+
+ Active: 1
+ Busy: 0
+ Max: 32768
+
+If a user enables the user event via ftrace, the output would change to this::
+
+ 1:test # Used by ftrace
+
+ Active: 1
+ Busy: 1
+ Max: 32768
+
+**NOTE:** *A status bit of 0 will never be returned. This allows user programs
+to have a bit that can be used on error cases.*
+
+Writing Data
+------------
+After registering an event the same fd that was used to register can be used
+to write an entry for that event. The write_index returned must be at the start
+of the data, then the remaining data is treated as the payload of the event.
+
+For example, if the write_index returned was 1 and I wanted to write out an
+int payload for the event, then the data would have to be 8 bytes (2 ints) in
+size, with the first 4 bytes being equal to 1 and the last 4 bytes being equal
+to the value I want as the payload.
+
+In memory this would look like this::
+
+ int index;
+ int payload;
+
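+One way to emit such an entry is to pack the index and the payload together
+and issue a single write(); a minimal sketch, reusing the reg and data_fd
+names from the registration sketch above::
+
+  struct {
+      int index;   /* must be the write_index returned at registration */
+      int payload; /* the event data itself */
+  } entry;
+
+  entry.index = reg.write_index;
+  entry.payload = 42; /* arbitrary example value */
+
+  write(data_fd, &entry, sizeof(entry));
+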
+User programs might have well-known structs that they wish to emit as
+payloads. In those cases writev() can be used, with the first vector being
+the index and the following vector(s) being the actual event payload.
+
+For example, if I have a struct like this::
+
+ struct payload {
+ int src;
+ int dst;
+ int flags;
+ };
+
+It's advised for user programs to do the following::
+
+ struct iovec io[2];
+ struct payload e;
+
+ io[0].iov_base = &write_index;
+ io[0].iov_len = sizeof(write_index);
+ io[1].iov_base = &e;
+ io[1].iov_len = sizeof(e);
+
+ writev(fd, (const struct iovec*)io, 2);
+
+**NOTE:** *The write_index is not emitted out into the trace being recorded.*
+
+Example Code
+------------
+See sample code in samples/user_events.