aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf (follow)
AgeCommit message (Collapse)AuthorFilesLines
2024-03-21Merge tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-7/+0
Pull networking fixes from Jakub Kicinski: "Including fixes from CAN, netfilter, wireguard and IPsec. I'd like to highlight [ lowlight? - Linus ] Florian W stepping down as a netfilter maintainer due to constant stream of bug reports. Not sure what we can do but IIUC this is not the first such case. Current release - regressions: - rxrpc: fix use of page_frag_alloc_align(), it changed semantics and we added a new caller in a different subtree - xfrm: allow UDP encapsulation only in offload modes Current release - new code bugs: - tcp: fix refcnt handling in __inet_hash_connect() - Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets", conflicted with some expectations in BPF uAPI Previous releases - regressions: - ipv4: raw: fix sending packets from raw sockets via IPsec tunnels - devlink: fix devlink's parallel command processing - veth: do not manipulate GRO when using XDP - esp: fix bad handling of pages from page_pool Previous releases - always broken: - report RCU QS for busy network kthreads (with Paul McK's blessing) - tcp/rds: fix use-after-free on netns with kernel TCP reqsk - virt: vmxnet3: fix missing reserved tailroom with XDP Misc: - couple of build fixes for Documentation" * tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits) selftests: forwarding: Fix ping failure due to short timeout MAINTAINERS: step down as netfilter maintainer netfilter: nf_tables: Fix a memory leak in nf_tables_updchain net: dsa: mt7530: fix handling of all link-local frames net: dsa: mt7530: fix link-local frames that ingress vlan filtering ports bpf: report RCU QS in cpumap kthread net: report RCU QS on threaded NAPI repolling rcu: add a helper to report consolidated flavor QS ionic: update documentation for XDP support lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc netfilter: nf_tables: do not compare internal table flags on updates netfilter: nft_set_pipapo: release elements in clone only from destroy path octeontx2-af: Use separate handlers for interrupts octeontx2-pf: Send UP messages to VF only when VF is up. octeontx2-pf: Use default max_active works instead of one octeontx2-pf: Wait till detach_resources msg is complete octeontx2: Detect the mbox up or down message via register devlink: fix port new reply cmd type tcp: Clear req->syncookie in reqsk_alloc(). net/bnx2x: Prevent access to a freed page in page_pool ...
2024-03-14net: remove {revc,send}msg_copy_msghdr() from exportsJens Axboe1-7/+0
The only user of these was io_uring, and it's not using them anymore. Make them static and remove them from the socket header file. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/1b6089d3-c1cf-464a-abd3-b0f0b6bb2523@kernel.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-06perf annotate: Add comments in the data structuresNamhyung Kim1-7/+62
Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240304230815.1440583-5-namhyung@kernel.org
2024-03-06perf annotate: Remove sym_hist.addr[] arrayNamhyung Kim2-34/+6
It's not used anymore and the code is coverted to use a hash map. Now sym_hist has a static size, so no need to have sizeof_sym_hist in the struct annotated_source. Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240304230815.1440583-4-namhyung@kernel.org
2024-03-06perf annotate: Calculate instruction overhead using hashmapNamhyung Kim3-17/+52
Use annotated_source.samples hashmap instead of addr array in the struct sym_hist. Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240304230815.1440583-3-namhyung@kernel.org
2024-03-06perf annotate: Add a hashmap for symbol histogramNamhyung Kim2-2/+42
Now symbol histogram uses an array to save per-offset sample counts. But it wastes a lot of memory if the symbol has a few samples only. Add a hashmap to save values only for actual samples. For now, it has duplicate histogram (one in the existing array and another in the new hash map). Once it can convert to use the hash in all places, we can get rid of the array later. Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240304230815.1440583-2-namhyung@kernel.org
2024-03-03perf threads: Reduce table size from 256 to 8Ian Rogers1-1/+1
The threads data structure is an array of hashmaps, previously rbtrees. The two levels allows for a fixed outer array where access is guarded by rw_semaphores. Commit 91e467bc568f ("perf machine: Use hashtable for machine threads") sized the outer table at 256 entries to avoid future scalability problems, however, this means the threads struct is sized at 30,720 bytes. As the hashmaps allow O(1) access for the common find/insert/remove operations, lower the number of entries to 8. This reduces the size overhead to 960 bytes. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-8-irogers@google.com
2024-03-03perf threads: Switch from rbtree to hashmapIan Rogers2-105/+47
The rbtree provides a sorting on entries but this is unused. Switch to using hashmap for O(1) rather than O(log n) find/insert/remove complexity. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-7-irogers@google.com
2024-03-03perf threads: Move threads to its own filesIan Rogers5-273/+285
Move threads out of machine and into its own file. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-6-irogers@google.com
2024-03-03perf machine: Move machine's threads into its own abstractionIan Rogers5-203/+243
Move thread_rb_node into the machine.c file. This hides the implementation of threads from the rest of the code allowing for it to be refactored. Locking discipline is tightened up in this change. As the lock is now encapsulated in threads, the findnew function requires holding it (as it already did in machine). Rather than do conditionals with locks based on whether the thread should be created (which could potentially be error prone with a read lock match with a write unlock), have a separate threads__find that won't create the thread and only holds the read lock. This effectively duplicates the findnew logic, with the existing findnew logic only operating under a write lock assuming creation is necessary as a previous find failed. The creation may still fail with the write lock due to another thread. The duplication is removed in a later next patch that delegates the implementation to hashtable. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-5-irogers@google.com
2024-03-03perf machine: Move fprintf to for_each loop and a callbackIan Rogers1-16/+27
Avoid exposing the threads data structure by switching to the callback machine__for_each_thread approach. machine__fprintf is only used in tests and verbose >3 output so don't turn to list and sort. Add machine__threads_nr to be refactored later. Note, all existing *_fprintf routines ignore fprintf errors. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-4-irogers@google.com
2024-03-03perf trace: Ignore thread hashing in summaryIan Rogers2-23/+23
Commit 91e467bc568f ("perf machine: Use hashtable for machine threads") made the iteration of thread tids unordered. The perf trace --summary output sorts and prints each hash bucket, rather than all threads globally. Change this behavior by turn all threads into a list, sort the list by number of trace events then by tids, finally print the list. This also allows the rbtree in threads to be not accessed outside of machine. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-3-irogers@google.com
2024-03-03perf report: Sort child tasks by tidIan Rogers3-89/+168
Commit 91e467bc568f ("perf machine: Use hashtable for machine threads") made the iteration of thread tids unordered. The perf report --tasks output now shows child threads in an order determined by the hashing. For example, in this snippet tid 3 appears after tid 256 even though they have the same ppid 2: ``` $ perf report --tasks % pid tid ppid comm 0 0 -1 |swapper 2 2 0 | kthreadd 256 256 2 | kworker/12:1H-k 693761 693761 2 | kworker/10:1-mm 1301762 1301762 2 | kworker/1:1-mm_ 1302530 1302530 2 | kworker/u32:0-k 3 3 2 | rcu_gp ... ``` The output is easier to read if threads appear numerically increasing. To allow for this, read all threads into a list then sort with a comparator that orders by the child task's of the first common parent. The list creation and deletion are created as utilities on machine. The indentation is possible by counting the number of parents a child has. With this change the output for the same data file is now like: ``` $ perf report --tasks % pid tid ppid comm 0 0 -1 |swapper 1 1 0 | systemd 823 823 1 | systemd-journal 853 853 1 | systemd-udevd 3230 3230 1 | systemd-timesyn 3236 3236 1 | auditd 3239 3239 3236 | audisp-syslog 3321 3321 1 | accounts-daemon ... ``` Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301053646.1449657-2-irogers@google.com
2024-03-03perf vendor events amd: Fix Zen 4 cache latency eventsSandipan Das2-0/+60
L3PMCx0AC and L3PMCx0AD, used in l3_xi_sampled_latency* events, have a quirk that requires them to be programmed with SliceId set to 0x3. Without this, the events do not count at all and affects dependent metrics such as l3_read_miss_latency. If ThreadMask is not specified, the amd-uncore driver internally sets ThreadMask to 0x3, EnAllCores to 0x1 and EnAllSlices to 0x1 but does not set SliceId. Since SliceId must also be set to 0x3 in this case, specify all the other fields explicitly. E.g. $ sudo perf stat -e l3_xi_sampled_latency.all,l3_xi_sampled_latency_requests.all -a sleep 1 Before: Performance counter stats for 'system wide': 0 l3_xi_sampled_latency.all 0 l3_xi_sampled_latency_requests.all 1.005155399 seconds time elapsed After: Performance counter stats for 'system wide': 921,446 l3_xi_sampled_latency.all 54,210 l3_xi_sampled_latency_requests.all 1.005664472 seconds time elapsed Fixes: 5b2ca349c313 ("perf vendor events amd: Add Zen 4 uncore events") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: ananth.narayan@amd.com Cc: ravi.bangoria@amd.com Cc: eranian@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301084431.646221-1-sandipan.das@amd.com
2024-03-03perf version: Display availability of OpenCSD supportJames Clark1-0/+1
This is useful for scripts that work with Perf and ETM trace. Rather than them trying to parse Perf's error output at runtime to see if it was linked or not. Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: al.grant@arm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240301133829.346286-1-james.clark@arm.com
2024-02-29perf vendor events intel: Add umasks/occ_sel to PCU events.Ian Rogers9-0/+27
UMasks were being dropped leading to all PCU UNC_P_POWER_STATE_OCCUPANCY events having the same encoding. Don't drop the umask trying to be consistent with other sources of events like libpfm4 [1]. Older models need to use occ_sel rather than umask, correct these values too. This applies the change from [2]. [1] https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_skx_unc_pcu_events.h#l30 [2] https://github.com/captain5050/perfmon/commit/cbd4aee81023e5bfa09677b1ce170ff69e9c423d Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240228170529.4035675-1-irogers@google.com
2024-02-29perf map: Fix map reference count issuesIan Rogers2-10/+8
The find will get the map, ensure puts are done on all paths. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240229062048.558799-1-irogers@google.com
2024-02-29perf lock contention: Account contending locks tooNamhyung Kim3-7/+136
Currently it accounts the contention using delta between timestamps in lock:contention_begin and lock:contention_end tracepoints. But it means the lock should see the both events during the monitoring period. Actually there are 4 cases that happen with the monitoring: monitoring period / \ | | 1: B------+-----------------------+--------E 2: B----+-------------E | 3: | B-----------+----E 4: | B-------------E | | | t0 t1 where B and E mean contention BEGIN and END, respectively. So it only accounts the case 4 for now. It seems there's no way to handle the case 1. The case 2 might be handled if it saved the timestamp (t0), but it lacks the information from the B notably the flags which shows the lock types. Also it could be a nested lock which it currently ignores. So I think we should ignore the case 2. However we can handle the case 3 if we save the timestamp (t1) at the end of the period. And then it can iterate the map entries in the userspace and update the lock stat accordinly. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Reviwed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Song Liu <song@kernel.org> Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20240228053335.312776-1-namhyung@kernel.org
2024-02-29perf metrics: Fix segv for metrics with no eventsIan Rogers1-1/+1
A metric may have no events, for example, the transaction metrics on x86 are dependent on there being TSX events. Fix a segv where an evsel of NULL is dereferenced for a metric leader value. Fixes: a59fb796a36b ("perf metrics: Compute unmerged uncore metrics individually") Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240224011420.3066322-2-irogers@google.com
2024-02-29perf metrics: Fix metric matchingIan Rogers1-12/+10
The metric match function fails for cases like looking for "metric" in the string "all;foo_metric;metric" as the "metric" in "foo_metric" matches but isn't preceeded by a ';'. Fix this by matching the first list item and recursively matching on failure the next item after a semicolon. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240224011420.3066322-1-irogers@google.com
2024-02-26perf pmu: Fix a potential memory leak in perf_pmu__lookup()Christophe JAILLET1-4/+3
The commit in Fixes has reordered some code, but missed an error handling path. 'goto err' now, in order to avoid a memory leak in case of error. Fixes: f63a536f03a2 ("perf pmu: Merge JSON events with sysfs at load time") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Ian Rogers <irogers@google.com> Cc: kernel-janitors@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/9538b2b634894c33168dfe9d848d4df31fd4d801.1693085544.git.christophe.jaillet@wanadoo.fr
2024-02-26perf test: Fix spelling mistake "curent" -> "current"Colin Ian King1-1/+1
There is a spelling mistake in a pr_debug message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: kernel-janitors@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240226105326.3944887-1-colin.i.king@gmail.com
2024-02-26perf test: Use TEST_FAIL in the TEST_ASSERT macros instead of -1Arnaldo Carvalho de Melo1-8/+8
Just to make things clearer, return TEST_FAIL (-1) instead of an open coded -1. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/ZdepeMsjagbf1ufD@x1
2024-02-26perf data convert: Fix segfault when converting to json when cpu_desc isn't setIlkka Koskinen1-1/+3
Arm64 doesn't have Model in /proc/cpuinfo and, thus, cpu_desc doesn't get assigned. Running $ perf data convert --to-json perf.data.json ends up calling output_json_string() with NULL pointer, which causes a segmentation fault. Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Evgeny Pistun <kotborealis@awooo.ru> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240223220458.15282-1-ilkka@os.amperecomputing.com
2024-02-26perf bpf: Check that the minimal vmlinux.h installed is the latest oneArnaldo Carvalho de Melo1-1/+1
When building BPF skels perf will, by default, install a minimalistic vmlinux.h file with the types needed by the BPF skels in tools/perf/util/bpf_skel/ in its build directory. When 29d16de26df17e94 ("perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h") was added, a type used in the augmented_raw_syscalls BPF skel, 'struct timespec64' was not found when building from a pre-existing build directory, because the vmlinux.h there didn't contain that type, ending up with this error, spotted in linux-next: CLANG /tmp/build/perf-tools-next/util/bpf_skel/.tmp/augmented_raw_syscalls.bpf.o util/bpf_skel/augmented_raw_syscalls.bpf.c:329:15: error: invalid application of 'sizeof' to an incomplete type 'struct timespec64' 329 | __u32 size = sizeof(struct timespec64); | ^ ~~~~~~~~~~~~~~~~~~~ util/bpf_skel/augmented_raw_syscalls.bpf.c:329:29: note: forward declaration of 'struct timespec64' 329 | __u32 size = sizeof(struct timespec64); | ^ util/bpf_skel/augmented_raw_syscalls.bpf.c:350:15: error: invalid application of 'sizeof' to an incomplete type 'struct timespec64' 350 | __u32 size = sizeof(struct timespec64); | ^ ~~~~~~~~~~~~~~~~~~~ util/bpf_skel/augmented_raw_syscalls.bpf.c:350:29: note: forward declaration of 'struct timespec64' 350 | __u32 size = sizeof(struct timespec64); | ^ 2 errors generated. make[2]: *** [Makefile.perf:1158: /tmp/build/perf-tools-next/util/bpf_skel/.tmp/augmented_raw_syscalls.bpf.o] Error 1 make[2]: *** Waiting for unfinished jobs.... make[1]: *** [Makefile.perf:261: sub-make] Error 2 make: *** [Makefile:113: install-bin] Error 2 make: Leaving directory '/home/acme/git/perf-tools-next/tools/perf' So add a Makefile dependency (Namhyung's suggestion) to make sure that the new tools/perf/util/bpf_skel/vmlinux/vmlinux.h minimal vmlinux is updated in the build directory, providing the moved 'struct timespec64' type. Fixes: 29d16de26df17e94 ("perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Ian Rogers <irogers@google.com> Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/ZdoPrWg-qYFpBJbz@x1
2024-02-23treewide: remove meaningless assignments in MakefilesMasahiro Yamada8-53/+53
In Makefiles, $(error ), $(warning ), and $(info ) expand to the empty string, as explained in the GNU Make manual [1]: "The result of the expansion of this function is the empty string." Therefore, they are no-op except for logging purposes. $(shell ...) expands to the output of the command. It expands to the empty string when the command does not print anything to stdout. Hence, $(shell mkdir ...) is no-op except for creating the directory. Remove meaningless assignments. [1]: https://www.gnu.org/software/make/manual/make.html#Make-Control-Functions Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20240221134201.2656908-1-masahiroy@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-kbuild@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org
2024-02-23perf print-events: make is_event_supported() more robustMark Rutland1-8/+19
Currently the perf tool doesn't detect support for extended event types on Apple M1/M2 systems, and will not auto-expand plain PERF_EVENT_TYPE hardware events into per-PMU events. This is due to the detection of extended event types not handling mandatory filters required by the M1/M2 PMU driver. PMU drivers and the core perf_events code can require that perf_event_attr::exclude_* filters are configured in a specific way and may reject certain configurations of filters, for example: (a) Many PMUs lack support for any event filtering, and require all perf_event_attr::exclude_* bits to be clear. This includes Alpha's CPU PMU, and ARM CPU PMUs prior to the introduction of PMUv2 in ARMv7, (b) When /proc/sys/kernel/perf_event_paranoid >= 2, the perf core requires that perf_event_attr::exclude_kernel is set. (c) The Apple M1/M2 PMU requires that perf_event_attr::exclude_guest is set as the hardware PMU does not count while a guest is running (but might be extended in future to do so). In is_event_supported(), we try to account for cases (a) and (b), first attempting to open an event without any filters, and if this fails, retrying with perf_event_attr::exclude_kernel set. We do not account for case (c), or any other filters that drivers could theoretically require to be set. Thus is_event_supported() will fail to detect support for any events targeting an Apple M1/M2 PMU, even where events would be supported with perf_event_attr:::exclude_guest set. Since commit: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type") ... we use is_event_supported() to detect support for extended types, with the PMU ID encoded into the perf_event_attr::type. As above, on an Apple M1/M2 system this will always fail to detect that the event is supported, and consequently we fail to detect support for extended types even when these are supported, as they have been since commit: 5c816728651ae425 ("arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability") Due to this, the perf tool will not automatically expand plain PERF_TYPE_HARDWARE events into per-PMU events, even when all the necessary kernel support is present. This patch updates is_event_supported() to additionally try opening events with perf_event_attr::exclude_guest set, allowing support for events to be detected on Apple M1/M2 systems. I believe that this is sufficient for all contemporary CPU PMU drivers, though in future it may be necessary to check for other combinations of filter bits. I've deliberately changed the check to not expect a specific error code for missing filters, as today ;the kernel may return a number of different error codes for missing filters (e.g. -EACCESS, -EINVAL, or -EOPNOTSUPP) depending on why and where the filter configuration is rejected, and retrying for any error is more robust. Note that this does not remove the need for commit: a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON") ... which is still necessary so that named-pmu/event/ events work on kernels without extended type support, even if the event name happens to be the same as a PERF_EVENT_TYPE_HARDWARE event (e.g. as is the case for the M1/M2 PMU's 'cycles' and 'instructions' events). Fixes: 82fe2e45cdb00de4 ("perf pmus: Check if we can encode the PMU number in perf_event_attr.type") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Ian Rogers <irogers@google.com> Tested-by: James Clark <james.clark@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Cc: Hector Martin <marcan@marcan.st> Cc: James Clark <james.clark@arm.com> Cc: John Garry <john.g.garry@oracle.com> Cc: Leo Yan <leo.yan@linux.dev> Cc: Mike Leach <mike.leach@linaro.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240126145605.1005472-1-mark.rutland@arm.com
2024-02-22perf tests: Add option to run tests in parallelIan Rogers1-99/+215
By default tests are forked, add an option (-p or --parallel) so that the forked tests are all started in parallel and then their output gathered serially. This is opt-in as running in parallel can cause test flakes. Rather than fork within the code, the start_command/finish_command from libsubcmd are used. This changes how stderr and stdout are handled. The child stderr and stdout are always read to avoid the child blocking. If verbose is 1 (-v) then if the test fails the child stdout and stderr are displayed. If the verbose is >1 (e.g. -vv) then the stdout and stderr from the child are immediately displayed. An unscientific test on my laptop shows the wall clock time for perf test without parallel being 5 minutes 21 seconds and with parallel (-p) being 1 minute 50 seconds. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-9-irogers@google.com
2024-02-22perf tests: Run time generate shell test suitesIan Rogers3-149/+80
Rather than special shell test logic, do a single pass to create an array of test suites. Hold the shell test file name in the test suite priv field. This makes the special shell test logic in builtin-test.c redundant so remove it. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-8-irogers@google.com
2024-02-22perf tests: Use scandirat for shell script findingIan Rogers3-71/+95
Avoid filename appending buffers by using openat, faccessat and scandirat more widely. Turn the script's path back to a file name using readlink from /proc/<pid>/fd/<fd>. Read the script's description using api/io.h to avoid fdopen conversions. Whilst reading perform additional sanity checks on the script's contents. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-7-irogers@google.com
2024-02-22perf test: Rename builtin-test-list and add missed header guardIan Rogers4-3/+7
builtin-test-list is primarily concerned with shell script tests. Rename the file to better reflect this and add a missed header guard. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-6-irogers@google.com
2024-02-22perf tests: Avoid fork in perf_has_symbol testIan Rogers1-1/+1
perf test -vv Symbols is used to indentify symbols within the perf binary. Add the -F flag so that the test command doesn't fork the test before running. This removes a little overhead. Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-4-irogers@google.com
2024-02-22perf list: Add scandirat compatibility functionIan Rogers3-9/+31
scandirat is used during the printing of tracepoint events but may be missing from certain libcs. Add a compatibility implementation that uses the symlink of an fd in /proc as a path for the reliably present scandir. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-3-irogers@google.com
2024-02-22perf thread_map: Skip exited threads when scanning /procIan Rogers1-5/+4
Scanning /proc is inherently racy. Scanning /proc/pid/task within that is also racy as the pid can terminate. Rather than failing in __thread_map__new_all_cpus, skip pids for such failures. Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Yang Jihong <yangjihong1@huawei.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: llvm@lists.linux.dev Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221034155.1500118-2-irogers@google.com
2024-02-22perf list: fix short description for some cache eventsThomas Richter1-31/+31
Correct the short description of the following events: DCW_REQ, DCW_REQ_CHIP_HIT, DCW_REQ_DRAWER_HIT, DCW_REQ_IV, DCW_ON_CHIP, DCW_ON_CHIP_IV, DCW_ON_CHIP_CHIP_HIT, DCW_ON_CHIP_DRAWER_HIT, CW_ON_MODULE, DCW_ON_DRAWER, DCW_OFF_DRAWER, IDCW_ON_MODULE_IV, IDCW_ON_MODULE_CHIP_HIT, IDCW_ON_MODULE_DRAWER_HIT, IDCW_ON_DRAWER_IV, IDCW_ON_DRAWER_CHIP_HIT, IDCW_ON_DRAWER_DRAWER_HIT, IDCW_OFF_DRAWER_IV, IDCW_OFF_DRAWER_CHIP_HIT, IDCW_OFF_DRAWER_DRAWER_HIT, ICW_REQ, ICW_REQ_IV, CW_REQ_CHIP_HIT, ICW_REQ_DRAWER_HIT, ICW_ON_CHIP, ICW_ON_CHIP_IV, ICW_ON_CHIP_CHIP_HIT, ICW_ON_CHIP_DRAWER_HIT, ICW_ON_MODULE and ICW_OFF_DRAWER. The second Cache should be L2-Cache. Output before (display diff of the first four events) # perf list -d DCW_REQ [Directory Write Level 1 Data Cache from Cache. Unit: cpum_cf] DCW_REQ_CHIP_HIT [Directory Write Level 1 Data Cache from Cache with Chip HP \ Hit. Unit: cpum_cf] DCW_REQ_DRAWER_HIT [Directory Write Level 1 Data Cache from Cache with Drawer \ HP Hit. Unit: cpum_cf] DCW_REQ_IV [Directory Write Level 1 Data Cache from Cache with Intervention. \ Unit: cpum_cf] Output after: # perf list -d DCW_REQ [Directory Write Level 1 Data Cache from L2-Cache. Unit: cpum_cf] DCW_REQ_CHIP_HIT [Directory Write Level 1 Data Cache from L2-Cache with Chip HP \ Hit. Unit: cpum_cf] DCW_REQ_DRAWER_HIT [Directory Write Level 1 Data Cache from L2-Cache with Drawer \ HP Hit. Unit: cpum_cf] DCW_REQ_IV [Directory Write Level 1 Data Cache from L2-Cache with \ Intervention. Unit: cpum_cf] Fixes: 7f76b3113068 ("perf list: Add IBM z16 event description for s390") Reported-by: Andreas Krebbel <krebbel@linux.ibm.com> Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Andreas Krebbel <krebbel@linux.ibm.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: gor@linux.ibm.com Cc: hca@linux.ibm.com Cc: sumanthk@linux.ibm.com Cc: svens@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221091908.1759083-1-tmricht@linux.ibm.com
2024-02-22perf stat: Fix metric-only aggregation indexIan Rogers1-2/+7
Aggregation index was being computed using the evsel's cpumap which may have a different (typically the same or fewer) entries. Before: ``` $ perf stat --metric-only -A -M memory_bandwidth_total -a sleep 1 Performance counter stats for 'system wide': MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total CPU0 12.8 0.0 12.9 12.7 0.0 12.6 CPU1 1.007806367 seconds time elapsed ``` After: ``` $ perf stat --metric-only -A -M memory_bandwidth_total -a sleep 1 Performance counter stats for 'system wide': MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total MB/s memory_bandwidth_total CPU0 15.4 0.0 15.3 15.0 0.0 14.9 CPU18 0.0 0.0 13.5 5.2 0.0 11.9 1.007858736 seconds time elapsed ``` Signed-off-by: Ian Rogers <irogers@google.com> | Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Kaige Ye <ye@kaige.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221070754.4163916-3-irogers@google.com
2024-02-22perf metrics: Compute unmerged uncore metrics individuallyIan Rogers2-4/+29
When merging counts from multiple uncore PMUs the metric is only computed for the metric leader. When merging/aggregation is disabled, prior to this patch just the leader's metric would be computed. Fix this by computing the metric for each PMU. On a SkylakeX: Before: ``` $ perf stat -A -M memory_bandwidth_total -a sleep 1 Performance counter stats for 'system wide': CPU0 82,217 UNC_M_CAS_COUNT.RD [uncore_imc_0] # 9.2 MB/s memory_bandwidth_total CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_0] # 0.0 MB/s memory_bandwidth_total CPU0 61,395 UNC_M_CAS_COUNT.WR [uncore_imc_0] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_0] CPU0 0 UNC_M_CAS_COUNT.RD [uncore_imc_1] CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_1] CPU0 0 UNC_M_CAS_COUNT.WR [uncore_imc_1] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_1] CPU0 81,570 UNC_M_CAS_COUNT.RD [uncore_imc_2] CPU18 113,886 UNC_M_CAS_COUNT.RD [uncore_imc_2] CPU0 62,330 UNC_M_CAS_COUNT.WR [uncore_imc_2] CPU18 66,942 UNC_M_CAS_COUNT.WR [uncore_imc_2] CPU0 75,489 UNC_M_CAS_COUNT.RD [uncore_imc_3] CPU18 27,958 UNC_M_CAS_COUNT.RD [uncore_imc_3] CPU0 55,864 UNC_M_CAS_COUNT.WR [uncore_imc_3] CPU18 38,727 UNC_M_CAS_COUNT.WR [uncore_imc_3] CPU0 0 UNC_M_CAS_COUNT.RD [uncore_imc_4] CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_4] CPU0 0 UNC_M_CAS_COUNT.WR [uncore_imc_4] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_4] CPU0 75,423 UNC_M_CAS_COUNT.RD [uncore_imc_5] CPU18 104,527 UNC_M_CAS_COUNT.RD [uncore_imc_5] CPU0 57,596 UNC_M_CAS_COUNT.WR [uncore_imc_5] CPU18 56,777 UNC_M_CAS_COUNT.WR [uncore_imc_5] CPU0 1,003,440,851 ns duration_time 1.003440851 seconds time elapsed ``` After: ``` $ perf stat -A -M memory_bandwidth_total -a sleep 1 Performance counter stats for 'system wide': CPU0 88,968 UNC_M_CAS_COUNT.RD [uncore_imc_0] # 9.5 MB/s memory_bandwidth_total CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_0] # 0.0 MB/s memory_bandwidth_total CPU0 59,498 UNC_M_CAS_COUNT.WR [uncore_imc_0] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_0] CPU0 0 UNC_M_CAS_COUNT.RD [uncore_imc_1] # 0.0 MB/s memory_bandwidth_total CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_1] # 0.0 MB/s memory_bandwidth_total CPU0 0 UNC_M_CAS_COUNT.WR [uncore_imc_1] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_1] CPU0 88,635 UNC_M_CAS_COUNT.RD [uncore_imc_2] # 9.5 MB/s memory_bandwidth_total CPU18 117,975 UNC_M_CAS_COUNT.RD [uncore_imc_2] # 11.5 MB/s memory_bandwidth_total CPU0 60,829 UNC_M_CAS_COUNT.WR [uncore_imc_2] CPU18 62,105 UNC_M_CAS_COUNT.WR [uncore_imc_2] CPU0 82,238 UNC_M_CAS_COUNT.RD [uncore_imc_3] # 8.7 MB/s memory_bandwidth_total CPU18 22,906 UNC_M_CAS_COUNT.RD [uncore_imc_3] # 3.6 MB/s memory_bandwidth_total CPU0 53,959 UNC_M_CAS_COUNT.WR [uncore_imc_3] CPU18 32,990 UNC_M_CAS_COUNT.WR [uncore_imc_3] CPU0 0 UNC_M_CAS_COUNT.RD [uncore_imc_4] # 0.0 MB/s memory_bandwidth_total CPU18 0 UNC_M_CAS_COUNT.RD [uncore_imc_4] # 0.0 MB/s memory_bandwidth_total CPU0 0 UNC_M_CAS_COUNT.WR [uncore_imc_4] CPU18 0 UNC_M_CAS_COUNT.WR [uncore_imc_4] CPU0 83,595 UNC_M_CAS_COUNT.RD [uncore_imc_5] # 8.9 MB/s memory_bandwidth_total CPU18 110,151 UNC_M_CAS_COUNT.RD [uncore_imc_5] # 10.5 MB/s memory_bandwidth_total CPU0 56,540 UNC_M_CAS_COUNT.WR [uncore_imc_5] CPU18 53,816 UNC_M_CAS_COUNT.WR [uncore_imc_5] CPU0 1,003,353,416 ns duration_time ``` Signed-off-by: Ian Rogers <irogers@google.com> | Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Kaige Ye <ye@kaige.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221070754.4163916-2-irogers@google.com
2024-02-22perf stat: Pass fewer metric argumentsIan Rogers1-20/+18
Pass metric_expr and evsel rather than specific variables from the struct, thereby reducing the number of arguments. This will enable later fixes. To reduce the size of the diff, local variables are added to match the previous parameter names. This isn't done in the case of "name" as evsel->name is more intention revealing. A whitespace issue is also addressed. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Kaige Ye <ye@kaige.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240221070754.4163916-1-irogers@google.com
2024-02-20perf: script: prefer capstone to XEDChangbin Du3-7/+11
Now perf can show assembly instructions with libcapstone for x86, and the capstone is better in general. Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: changbin.du@gmail.com Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240217074046.4100789-6-changbin.du@huawei.com
2024-02-20perf: script: add raw|disasm arguments to --insn-trace optionChangbin Du2-7/+22
Now '--insn-trace' accept a argument to specify the output format: - raw: display raw instructions. - disasm: display mnemonic instructions (if capstone is installed). $ sudo perf script --insn-trace=raw ls 1443864 [006] 2275506.209908875: 7f216b426100 _start+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) insn: 48 89 e7 ls 1443864 [006] 2275506.209908875: 7f216b426103 _start+0x3 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) insn: e8 e8 0c 00 00 ls 1443864 [006] 2275506.209908875: 7f216b426df0 _dl_start+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) insn: f3 0f 1e fa $ sudo perf script --insn-trace=disasm ls 1443864 [006] 2275506.209908875: 7f216b426100 _start+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) movq %rsp, %rdi ls 1443864 [006] 2275506.209908875: 7f216b426103 _start+0x3 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) callq _dl_start+0x0 ls 1443864 [006] 2275506.209908875: 7f216b426df0 _dl_start+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) illegal instruction ls 1443864 [006] 2275506.209908875: 7f216b426df4 _dl_start+0x4 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) pushq %rbp ls 1443864 [006] 2275506.209908875: 7f216b426df5 _dl_start+0x5 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) movq %rsp, %rbp ls 1443864 [006] 2275506.209908875: 7f216b426df8 _dl_start+0x8 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) pushq %r15 Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: changbin.du@gmail.com Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240217074046.4100789-5-changbin.du@huawei.com
2024-02-20perf: script: add field 'disasm' to display mnemonic instructionsChangbin Du2-7/+21
In addition to the 'insn' field, this adds a new field 'disasm' to display mnemonic instructions instead of the raw code. $ sudo perf script -F +disasm perf-exec 1443864 [006] 2275506.209848: psb: psb offs: 0 0 [unknown] ([unknown]) perf-exec 1443864 [006] 2275506.209848: cbr: cbr: 41 freq: 4100 MHz (114%) 0 [unknown] ([unknown]) ls 1443864 [006] 2275506.209905: 1 branches:uH: 7f216b426100 _start+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) movq %rsp, %rdi ls 1443864 [006] 2275506.209908: 1 branches:uH: 7f216b426103 _start+0x3 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) callq _dl_start+0x0 Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: changbin.du@gmail.com Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240217074046.4100789-4-changbin.du@huawei.com
2024-02-20perf: util: use capstone disasm engine to show assembly instructionsChangbin Du5-6/+155
Currently, the instructions of samples are shown as raw hex strings which are hard to read. x86 has a special option '--xed' to disassemble the hex string via intel XED tool. Here we use capstone as our disassembler engine to give more friendly instructions. We select libcapstone because capstone can provide more insn details. Perf will fallback to raw instructions if libcapstone is not available. The advantages compared to XED tool: * Support arm, arm64, x86-32, x86_64 (more could be supported), xed only for x86_64. * Immediate address operands are shown as symbol+offs. Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: changbin.du@gmail.com Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240217074046.4100789-3-changbin.du@huawei.com
2024-02-20perf: build: introduce the libcapstoneChangbin Du4-1/+28
Later we will use libcapstone to disassemble instructions of samples. Signed-off-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: changbin.du@gmail.com Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240217074046.4100789-2-changbin.du@huawei.com
2024-02-16perf list: For metricgroup only list include descriptionIan Rogers1-7/+14
If perf list is invoked with 'metricgroups' include the description unless it is invoked with flags to exclude it. Make the description of metricgroup dumping dependent on the desc flag in print_state as with metrics. Before: ``` $ perf list metricgroups List of pre-defined events (to be used in -e or -M): Metric Groups: Backend Bad BadSpec ... ``` After: ``` $ perf list metricgroups List of pre-defined events (to be used in -e or -M): Metric Groups: Backend [Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet] Bad [Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet] BadSpec ... ``` Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240216192044.119897-1-irogers@google.com
2024-02-16perf tools: Fixup module symbol end address properlyNamhyung Kim1-2/+19
I got a strange error on ARM to fail on processing FINISHED_ROUND record. It turned out that it was failing in symbol__alloc_hist() because the symbol size is too big. When a sample is captured on a specific BPF program, it failed. I've added a debug code and found the end address of the symbol is from the next module which is placed far way. ffff800008795778-ffff80000879d6d8: bpf_prog_1bac53b8aac4bc58_netcg_sock [bpf] ffff80000879d6d8-ffff80000ad656b4: bpf_prog_76867454b5944e15_netcg_getsockopt [bpf] ffff80000ad656b4-ffffd69b7af74048: bpf_prog_1d50286d2eb1be85_hn_egress [bpf] <---------- here ffffd69b7af74048-ffffd69b7af74048: $x.5 [sha3_generic] ffffd69b7af74048-ffffd69b7af740b8: crypto_sha3_init [sha3_generic] ffffd69b7af740b8-ffffd69b7af741e0: crypto_sha3_update [sha3_generic] The logic in symbols__fixup_end() just uses curr->start to update the prev->end. But in this case, it won't work as it's too different. I think ARM has a different kernel memory layout for modules and BPF than on x86. Actually there's a logic to handle kernel and module boundary. Let's do the same for symbols between different modules. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Leo Yan <leo.yan@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: Mike Leach <mike.leach@linaro.org> Cc: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240212233322.1855161-1-namhyung@kernel.org
2024-02-16perf vendor events intel: Update tigerlake TMA metrics to 4.7Ian Rogers2-157/+261
Top-Down Microarchitecture Analysis (TMA) metrics simplify cycle-accounting using microarchitecture-abstracted metrics organized in one hierarchy. This update is from version 4.5 to 4.7. The update includes: - tma_info_bottleneck* metrics, an abstraction or summarization of the 100+ TMA tree nodes into 12-entry familiar performance metrics. - Reduce number of events (multiplexing) for tma_info_system_gflops, tma_info_core_flopc, tma_info_inst_mix_ipflop and tma_ports_utilized_0. - Fixes for tma_info_bottleneck_mispredictions and tma_info_bad_spec_branch_misprediction_cost. - New tma_info_inst_mix_ippause metric. - tma_serializing_operation is raised to level 3. - Swapped tma_info_core_ilp (becomes per SMT thread) and tma_info_pipeline_execute (per physical core). - tma_nop_instructions and tma_shuffles_256b are lowered to level 4 under tma_other_light_ops_group. - Reduced number of events when SMT is off. - Tuned thresholds for tma_info_bottleneck_branching_overhead, tma_fetch_bandwidth and tma_ports_utilized_3m. The update came from: https://github.com/intel/perfmon/pull/140 https://github.com/intel/perfmon/pull/138 Running the script: https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Caleb Biggers <caleb.biggers@intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Perry Taylor <perry.taylor@intel.com> Cc: Samantha Alt <samantha.alt@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240214011820.644458-31-irogers@google.com
2024-02-16perf vendor events intel: Update skylakex TMA metrics to 4.7Ian Rogers2-168/+392
Top-Down Microarchitecture Analysis (TMA) metrics simplify cycle-accounting using microarchitecture-abstracted metrics organized in one hierarchy. This update is from version 4.5 to 4.7. The update includes: - tma_info_bottleneck* metrics, an abstraction or summarization of the 100+ TMA tree nodes into 12-entry familiar performance metrics. - Reduce number of events (multiplexing) for tma_info_system_gflops, tma_info_core_flopc, tma_info_inst_mix_ipflop and tma_ports_utilized_0. - Fixes for tma_info_bottleneck_mispredictions and tma_info_bad_spec_branch_misprediction_cost. - tma_serializing_operation is raised to level 3. - Swapped tma_info_core_ilp (becomes per SMT thread) and tma_info_pipeline_execute (per physical core). - tma_nop_instructions and tma_shuffles_256b are lowered to level 4 under tma_other_light_ops_group. - Reduced number of events when SMT is off. - Tuned thresholds for tma_info_bottleneck_branching_overhead, tma_fetch_bandwidth and tma_ports_utilized_3m. The update came from: https://github.com/intel/perfmon/pull/140 https://github.com/intel/perfmon/pull/138 Running the script: https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Caleb Biggers <caleb.biggers@intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Perry Taylor <perry.taylor@intel.com> Cc: Samantha Alt <samantha.alt@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240214011820.644458-30-irogers@google.com
2024-02-16perf vendor events intel: Update skylake TMA metrics to 4.7Ian Rogers2-161/+246
Top-Down Microarchitecture Analysis (TMA) metrics simplify cycle-accounting using microarchitecture-abstracted metrics organized in one hierarchy. This update is from version 4.5 to 4.7. The update includes: - tma_info_bottleneck* metrics, an abstraction or summarization of the 100+ TMA tree nodes into 12-entry familiar performance metrics. - Reduce number of events (multiplexing) for tma_info_system_gflops, tma_info_core_flopc, tma_info_inst_mix_ipflop and tma_ports_utilized_0. - Fixes for tma_info_bottleneck_mispredictions and tma_info_bad_spec_branch_misprediction_cost. - tma_serializing_operation is raised to level 3. - Swapped tma_info_core_ilp (becomes per SMT thread) and tma_info_pipeline_execute (per physical core). - tma_nop_instructions and tma_shuffles_256b are lowered to level 4 under tma_other_light_ops_group. - Reduced number of events when SMT is off. - Tuned thresholds for tma_info_bottleneck_branching_overhead, tma_fetch_bandwidth and tma_ports_utilized_3m. The update came from: https://github.com/intel/perfmon/pull/140 https://github.com/intel/perfmon/pull/138 Running the script: https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Caleb Biggers <caleb.biggers@intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Perry Taylor <perry.taylor@intel.com> Cc: Samantha Alt <samantha.alt@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240214011820.644458-29-irogers@google.com
2024-02-16perf vendor events intel: Update sapphirerapids TMA metrics to 4.7Ian Rogers2-221/+564
Top-Down Microarchitecture Analysis (TMA) metrics simplify cycle-accounting using microarchitecture-abstracted metrics organized in one hierarchy. This update is from version 4.5 to 4.7. The update includes: - tma_info_bottleneck* metrics, an abstraction or summarization of the 100+ TMA tree nodes into 12-entry familiar performance metrics. - tma_c01_wait and tma_c02_wait metrics measure power-performance states. - Reduce number of events (multiplexing) for tma_info_system_gflops, tma_info_core_flopc, tma_info_inst_mix_ipflop and tma_ports_utilized_0. - Fixes for tma_info_bottleneck_mispredictions and tma_info_bad_spec_branch_misprediction_cost. - New tma_info_inst_mix_ippause metric. - tma_serializing_operation is raised to level 3. - Swapped tma_info_core_ilp (becomes per SMT thread) and tma_info_pipeline_execute (per physical core). - tma_nop_instructions and tma_shuffles_256b are lowered to level 4 under tma_other_light_ops_group. - Reduced number of events when SMT is off. - Tuned thresholds for tma_info_bottleneck_branching_overhead, tma_fetch_bandwidth and tma_ports_utilized_3m. The update came from: https://github.com/intel/perfmon/pull/140 https://github.com/intel/perfmon/pull/138 Running the script: https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Caleb Biggers <caleb.biggers@intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Perry Taylor <perry.taylor@intel.com> Cc: Samantha Alt <samantha.alt@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240214011820.644458-28-irogers@google.com
2024-02-16perf vendor events intel: Update sandybridge TMA metrics to 4.7Ian Rogers2-32/+46
Top-Down Microarchitecture Analysis (TMA) metrics simplify cycle-accounting using microarchitecture-abstracted metrics organized in one hierarchy. This update is from version 4.5 to 4.7. The update includes: - Add metrics tma_fp_vector_128b, tma_fp_vector_256b and tma_info_system_cpus_utilized. - Remove metrics tma_info_system_mem_parallel_requests, tma_info_system_core_frequency and tma_info_system_mem_request_latency. - Swapped tma_info_core_ilp (becomes per SMT thread) and tma_info_pipeline_execute (per physical core). - Tuned thresholds for tma_fetch_bandwidth. The update came from: https://github.com/intel/perfmon/pull/140 https://github.com/intel/perfmon/pull/138 Running the script: https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Caleb Biggers <caleb.biggers@intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Perry Taylor <perry.taylor@intel.com> Cc: Samantha Alt <samantha.alt@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240214011820.644458-27-irogers@google.com