aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2021-04-19perf/x86: Register hybrid PMUsKan Liang3-21/+223
Different hybrid PMUs have different PMU capabilities and events. Perf should registers a dedicated PMU for each of them. To check the X86 event, perf has to go through all possible hybrid pmus. All the hybrid PMUs are registered at boot time. Before the registration, add intel_pmu_check_hybrid_pmus() to check and update the counters information, the event constraints, the extra registers and the unique capabilities for each hybrid PMUs. Postpone the display of the PMU information and HW check to CPU_STARTING, because the boot CPU is the only online CPU in the init_hw_perf_events(). Perf doesn't know the availability of the other PMUs. Perf should display the PMU information only if the counters of the PMU are available. One type of CPUs may be all offline. For this case, users can still observe the PMU in /sys/devices, but its CPU mask is 0. All hybrid PMUs have capability PERF_PMU_CAP_HETEROGENEOUS_CPUS. The PMU name for hybrid PMUs will be "cpu_XXX", which will be assigned later in a separated patch. The PMU type id for the core PMU is still PERF_TYPE_RAW. For the other hybrid PMUs, the PMU type id is not hard code. The event->cpu must be compatitable with the supported CPUs of the PMU. Add a check in the x86_pmu_event_init(). The events in a group must be from the same type of hybrid PMU. The fake cpuc used in the validation must be from the supported CPU of the event->pmu. Perf may not retrieve a valid core type from get_this_hybrid_cpu_type(). For example, ADL may have an alternative configuration. With that configuration, Perf cannot retrieve the core type from the CPUID leaf 0x1a. Add a platform specific get_hybrid_cpu_type(). If the generic way fails, invoke the platform specific get_hybrid_cpu_type(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-17-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Factor out x86_pmu_show_pmu_capKan Liang2-9/+19
The PMU capabilities are different among hybrid PMUs. Perf should dump the PMU capabilities information for each hybrid PMU. Factor out x86_pmu_show_pmu_cap() which shows the PMU capabilities information. The function will be reused later when registering a dedicated hybrid PMU. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-16-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Remove temporary pmu assignment in event_initKan Liang1-11/+0
The temporary pmu assignment in event_init is unnecessary. The assignment was introduced by commit 8113070d6639 ("perf_events: Add fast-path to the rescheduling code"). At that time, event->pmu is not assigned yet when initializing an event. The assignment is required. However, from commit 7e5b2a01d2ca ("perf: provide PMU when initing events"), the event->pmu is provided before event_init is invoked. The temporary pmu assignment in event_init should be removed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-15-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86/intel: Factor out intel_pmu_check_extra_regsKan Liang1-14/+21
Each Hybrid PMU has to check and update its own extra registers before registration. The intel_pmu_check_extra_regs will be reused later to check the extra registers of each hybrid PMU. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-14-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86/intel: Factor out intel_pmu_check_event_constraintsKan Liang1-35/+47
Each Hybrid PMU has to check and update its own event constraints before registration. The intel_pmu_check_event_constraints will be reused later to check the event constraints of each hybrid PMU. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-13-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86/intel: Factor out intel_pmu_check_num_countersKan Liang1-14/+24
Each Hybrid PMU has to check its own number of counters and mask fixed counters before registration. The intel_pmu_check_num_counters will be reused later to check the number of the counters for each hybrid PMU. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-12-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for extra_regsKan Liang3-8/+13
Different hybrid PMU may have different extra registers, e.g. Core PMU may have offcore registers, frontend register and ldlat register. Atom core may only have offcore registers and ldlat register. Each hybrid PMU should use its own extra_regs. An Intel Hybrid system should always have extra registers. Unconditionally allocate shared_regs for Intel Hybrid system. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-11-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for event constraintsKan Liang4-5/+10
The events are different among hybrid PMUs. Each hybrid PMU should use its own event constraints. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-10-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for hardware cache eventKan Liang2-3/+11
The hardware cache events are different among hybrid PMUs. Each hybrid PMU should have its own hw cache event table. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-9-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for unconstrainedKan Liang2-1/+12
The unconstrained value depends on the number of GP and fixed counters. Each hybrid PMU should use its own unconstrained. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-8-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for countersKan Liang4-25/+56
The number of GP and fixed counters are different among hybrid PMUs. Each hybrid PMU should use its own counter related information. When handling a certain hybrid PMU, apply the number of counters from the corresponding hybrid PMU. When reserving the counters in the initialization of a new event, reserve all possible counters. The number of counter recored in the global x86_pmu is for the architecture counters which are available for all hybrid PMUs. KVM doesn't support the hybrid PMU yet. Return the number of the architecture counters for now. For the functions only available for the old platforms, e.g., intel_pmu_drain_pebs_nhm(), nothing is changed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-7-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Hybrid PMU support for intel_ctrlKan Liang3-14/+24
The intel_ctrl is the counter mask of a PMU. The PMU counter information may be different among hybrid PMUs, each hybrid PMU should use its own intel_ctrl to check and access the counters. When handling a certain hybrid PMU, apply the intel_ctrl from the corresponding hybrid PMU. When checking the HW existence, apply the PMU and number of counters from the corresponding hybrid PMU as well. Perf will check the HW existence for each Hybrid PMU before registration. Expose the check_hw_exists() for a later patch. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-6-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86/intel: Hybrid PMU support for perf capabilitiesKan Liang5-7/+60
Some platforms, e.g. Alder Lake, have hybrid architecture. Although most PMU capabilities are the same, there are still some unique PMU capabilities for different hybrid PMUs. Perf should register a dedicated pmu for each hybrid PMU. Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and capabilities for each hybrid PMU. The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the architecture features which are available on all hybrid PMUs. The architecture features are stored in the global x86_pmu.intel_cap. For Alder Lake, the model-specific features are perf metrics and PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap should be 0 for these two features. Perf should not use the global intel_cap to check the features on a hybrid system. Add a dedicated intel_cap in the x86_hybrid_pmu to store the model-specific capabilities. Use the dedicated intel_cap to replace the global intel_cap for thse two features. The dedicated intel_cap will be set in the following "Add Alder Lake Hybrid support" patch. Add is_hybrid() to distinguish a hybrid system. ADL may have an alternative configuration. With that configuration, the X86_FEATURE_HYBRID_CPU is not set. Perf cannot rely on the feature bit. Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system. It will be assigned in the following "Add Alder Lake Hybrid support" patch as well. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
2021-04-19perf/x86: Track pmu in per-CPU cpu_hw_eventsKan Liang5-12/+24
Some platforms, e.g. Alder Lake, have hybrid architecture. In the same package, there may be more than one type of CPU. The PMU capabilities are different among different types of CPU. Perf will register a dedicated PMU for each type of CPU. Add a 'pmu' variable in the struct cpu_hw_events to track the dedicated PMU of the current CPU. Current x86_get_pmu() use the global 'pmu', which will be broken on a hybrid platform. Modify it to apply the 'pmu' of the specific CPU. Initialize the per-CPU 'pmu' variable with the global 'pmu'. There is nothing changed for the non-hybrid platforms. The is_x86_event() will be updated in the later patch ("perf/x86: Register hybrid PMUs") for hybrid platforms. For the non-hybrid platforms, nothing is changed here. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-4-git-send-email-kan.liang@linux.intel.com
2021-04-19x86/cpu: Add helper function to get the type of the current hybrid CPURicardo Neri2-0/+22
On processors with Intel Hybrid Technology (i.e., one having more than one type of CPU in the same package), all CPUs support the same instruction set and enumerate the same features on CPUID. Thus, all software can run on any CPU without restrictions. However, there may be model-specific differences among types of CPUs. For instance, each type of CPU may support a different number of performance counters. Also, machine check error banks may be wired differently. Even though most software will not care about these differences, kernel subsystems dealing with these differences must know. Add and expose a new helper function get_this_hybrid_cpu_type() to query the type of the current hybrid CPU. The function will be used later in the perf subsystem. The Intel Software Developer's Manual defines the CPU type as 8-bit identifier. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.liang@linux.intel.com
2021-04-19x86/cpufeatures: Enumerate Intel Hybrid Technology feature bitRicardo Neri1-0/+1
Add feature enumeration to identify a processor with Intel Hybrid Technology: one in which CPUs of more than one type are the same package. On a hybrid processor, all CPUs support the same homogeneous (i.e., symmetric) instruction set. All CPUs enumerate the same features in CPUID. Thus, software (user space and kernel) can run and migrate to any CPU in the system as well as utilize any of the enumerated features without any change or special provisions. The main difference among CPUs in a hybrid processor are power and performance properties. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.liang@linux.intel.com
2021-04-16perf/amd/uncore: Fix sysfs type mismatchNathan Chancellor1-3/+3
dev_attr_show() calls the __uncore_*_show() functions via an indirect call but their type does not currently match the type of the show() member in 'struct device_attribute', resulting in a Control Flow Integrity violation. $ cat /sys/devices/amd_l3/format/umask config:8-15 $ dmesg | grep "CFI failure" [ 1258.174653] CFI failure (target: __uncore_umask_show...): Update the type in the DEFINE_UNCORE_FORMAT_ATTR macro to match 'struct device_attribute' so that there is no more CFI violation. Fixes: 06f2c24584f3 ("perf/amd/uncore: Prepare to scale for more attributes that vary per family") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415001112.3024673-2-nathan@kernel.org
2021-04-16x86/events/amd/iommu: Fix sysfs type mismatchNathan Chancellor1-3/+3
dev_attr_show() calls _iommu_event_show() via an indirect call but _iommu_event_show()'s type does not currently match the type of the show() member in 'struct device_attribute', resulting in a Control Flow Integrity violation. $ cat /sys/devices/amd_iommu_1/events/mem_dte_hit csource=0x0a $ dmesg | grep "CFI failure" [ 3526.735140] CFI failure (target: _iommu_event_show...): Change _iommu_event_show() and 'struct amd_iommu_event_desc' to 'struct device_attribute' so that there is no more CFI violation. Fixes: 7be6296fdd75 ("perf/x86/amd: AMD IOMMU Performance Counter PERF uncore PMU implementation") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415001112.3024673-1-nathan@kernel.org
2021-04-16perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES eventNamhyung Kim2-0/+8
This patch adds a new software event to count context switches involving cgroup switches. So it's counted only if cgroups of previous and next tasks are different. Note that it only checks the cgroups in the perf_event subsystem. For cgroup v2, it shouldn't matter anyway. One can argue that we can do this by using existing sched_switch event with eBPF. But some systems might not have eBPF for some reason so I'd like to add this as a simple way. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210083327.22726-2-namhyung@kernel.org
2021-04-16perf core: Factor out __perf_sw_event_schedNamhyung Kim1-21/+12
In some cases, we need to check more than whether the software event is enabled. So split the condition check and the actual event handling. This is a preparation for the next change. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210083327.22726-1-namhyung@kernel.org
2021-04-16perf/x86: Move cpuc->running into P4 specific codeKan Liang3-5/+13
The 'running' variable is only used in the P4 PMU. Current perf sets the variable in the critical function x86_pmu_start(), which wastes cycles for everybody not running on P4. Move cpuc->running into the P4 specific p4_pmu_enable_event(). Add a static per-CPU 'p4_running' variable to replace the 'running' variable in the struct cpu_hw_events. Saves space for the generic structure. The p4_pmu_enable_all() also invokes the p4_pmu_enable_event(), but it should not set cpuc->running. Factor out __p4_pmu_enable_event() for p4_pmu_enable_all(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618410990-21383-1-git-send-email-kan.liang@linux.intel.com
2021-04-16selftests/perf_events: Add kselftest for remove_on_execMarco Elver3-1/+262
Add kselftest to test that remove_on_exec removes inherited events from child tasks. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210408103605.1676875-9-elver@google.com
2021-04-16selftests/perf_events: Add kselftest for process-wide sigtrap handlingMarco Elver5-0/+220
Add a kselftest for testing process-wide perf events with synchronous SIGTRAP on events (using breakpoints). In particular, we want to test that changes to the event propagate to all children, and the SIGTRAPs are in fact synchronously sent to the thread where the event occurred. Note: The "signal_stress" test case is also added later in the series to perf tool's built-in tests. The test here is more elaborate in that respect, which on one hand avoids bloating the perf tool unnecessarily, but we also benefit from structured tests with TAP-compliant output that the kselftest framework provides. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210408103605.1676875-8-elver@google.com
2021-04-16perf: Add support for SIGTRAP on perf eventsMarco Elver3-2/+58
Adds bit perf_event_attr::sigtrap, which can be set to cause events to send SIGTRAP (with si_code TRAP_PERF) to the task where the event occurred. The primary motivation is to support synchronous signals on perf events in the task where an event (such as breakpoints) triggered. To distinguish perf events based on the event type, the type is set in si_errno. For events that are associated with an address, si_addr is copied from perf_sample_data. The new field perf_event_attr::sig_data is copied to si_perf, which allows user space to disambiguate which event (of the same type) triggered the signal. For example, user space could encode the relevant information it cares about in sig_data. We note that the choice of an opaque u64 provides the simplest and most flexible option. Alternatives where a reference to some user space data is passed back suffer from the problem that modification of referenced data (be it the event fd, or the perf_event_attr) can race with the signal being delivered (of course, the same caveat applies if user space decides to store a pointer in sig_data, but the ABI explicitly avoids prescribing such a design). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.net/
2021-04-16signal: Introduce TRAP_PERF si_code and si_perf to siginfoMarco Elver8-3/+33
Introduces the TRAP_PERF si_code, and associated siginfo_t field si_perf. These will be used by the perf event subsystem to send signals (if requested) to the task where an event occurred. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
2021-04-16perf: Add support for event removal on execMarco Elver2-9/+64
Adds bit perf_event_attr::remove_on_exec, to support removing an event from a task on exec. This option supports the case where an event is supposed to be process-wide only, and should not propagate beyond exec, to limit monitoring to the original process image only. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210408103605.1676875-5-elver@google.com
2021-04-16perf: Support only inheriting events if cloned with CLONE_THREADMarco Elver4-11/+20
Adds bit perf_event_attr::inherit_thread, to restricting inheriting events only if the child was cloned with CLONE_THREAD. This option supports the case where an event is supposed to be process-wide only (including subthreads), but should not propagate beyond the current process's shared environment. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/YBvj6eJR%2FDY2TsEB@hirez.programming.kicks-ass.net/
2021-04-16perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to childrenMarco Elver1-1/+21
As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children. Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lkml.kernel.org/r/20210408103605.1676875-3-elver@google.com
2021-04-16perf: Rework perf_event_exit_event()Peter Zijlstra2-63/+80
Make perf_event_exit_event() more robust, such that we can use it from other contexts. Specifically the up and coming remove_on_exec. For this to work we need to address a few issues. Remove_on_exec will not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to disable event_function_call() and we thus have to use perf_remove_from_context(). When using perf_remove_from_context(), there's two races to consider. The first is against close(), where we can have concurrent tear-down of the event. The second is against child_list iteration, which should not find a half baked event. To address this, teach perf_remove_from_context() to special case !ctx->is_active and about DETACH_CHILD. [ elver@google.com: fix racing parent/child exit in sync_child_event(). ] Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210408103605.1676875-2-elver@google.com
2021-04-16perf intel-pt: Use aux_watermarkAlexander Shishkin1-0/+6
Turns out, the default setting of attr.aux_watermark to half of the total buffer size is not very useful, especially with smaller buffers. The problem is that, after half of the buffer is filled up, the kernel updates ->aux_head and sets up the next "transaction", while observing that ->aux_tail is still zero (as userspace haven't had the chance to update it), meaning that the trace will have to stop at the end of this second "transaction". This means, for example, that the second PERF_RECORD_AUX in every trace comes with TRUNCATED flag set. Setting attr.aux_watermark to quarter of the buffer gives enough space for the ->aux_tail update to be observed and prevents the data loss. The obligatory before/after showcase: > # perf_before record -e intel_pt//u -m,8 uname > Linux > [ perf record: Woken up 6 times to write data ] > Warning: > AUX data lost 4 times out of 10! > > [ perf record: Captured and wrote 0.099 MB perf.data ] > # perf record -e intel_pt//u -m,8 uname > Linux > [ perf record: Woken up 4 times to write data ] > [ perf record: Captured and wrote 0.039 MB perf.data ] The effect is still visible with large workloads and large buffers, although less pronounced. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210414154955.49603-3-alexander.shishkin@linux.intel.com
2021-04-16perf: Cap allocation order at aux_watermarkAlexander Shishkin1-16/+18
Currently, we start allocating AUX pages half the size of the total requested AUX buffer size, ignoring the attr.aux_watermark setting. This, in turn, makes intel_pt driver disregard the watermark also, as it uses page order for its SG (ToPA) configuration. Now, this can be fixed in the intel_pt PMU driver, but seeing as it's the only one currently making use of high order allocations, there is no reason not to fix the allocator instead. This way, any other driver wishing to add this support would not have to worry about this. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210414154955.49603-2-alexander.shishkin@linux.intel.com
2021-04-02perf/x86/intel/uncore: Enable IIO stacks to PMON mapping for multi-segment SKXAlexander Antonov3-34/+47
IIO stacks to PMON mapping on Skylake servers is exposed through introduced early attributes /sys/devices/uncore_iio_<pmu_idx>/dieX, where dieX is a file which holds "Segment:Root Bus" for PCIe root port which can be monitored by that IIO PMON block. These sysfs attributes are disabled for multiple segment topologies except VMD domains which start at 0x10000. This patch removes the limitation and enables IIO stacks to PMON mapping for multi-segment Skylake servers by introducing segment-aware intel_uncore_topology structure and attributing the topology configuration to the segment in skx_iio_get_topology() function. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://lkml.kernel.org/r/20210323150507.2013-1-alexander.antonov@linux.intel.com
2021-04-02perf/x86/intel/uncore: Generic support for the MMIO type of uncore blocksKan Liang4-0/+101
The discovery table provides the generic uncore block information for the MMIO type of uncore blocks, which is good enough to provide basic uncore support. The box control field is composed of the BAR address and box control offset. When initializing the uncore blocks, perf should ioremap the address from the box control field. Implement the generic support for the MMIO type of uncore block. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-6-git-send-email-kan.liang@linux.intel.com
2021-04-02perf/x86/intel/uncore: Generic support for the PCI type of uncore blocksKan Liang4-7/+177
The discovery table provides the generic uncore block information for the PCI type of uncore blocks, which is good enough to provide basic uncore support. The PCI BUS and DEVFN information can be retrieved from the box control field. Introduce the uncore_pci_pmus_register() to register all the PCICFG type of uncore blocks. The old PCI probe/remove way is dropped. The PCI BUS and DEVFN information are different among dies. Add box_ctls to store the box control field of each die. Add a new BUS notifier for the PCI type of uncore block to support the hotplug. If the device is "hot remove", the corresponding registered PMU has to be unregistered. Perf cannot locate the PMU by searching a const pci_device_id table, because the discovery tables don't provide such information. Introduce uncore_pci_find_dev_pmu_from_types() to search the whole uncore_pci_uncores for the PMU. Implement generic support for the PCI type of uncore block. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-5-git-send-email-kan.liang@linux.intel.com
2021-04-02perf/x86/intel/uncore: Rename uncore_notifier to uncore_pci_sub_notifierKan Liang1-6/+14
Perf will use a similar method to the PCI sub driver to register the PMUs for the PCI type of uncore blocks. The method requires a BUS notifier to support hotplug. The current BUS notifier cannot be reused, because it searches a const id_table for the corresponding registered PMU. The PCI type of uncore blocks in the discovery tables doesn't provide an id_table. Factor out uncore_bus_notify() and add the pointer of an id_table as a parameter. The uncore_bus_notify() will be reused in the following patch. The current BUS notifier is only used by the PCI sub driver. Its name is too generic. Rename it to uncore_pci_sub_notifier, which is specific for the PCI sub driver. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-4-git-send-email-kan.liang@linux.intel.com
2021-04-02perf/x86/intel/uncore: Generic support for the MSR type of uncore blocksKan Liang4-10/+182
The discovery table provides the generic uncore block information for the MSR type of uncore blocks, e.g., the counter width, the number of counters, the location of control/counter registers, which is good enough to provide basic uncore support. It can be used as a fallback solution when the kernel doesn't support a platform. The name of the uncore box cannot be retrieved from the discovery table. uncore_type_&typeID_&boxID will be used as its name. Save the type ID and the box ID information in the struct intel_uncore_type. Factor out uncore_get_pmu_name() to handle different naming methods. Implement generic support for the MSR type of uncore block. Some advanced features, such as filters and constraints, cannot be retrieved from discovery tables. Features that rely on that information are not be supported here. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-3-git-send-email-kan.liang@linux.intel.com
2021-04-02perf/x86/intel/uncore: Parse uncore discovery tablesKan Liang4-8/+448
A self-describing mechanism for the uncore PerfMon hardware has been introduced with the latest Intel platforms. By reading through an MMIO page worth of information, perf can 'discover' all the standard uncore PerfMon registers in a machine. The discovery mechanism relies on BIOS's support. With a proper BIOS, a PCI device with the unique capability ID 0x23 can be found on each die. Perf can retrieve the information of all available uncore PerfMons from the device via MMIO. The information is composed of one global discovery table and several unit discovery tables. - The global discovery table includes global uncore information of the die, e.g., the address of the global control register, the offset of the global status register, the number of uncore units, the offset of unit discovery tables, etc. - The unit discovery table includes generic uncore unit information, e.g., the access type, the counter width, the address of counters, the address of the counter control, the unit ID, the unit type, etc. The unit is also called "box" in the code. Perf can provide basic uncore support based on this information with the following patches. To locate the PCI device with the discovery tables, check the generic PCI ID first. If it doesn't match, go through the entire PCI device tree and locate the device with the unique capability ID. The uncore information is similar among dies. To save parsing time and space, only completely parse and store the discovery tables on the first die and the first box of each die. The parsed information is stored in an RB tree structure, intel_uncore_discovery_type. The size of the stored discovery tables varies among platforms. It's around 4KB for a Sapphire Rapids server. If a BIOS doesn't support the 'discovery' mechanism, the uncore driver will exit with -ENODEV. There is nothing changed. Add a module parameter to disable the discovery feature. If a BIOS gets the discovery tables wrong, users can have an option to disable the feature. For the current patchset, the uncore driver will exit with -ENODEV. In the future, it may fall back to the hardcode uncore driver on a known platform. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-2-git-send-email-kan.liang@linux.intel.com
2021-03-16perf/core: Fix unconditional security_locked_down() callOndrej Mosnacek1-6/+6
Currently, the lockdown state is queried unconditionally, even though its result is used only if the PERF_SAMPLE_REGS_INTR bit is set in attr.sample_type. While that doesn't matter in case of the Lockdown LSM, it causes trouble with the SELinux's lockdown hook implementation. SELinux implements the locked_down hook with a check whether the current task's type has the corresponding "lockdown" class permission ("integrity" or "confidentiality") allowed in the policy. This means that calling the hook when the access control decision would be ignored generates a bogus permission check and audit record. Fix this by checking sample_type first and only calling the hook when its result would be honored. Fixes: b0c8fdc7fdb7 ("lockdown: Lock down perf when in confidentiality mode") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul Moore <paul@paul-moore.com> Link: https://lkml.kernel.org/r/20210224215628.192519-1-omosnace@redhat.com
2021-03-16perf core: Allocate perf_event in the target node memoryNamhyung Kim1-1/+4
For cpu events, it'd better allocating them in the corresponding node memory as they would be mostly accessed by the target cpu. Although perf tools sets the cpu affinity before calling perf_event_open, there are places it doesn't (notably perf record) and we should consider other external users too. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311115413.444407-2-namhyung@kernel.org
2021-03-16perf core: Add a kmem_cache for struct perf_eventNamhyung Kim1-3/+6
The kernel can allocate a lot of struct perf_event when profiling. For example, 256 cpu x 8 events x 20 cgroups = 40K instances of the struct would be allocated on a large system. The size of struct perf_event in my setup is 1152 byte. As it's allocated by kmalloc, the actual allocation size would be rounded up to 2K. Then there's 896 byte (~43%) of waste per instance resulting in total ~35MB with 40K instances. We can create a dedicated kmem_cache to avoid such a big unnecessary memory consumption. With this change, I can see below (note this machine has 112 cpus). # grep perf_event /proc/slabinfo perf_event 224 784 1152 7 2 : tunables 24 12 8 : slabdata 112 112 0 The sixth column is pages-per-slab which is 2, and the fifth column is obj-per-slab which is 7. Thus actually it can use 1152 x 7 = 8064 byte in the 8K, and wasted memory is (8192 - 8064) / 7 = ~18 byte per instance. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311115413.444407-1-namhyung@kernel.org
2021-03-16perf core: Allocate perf_buffer in the target node memoryNamhyung Kim1-3/+6
I found the ring buffer pages are allocated in the node but the ring buffer itself is not. Let's convert it to use kzalloc_node() too. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210315033436.682438-1-namhyung@kernel.org
2021-03-14Linux 5.12-rc3Linus Torvalds1-1/+1
2021-03-14prctl: fix PR_SET_MM_AUXV kernel stack leakAlexey Dobriyan1-1/+1
Doing a prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1); will copy 1 byte from userspace to (quite big) on-stack array and then stash everything to mm->saved_auxv. AT_NULL terminator will be inserted at the very end. /proc/*/auxv handler will find that AT_NULL terminator and copy original stack contents to userspace. This devious scheme requires CAP_SYS_RESOURCE. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-14Merge tag 'irq-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds7-18/+8
Pull irq fixes from Thomas Gleixner: "A set of irqchip updates: - Make the GENERIC_IRQ_MULTI_HANDLER configuration correct - Add a missing DT compatible string for the Ingenic driver - Remove the pointless debugfs_file pointer from struct irqdomain" * tag 'irq-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/ingenic: Add support for the JZ4760 dt-bindings/irq: Add compatible string for the JZ4760B irqchip: Do not blindly select CONFIG_GENERIC_IRQ_MULTI_HANDLER ARM: ep93xx: Select GENERIC_IRQ_MULTI_HANDLER directly irqdomain: Remove debugfs_file from struct irq_domain
2021-03-14Merge tag 'timers-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-21/+39
Pull timer fix from Thomas Gleixner: "A single fix in for hrtimers to prevent an interrupt storm caused by the lack of reevaluation of the timers which expire in softirq context under certain circumstances, e.g. when the clock was set" * tag 'timers-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
2021-03-14Merge tag 'sched-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-67/+63
Pull scheduler fixes from Thomas Gleixner: "A set of scheduler updates: - Prevent a NULL pointer dereference in the migration_stop_cpu() mechanims - Prevent self concurrency of affine_move_task() - Small fixes and cleanups related to task migration/affinity setting - Ensure that sync_runqueues_membarrier_state() is invoked on the current CPU when it is in the cpu mask" * tag 'sched-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/membarrier: fix missing local execution of ipi_sync_rq_state() sched: Simplify set_affinity_pending refcounts sched: Fix affine_move_task() self-concurrency sched: Optimize migration_cpu_stop() sched: Collate affine_move_task() stoppers sched: Simplify migration_cpu_stop() sched: Fix migration_cpu_stop() requeueing
2021-03-14Merge tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-6/+7
Pull objtool fix from Thomas Gleixner: "A single objtool fix to handle the PUSHF/POPF validation correctly for the paravirt changes which modified arch_local_irq_restore not to use popf" * tag 'objtool-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool,x86: Fix uaccess PUSHF/POPF validation
2021-03-14Merge tag 'locking-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-10/+9
Pull locking fixes from Thomas Gleixner: "A couple of locking fixes: - A fix for the static_call mechanism so it handles unaligned addresses correctly. - Make u64_stats_init() a macro so every instance gets a seperate lockdep key. - Make seqcount_latch_init() a macro as well to preserve the static variable which is used for the lockdep key" * tag 'locking-urgent-2021-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: seqlock,lockdep: Fix seqcount_latch_init() u64_stats,lockdep: Fix u64_stats_init() vs lockdep static_call: Fix the module key fixup
2021-03-14Merge tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds5-15/+51
Pull perf fixes from Borislav Petkov: - Make sure PMU internal buffers are flushed for per-CPU events too and properly handle PID/TID for large PEBS. - Handle the case properly when there's no PMU and therefore return an empty list of perf MSRs for VMX to switch instead of reading random garbage from the stack. * tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/perf: Use RET0 as default for guest_get_msrs to handle "no PMU" case perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR perf/core: Flush PMU internal buffers for per-CPU events
2021-03-14Merge tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+16
Pull EFI fix from Ard Biesheuvel via Borislav Petkov: "Fix an oversight in the handling of EFI_RT_PROPERTIES_TABLE, which was added v5.10, but failed to take the SetVirtualAddressMap() RT service into account" * tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: stub: omit SetVirtualAddressMap() if marked unsupported in RT_PROP table