aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2024-07-09task_work: s/task_work_cancel()/task_work_cancel_func()/Frederic Weisbecker4-8/+8
A proper task_work_cancel() API that actually cancels a callback and not *any* callback pointing to a given function is going to be needed for perf events event freeing. Do the appropriate rename to prepare for that. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240621091601.18227-2-frederic@kernel.org
2024-07-04perf/x86/amd/uncore: Fix DF and UMC domain identificationSandipan Das1-3/+3
For uncore PMUs, a single context is shared across all CPUs in a domain. The domain can be a CCX, like in the case of the L3 PMU, or a socket, like in the case of DF and UMC PMUs. This information is available via the PMU's cpumask. For contexts shared across a socket, the domain is currently determined from topology_die_id() which is incorrect after the introduction of commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") as it now returns a CCX identifier on Zen 4 and later systems which support CPUID leaf 0x80000026. Use topology_logical_package_id() instead as it always returns a socket identifier irrespective of the availability of CPUID leaf 0x80000026. Fixes: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240626074942.1044818-1-sandipan.das@amd.com
2024-07-04perf/x86/amd/uncore: Avoid PMU registration if counters are unavailableSandipan Das1-8/+14
X86_FEATURE_PERFCTR_NB and X86_FEATURE_PERFCTR_LLC are derived from CPUID leaf 0x80000001 ECX bits 24 and 28 respectively and denote the availability of DF and L3 counters. When these bits are not set, the corresponding PMUs have no counters and hence, should not be registered. Fixes: 07888daa056e ("perf/x86/amd/uncore: Move discovery and registration") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240626074404.1044230-1-sandipan.das@amd.com
2024-07-04perf/x86/intel: Support Perfmon MSRs aliasingKan Liang4-5/+32
The architectural performance monitoring V6 supports a new range of counters' MSRs in the 19xxH address range. They include all the GP counter MSRs, the GP control MSRs, and the fixed counter MSRs. The step between each sibling counter is 4. Add intel_pmu_addr_offset() to calculate the correct offset. Add fixedctr in struct x86_pmu to store the address of the fixed counter 0. It can be used to calculate the rest of the fixed counters. The MSR address of the fixed counter control is not changed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-9-kan.liang@linux.intel.com
2024-07-04perf/x86/intel: Support PERFEVTSEL extensionKan Liang2-4/+69
Two new fields (the unit mask2, and the equal flag) are added in the IA32_PERFEVTSELx MSRs. They can be enumerated by the CPUID.23H.0.EBX. Update the config_mask in x86_pmu and x86_hybrid_pmu for the true layout of the PERFEVTSEL. Expose the new formats into sysfs if they are available. The umask extension reuses the same format attr name "umask" as the previous umask. Add umask2_show to determine/display the correct format for the current machine. Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-8-kan.liang@linux.intel.com
2024-07-04perf/x86: Add config_mask to represent EVENTSEL bitmaskKan Liang3-1/+12
Different vendors may support different fields in EVENTSEL MSR, such as Intel would introduce new fields umask2 and eq bits in EVENTSEL MSR since Perfmon version 6. However, a fixed mask X86_RAW_EVENT_MASK is used to filter the attr.config. Introduce a new config_mask to record the real supported EVENTSEL bitmask. Only apply it to the existing code now. No functional change. Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-7-kan.liang@linux.intel.com
2024-07-04perf/x86/intel: Support new data source for Lunar LakeKan Liang4-5/+113
A new PEBS data source format is introduced for the p-core of Lunar Lake. The data source field is extended to 8 bits with new encodings. A new layout is introduced into the union intel_x86_pebs_dse. Introduce the lnl_latency_data() to parse the new format. Enlarge the pebs_data_source[] accordingly to include new encodings. Only the mem load and the mem store events can generate the data source. Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and INTEL_HYBRID_STLAT_CONSTRAINT to mark them. Add two new bits for the new cache-related data src, L2_MHB and MSC. The L2_MHB is short for L2 Miss Handling Buffer, which is similar to LFB (Line Fill Buffer), but to track the L2 Cache misses. The MSC stands for the memory-side cache. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-6-kan.liang@linux.intel.com
2024-07-04perf/x86/intel: Rename model-specific pebs_latency_data functionsKan Liang3-16/+16
The model-specific pebs_latency_data functions of ADL and MTL use the "small" as a postfix to indicate the e-core. The postfix is too generic for a model-specific function. It cannot provide useful information that can directly map it to a specific uarch, which can facilitate the development and maintenance. Use the abbr of the uarch to rename the model-specific functions. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-5-kan.liang@linux.intel.com
2024-07-04perf/x86: Add Lunar Lake and Arrow Lake supportKan Liang4-0/+147
From PMU's perspective, Lunar Lake and Arrow Lake are similar to the previous generation Meteor Lake. Both are hybrid platforms, with e-core and p-core. The key differences include: - The e-core supports 3 new fixed counters - The p-core supports an updated PEBS Data Source format - More GP counters (Updated event constraint table) - New Architectural performance monitoring V6 (New Perfmon MSRs aliasing, umask2, eq). - New PEBS format V6 (Counters Snapshotting group) - New RDPMC metrics clear mode The legacy features, the 3 new fixed counters and updated event constraint table are enabled in this patch. The new PEBS data source format, the architectural performance monitoring V6, the PEBS format V6, and the new RDPMC metrics clear mode are supported in the following patches. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-4-kan.liang@linux.intel.com
2024-07-04perf/x86: Support counter maskKan Liang9-179/+199
The current perf assumes that both GP and fixed counters are contiguous. But it's not guaranteed on newer Intel platforms or in a virtualization environment. Use the counter mask to replace the number of counters for both GP and the fixed counters. For the other ARCHs or old platforms which don't support a counter mask, using GENMASK_ULL(num_counter - 1, 0) to replace. There is no functional change for them. The interface to KVM is not changed. The number of counters still be passed to KVM. It can be updated later separately. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-3-kan.liang@linux.intel.com
2024-07-04perf/x86/intel: Support the PEBS event maskKan Liang4-13/+26
The current perf assumes that the counters that support PEBS are contiguous. But it's not guaranteed with the new leaf 0x23 introduced. The counters are enumerated with a counter mask. There may be holes in the counter mask for future platforms or in a virtualization environment. Store the PEBS event mask rather than the maximum number of PEBS counters in the x86 PMU structures. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-2-kan.liang@linux.intel.com
2024-07-04perf/x86/intel/cstate: Add Lunarlake supportZhang Rui1-7/+19
Compared with previous client platforms, PC8 is removed from Lunarlake. It supports CC1/CC6/CC7 and PC2/PC3/PC6/PC10 residency counters. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-4-rui.zhang@intel.com
2024-07-04perf/x86/intel/cstate: Add Arrowlake supportZhang Rui1-8/+12
Like Alderlake, Arrowlake supports CC1/CC6/CC7 and PC2/PC3/PC6/PC8/PC10. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-3-rui.zhang@intel.com
2024-07-04perf/x86/intel/cstate: Fix Alderlake/Raptorlake/MeteorlakeZhang Rui1-5/+2
For Alderlake, the spec changes after the patch submitted and PC7/PC9 are removed. Raptorlake and Meteorlake, which copy the Alderlake cstate PMU, also don't have PC7/PC9. Remove PC7/PC9 support for Alderlake/Raptorlake/Meteorlake. Fixes: d0ca946bcf84 ("perf/x86/cstate: Add Alder Lake CPU support") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-2-rui.zhang@intel.com
2024-07-04perf: Make rb_alloc_aux() return an error immediately if nr_pages <= 0Adrian Hunter1-0/+3
rb_alloc_aux() should not be called with nr_pages <= 0. Make it more robust and readable by returning an error immediately in that case. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-8-adrian.hunter@intel.com
2024-07-04perf: Fix default aux_watermark calculationAdrian Hunter1-1/+3
The default aux_watermark is half the AUX area buffer size. In general, on a 64-bit architecture, the AUX area buffer size could be a bigger than fits in a 32-bit type, but the calculation does not allow for that possibility. However the aux_watermark value is recorded in a u32, so should not be more than U32_MAX either. Fix by doing the calculation in a correctly sized type, and limiting the result to U32_MAX. Fixes: d68e6799a5c8 ("perf: Cap allocation order at aux_watermark") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-7-adrian.hunter@intel.com
2024-07-04perf: Prevent passing zero nr_pages to rb_alloc_aux()Adrian Hunter1-0/+2
nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int, and is stored as an int. Only power-of-2 values are accepted, so if nr_pages is a 64_bit value, it will be passed to rb_alloc_aux() as zero. That is not ideal because: 1. the value is incorrect 2. rb_alloc_aux() is at risk of misbehaving, although it manages to return -ENOMEM in that case, it is a result of passing zero to get_order() even though the get_order() result is documented to be undefined in that case. Fix by simply validating the maximum supported value in the first place. Use -ENOMEM error code for consistency with the current error code that is returned in that case. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com
2024-07-04perf: Fix perf_aux_size() for greater-than 32-bit sizeAdrian Hunter1-1/+1
perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to calculate a 64-bit size. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-5-adrian.hunter@intel.com
2024-07-04perf/x86/intel/pt: Fix pt_topa_entry_for_page() address calculationAdrian Hunter1-1/+1
Currently, perf allocates an array of page pointers which is limited in size by MAX_PAGE_ORDER. That in turn limits the maximum Intel PT buffer size to 2GiB. Should that limitation be lifted, the Intel PT driver can support larger sizes, except for one calculation in pt_topa_entry_for_page(), which is limited to 32-bits. Fix pt_topa_entry_for_page() address calculation by adding a cast. Fixes: 39152ee51b77 ("perf/x86/intel/pt: Get rid of reverse lookup table for ToPA") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-4-adrian.hunter@intel.com
2024-07-04perf/x86/intel/pt: Fix a topa_entry base address calculationAdrian Hunter1-1/+1
topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit type, even if the underlying type is 64-bit, and so, if necessary, must be cast to a larger type when calculations are done. Fix a topa_entry->base address calculation by adding a cast. Without the cast, the address was limited to 36-bits i.e. 64GiB. The address calculation is used on systems that do not support Multiple Entry ToPA (only Broadwell), and affects physical addresses on or above 64GiB. Instead of writing to the correct address, the address comprising the first 36 bits would be written to. Intel PT snapshot and sampling modes are not affected. Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240624201101.60186-3-adrian.hunter@intel.com
2024-07-04perf/x86/intel/pt: Fix topa_entry base lengthMarco Cavenati1-2/+2
topa_entry->base needs to store a pfn. It obviously needs to be large enough to store the largest possible x86 pfn which is MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small. Increase the size of topa_entry->base from 36 bits to 40 bits. Note, systems where physical addresses can be 256TiB or more are affected. [ Adrian: Amend commit message as suggested by Dave Hansen ] Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Signed-off-by: Marco Cavenati <cavenati.marco@gmail.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240624201101.60186-2-adrian.hunter@intel.com
2024-06-29x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model definesAndrew Cooper1-11/+7
The outer if () should have been dropped when switching to c->x86_vfm. Fixes: 6568fc18c2f6 ("x86/cpu/intel: Switch to new Intel CPU model defines") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240529183605.17520-1-andrew.cooper3@citrix.com
2024-06-20x86/cpufeatures: Flip the /proc/cpuinfo appearance logicBorislav Petkov (AMD)3-457/+456
I'm getting tired of telling people to put a magic "" in the #define X86_FEATURE /* "" ... */ comment to hide the new feature flag from the user-visible /proc/cpuinfo. Flip the logic to make it explicit: an explicit "<name>" in the comment adds the flag to /proc/cpuinfo and otherwise not, by default. Add the "<name>" of all the existing flags to keep backwards compatibility with userspace. There should be no functional changes resulting from this. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
2024-06-17perf/x86/intel/uncore: Support HBM and CXL PMON countersKan Liang1-2/+53
Unknown uncore PMON types can be found in both SPR and EMR with HBM or CXL. $ls /sys/devices/ | grep type uncore_type_12_16 uncore_type_12_18 uncore_type_12_2 uncore_type_12_4 uncore_type_12_6 uncore_type_12_8 uncore_type_13_17 uncore_type_13_19 uncore_type_13_3 uncore_type_13_5 uncore_type_13_7 uncore_type_13_9 The unknown PMON types are HBM and CXL PMON. Except for the name, the other information regarding the HBM and CXL PMON counters can be retrieved via the discovery table. Add them into the uncores tables for SPR and EMR. The event config registers for all CXL related units are 8-byte apart. Add SPR_UNCORE_MMIO_OFFS8_COMMON_FORMAT to specially handle it. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-9-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Cleanup unused unit structureKan Liang4-112/+12
The unit control and ID information are retrieved from the unit control RB tree. No one uses the old structure anymore. Remove them. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-8-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Apply the unit control RB tree to PCI uncore unitsKan Liang5-48/+94
The unit control RB tree has the unit control and unit ID information for all the PCI units. Use them to replace the box_ctls/pci_offsets to get an accurate unit control address for PCI uncore units. The UPI/M3UPI units in the discovery table are ignored. Please see the commit 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR"). Manually allocate a unit control RB tree for UPI/M3UPI. Add cleanup_extra_boxes to release such manual allocation. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-7-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Apply the unit control RB tree to MSR uncore unitsKan Liang4-11/+59
The unit control RB tree has the unit control and unit ID information for all the MSR units. Use them to replace the box_ctl and uncore_msr_box_ctl() to get an accurate unit control address for MSR uncore units. Add intel_generic_uncore_assign_hw_event(), which utilizes the accurate unit control address from the unit control RB tree to calculate the config_base and event_base. The unit id related information should be retrieved from the unit control RB tree as well. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-6-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Apply the unit control RB tree to MMIO uncore unitsKan Liang1-16/+14
The unit control RB tree has the unit control and unit ID information for all the units. Use it to replace the box_ctls/mmio_offsets to get an accurate unit control address for MMIO uncore units. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-5-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Retrieve the unit ID from the unit control RB treeKan Liang1-0/+3
The box_ids only save the unit ID for the first die. If a unit, e.g., a CXL unit, doesn't exist in the first die. The unit ID cannot be retrieved. The unit control RB tree also stores the unit ID information. Retrieve the unit ID from the unit control RB tree Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-4-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Support per PMU cpumaskKan Liang4-5/+89
The cpumask of some uncore units, e.g., CXL uncore units, may be wrong under some configurations. Perf may access an uncore counter of a non-existent uncore unit. The uncore driver assumes that all uncore units are symmetric among dies. A global cpumask is shared among all uncore PMUs. However, some CXL uncore units may only be available on some dies. A per PMU cpumask is introduced to track the CPU mask of this PMU. The driver searches the unit control RB tree to check whether the PMU is available on a given die, and updates the per PMU cpumask accordingly. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20240614134631.1092359-3-kan.liang@linux.intel.com
2024-06-17perf/x86/uncore: Save the unit control address of all unitsKan Liang2-2/+87
The unit control address of some CXL units may be wrongly calculated under some configuration on a EMR machine. The current implementation only saves the unit control address of the units from the first die, and the first unit of the rest of dies. Perf assumed that the units from the other dies have the same offset as the first die. So the unit control address of the rest of the units can be calculated. However, the assumption is wrong, especially for the CXL units. Introduce an RB tree for each uncore type to save the unit control address and three kinds of ID information (unit ID, PMU ID, and die ID) for all units. The unit ID is a physical ID of a unit. The PMU ID is a logical ID assigned to a unit. The logical IDs start from 0 and must be contiguous. The physical ID and the logical ID are 1:1 mapping. The units with the same physical ID in different dies share the same PMU. The die ID indicates which die a unit belongs to. The RB tree can be searched by two different keys (unit ID or PMU ID + die ID). During the RB tree setup, the unit ID is used as a key to look up the RB tree. The perf can create/assign a proper PMU ID to the unit. Later, after the RB tree is setup, PMU ID + die ID is used as a key to look up the RB tree to fill the cpumask of a PMU. It's used more frequently, so PMU ID + die ID is compared in the unit_less(). The uncore_find_unit() has to be O(N). But the RB tree setup only occurs once during the driver load time. It should be acceptable. Compared with the current implementation, more space is required to save the information of all units. The extra size should be acceptable. For example, on EMR, there are 221 units at most. For a 2-socket machine, the extra space is ~6KB at most. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240614134631.1092359-2-kan.liang@linux.intel.com
2024-06-13x86/CPU/AMD: Always inline amd_clear_divider()Mateusz Guzik2-12/+11
The routine is used on syscall exit and on non-AMD CPUs is guaranteed to be empty. It probably does not need to be a function call even on CPUs which do need the mitigation. [ bp: Make sure it is always inlined so that noinstr marking works. ] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240613082637.659133-1-mjguzik@gmail.com
2024-06-02x86/mce/inject: Add missing MODULE_DESCRIPTION() lineJeff Johnson1-0/+1
make W=1 C=1 warns: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kernel/cpu/mce/mce-inject.o Add the missing MODULE_DESCRIPTION(). Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240530-md-x86-mce-inject-v1-1-2a9dc998f709@quicinc.com
2024-05-28perf/x86/rapl: Switch to new Intel CPU model definesTony Luck1-45/+45
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-44-tony.luck%40intel.com
2024-05-28x86/boot: Switch to new Intel CPU model definesTony Luck1-1/+1
New CPU #defines encode vendor and family as well as model but boot code doesn't have all the infrastructure to use them. Hard code the one CPU model number used here. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-35-tony.luck%40intel.com
2024-05-28x86/cpu: Switch to new Intel CPU model definesTony Luck2-36/+36
New CPU #defines encode vendor and family as well as model. Update INTEL_CPU_DESC() to work with vendor/family/model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-34-tony.luck%40intel.com
2024-05-28perf/x86/intel: Switch to new Intel CPU model definesTony Luck1-74/+74
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-32-tony.luck%40intel.com
2024-05-28x86/virt/tdx: Switch to new Intel CPU model definesTony Luck1-4/+4
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-31-tony.luck%40intel.com
2024-05-28x86/PCI: Switch to new Intel CPU model definesTony Luck1-2/+2
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-30-tony.luck%40intel.com
2024-05-28x86/cpu/intel: Switch to new Intel CPU model definesTony Luck1-55/+53
New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-29-tony.luck%40intel.com
2024-05-28x86/platform/intel-mid: Switch to new Intel CPU model definesTony Luck1-3/+3
New CPU #defines encode vendor and family as well as model. N.B. Drop Haswell. CPU model 0x3C was included by mistake in upstream code. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Andy Shevchenko <andy@kernel.org> Link: https://lore.kernel.org/all/20240521161002.12866-1-tony.luck%40intel.com
2024-05-28x86/pconfig: Remove unused MKTME pconfig codeAlison Schofield3-150/+1
Code supporting Intel PCONFIG targets was an early piece of enabling for MKTME (Multi-Key Total Memory Encryption). Since MKTME feature enablement did not follow into the kernel, remove the unused PCONFIG code. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/4ddff30d466785b4adb1400f0518783012835141.1715054189.git.alison.schofield%40intel.com
2024-05-28x86/cpu: Remove useless work in detect_tme_early()Alison Schofield1-60/+12
TME (Total Memory Encryption) and MKTME (Multi-Key Total Memory Encryption) BIOS detection were introduced together here [1] and are loosely coupled in the Intel CPU init code. TME is a hardware only feature and its BIOS status is all that needs to be shared with the kernel user: enabled or disabled. The TME algorithm the BIOS is using and whether or not the kernel recognizes that algorithm is useless to the kernel user. MKTME is a hardware feature that requires kernel support. MKTME detection code was added in advance of broader kernel support for MKTME that never followed. So, rather than continuing to spew needless and confusing messages about BIOS MKTME status, remove most of the MKTME pieces from detect_tme_early(). Keep one useful message: alert the user when BIOS enabled MKTME reduces the available physical address bits. Recovery of the MKTME consumed bits requires a reboot with MKTME disabled in BIOS. There is no functional change for the user, only a change in boot messages. Below is one example when both TME and MKTME are enabled in BIOS with AES_XTS_256 which is unknown to the detect tme code. Before: [] x86/tme: enabled by BIOS [] x86/tme: Unknown policy is active: 0x2 [] x86/mktme: No known encryption algorithm is supported: 0x4 [] x86/mktme: enabled by BIOS [] x86/mktme: 127 KeyIDs available After: [] x86/tme: enabled by BIOS [] x86/mktme: BIOS enable: x86_phys_bits reduced by 8 [1] commit cb06d8e3d020 ("x86/tme: Detect if TME and MKTME is activated by BIOS") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/86dfdf6ced8c9b790f9376bf6c7e22b5608f47c2.1715054189.git.alison.schofield%40intel.com
2024-05-26Linux 6.10-rc1Linus Torvalds1-3/+3
2024-05-26mm: percpu: Include smp.h in alloc_tag.hKent Overstreet1-0/+1
percpu.h depends on smp.h, but doesn't include it directly because of circular header dependency issues; percpu.h is needed in a bunch of low level headers. This fixes a randconfig build error on mips: include/linux/alloc_tag.h: In function '__alloc_tag_ref_set': include/asm-generic/percpu.h:31:40: error: implicit declaration of function 'raw_smp_processor_id' [-Werror=implicit-function-declaration] Reported-by: kernel test robot <lkp@intel.com> Fixes: 24e44cc22aa3 ("mm: percpu: enable per-cpu allocation tagging") Closes: https://lore.kernel.org/oe-kbuild-all/202405210052.DIrMXJNz-lkp@intel.com/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-26Revert "perf parse-events: Prefer sysfs/JSON hardware events over legacy"Arnaldo Carvalho de Melo4-103/+68
This reverts commit 617824a7f0f73e4de325cf8add58e55b28c12493. This made a simple 'perf record -e cycles:pp make -j199' stop working on the Ampere ARM64 system Linus uses to test ARM64 kernels, as discussed at length in the threads in the Link tags below. The fix provided by Ian wasn't acceptable and work to fix this will take time we don't have at this point, so lets revert this and work on it on the next devel cycle. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Cc: Ethan Adams <j.ethan.adams@gmail.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Tycho Andersen <tycho@tycho.pizza> Cc: Yang Jihong <yangjihong@bytedance.com> Link: https://lore.kernel.org/lkml/CAHk-=wi5Ri=yR2jBVk-4HzTzpoAWOgstr1LEvg_-OXtJvXXJOA@mail.gmail.com Link: https://lore.kernel.org/lkml/CAHk-=wiWvtFyedDNpoV7a8Fq_FpbB+F5KmWK2xPY3QoYseOf_A@mail.gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-05-24cifs: Fix missing set of remote_i_sizeDavid Howells2-3/+4
Occasionally, the generic/001 xfstest will fail indicating corruption in one of the copy chains when run on cifs against a server that supports FSCTL_DUPLICATE_EXTENTS_TO_FILE (eg. Samba with a share on btrfs). The problem is that the remote_i_size value isn't updated by cifs_setsize() when called by smb2_duplicate_extents(), but i_size *is*. This may cause cifs_remap_file_range() to then skip the bit after calling ->duplicate_extents() that sets sizes. Fix this by calling netfs_resize_file() in smb2_duplicate_extents() before calling cifs_setsize() to set i_size. This means we don't then need to call netfs_resize_file() upon return from ->duplicate_extents(), but we also fix the test to compare against the pre-dup inode size. [Note that this goes back before the addition of remote_i_size with the netfs_inode struct. It should probably have been setting cifsi->server_eof previously.] Fixes: cfc63fc8126a ("smb3: fix cached file size problems in duplicate extents (reflink)") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-24cifs: Fix smb3_insert_range() to move the zero_pointDavid Howells1-0/+1
Fix smb3_insert_range() to move the zero_point over to the new EOF. Without this, generic/147 fails as reads of data beyond the old EOF point return zeroes. Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Signed-off-by: David Howells <dhowells@redhat.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-24mm/ksm: fix possible UAF of stable_nodeChengming Zhou1-1/+2
The commit 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") introduced a possible failure case in the stable_tree_insert(), where we may free the new allocated stable_node_dup if we fail to prepare the missing chain node. Then that kfolio return and unlock with a freed stable_node set... And any MM activities can come in to access kfolio->mapping, so UAF. Fix it by moving folio_set_stable_node() to the end after stable_node is inserted successfully. Link: https://lkml.kernel.org/r/20240513-b4-ksm-stable-node-uaf-v1-1-f687de76f452@linux.dev Fixes: 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-24mm/memory-failure: fix handling of dissolved but not taken off from buddy pagesMiaohe Lin1-2/+2
When I did memory failure tests recently, below panic occurs: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:1009! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__del_page_from_free_list+0x151/0x180 RSP: 0018:ffffa49c90437998 EFLAGS: 00000046 RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0 RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69 R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80 R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009 FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0 Call Trace: <TASK> __rmqueue_pcplist+0x23b/0x520 get_page_from_freelist+0x26b/0xe40 __alloc_pages_noprof+0x113/0x1120 __folio_alloc_noprof+0x11/0xb0 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130 __alloc_fresh_hugetlb_folio+0xe7/0x140 alloc_pool_huge_folio+0x68/0x100 set_max_huge_pages+0x13d/0x340 hugetlb_sysctl_handler_common+0xe8/0x110 proc_sys_call_handler+0x194/0x280 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff916114887 RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887 RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003 RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0 R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004 R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- And before the panic, there had an warning about bad page state: BUG: Bad page state in process page-types pfn:8cee00 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) page_type: 0xffffff7f(buddy) raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22 Call Trace: <TASK> dump_stack_lvl+0x83/0xa0 bad_page+0x63/0xf0 free_unref_page+0x36e/0x5c0 unpoison_memory+0x50b/0x630 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xcd/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f189a514887 RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887 RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003 RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8 R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040 </TASK> The root cause should be the below race: memory_failure try_memory_failure_hugetlb me_huge_page __page_handle_poison dissolve_free_hugetlb_folio drain_all_pages -- Buddy page can be isolated e.g. for compaction. take_page_off_buddy -- Failed as page is not in the buddy list. -- Page can be putback into buddy after compaction. page_ref_inc -- Leads to buddy page with refcnt = 1. Then unpoison_memory() can unpoison the page and send the buddy page back into buddy list again leading to the above bad page state warning. And bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page. Fix this issue by only treating __page_handle_poison() as successful when it returns 1. Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com Fixes: ceaf8fbea79a ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>