aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/drivers/edac (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-05-27Merge tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/rasLinus Torvalds6-231/+422
Pull EDAC updates from Borislav Petkov: - ie31200: Add support for Raptor Lake-S and Alder Lake-S compute dies - Rework how RRL registers per channel tracking is done in order to support newer hardware with different RRL configurations and refactor that code. Add support for Granite Rapids server - i10nm: explicitly set RRL modes to fix any wrong BIOS programming - Properly save and restore Retry Read error Log channel configuration info on Intel drivers - igen6: Handle correctly the case of fused off memory controllers on Arizona Beach and Amston Lake SoCs before adding support for them - the usual set of fixes and cleanups * tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/bluefield: Don't use bluefield_edac_readl() result on error EDAC/i10nm: Fix the bitwise operation between variables of different sizes EDAC/ie31200: Add two Intel SoCs for EDAC support EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log() EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log() EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers EDAC/i10nm: Explicitly set the modes of the RRL register sets EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0 EDAC/skx_common: Fix general protection fault EDAC/igen6: Add Intel Amston Lake SoCs support EDAC/igen6: Add Intel Arizona Beach SoCs support EDAC/igen6: Skip absent memory controllers
2025-05-27Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+2
Pull irq cleanups from Thomas Gleixner: "A set of cleanups for the generic interrupt subsystem: - Consolidate on one set of functions for the interrupt domain code to get rid of pointlessly duplicated code with only marginal different semantics. - Update the documentation accordingly and consolidate the coding style of the irqdomain header" * tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) irqdomain: Consolidate coding style irqdomain: Fix kernel-doc and add it to Documentation Documentation: irqdomain: Update it Documentation: irq-domain.rst: Simple improvements Documentation: irq/concepts: Minor improvements Documentation: irq/concepts: Add commas and reflow irqdomain: Improve kernel-docs of functions irqdomain: Make struct irq_domain_info variables const irqdomain: Use irq_domain_instantiate()'s return value as initializers irqdomain: Drop irq_linear_revmap() pinctrl: keembay: Switch to irq_find_mapping() irqchip/armada-370-xp: Switch to irq_find_mapping() gpu: ipu-v3: Switch to irq_find_mapping() gpio: idt3243x: Switch to irq_find_mapping() sh: Switch to irq_find_mapping() powerpc: Switch to irq_find_mapping() irqdomain: Drop irq_domain_add_*() functions powerpc: Switch irq_domain_add_nomap() to use fwnode thermal: Switch to irq_domain_create_linear() soc: Switch to irq_domain_create_*() ...
2025-05-22EDAC/bluefield: Don't use bluefield_edac_readl() result on errorDavid Thompson1-5/+15
The bluefield_edac_readl() routine returns an uninitialized result on error paths. In those cases the calling routine should not use the uninitialized result. The driver should simply log the error, and then return early. Fixes: e41967575474 ("EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2") Signed-off-by: David Thompson <davthompson@nvidia.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com> Link: https://lore.kernel.org/20250318214747.12271-1-davthompson@nvidia.com
2025-05-16EDAC/altera: Switch to irq_domain_create_linear()Jiri Slaby (SUSE)1-2/+2
irq_domain_add_linear() is going away as being obsolete now. Switch to the preferred irq_domain_create_linear(). That differs in the first parameter: It takes more generic struct fwnode_handle instead of struct device_node. Therefore, of_fwnode_handle() is added around the parameter. Note some of the users can likely use dev->fwnode directly instead of indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not guaranteed to be set for all, so this has to be investigated on case to case basis (by people who can actually test with the HW). [ tglx: Fix up subject prefix ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-17-jirislaby@kernel.org
2025-05-13Merge branch 'x86/msr' into x86/core, to resolve conflictsIngo Molnar3-3/+5
Conflicts: arch/x86/boot/startup/sme.c arch/x86/coco/sev/core.c arch/x86/kernel/fpu/core.c arch/x86/kernel/fpu/xstate.c Semantic conflict: arch/x86/include/asm/sev-internal.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06Merge tag 'v6.15-rc5' into x86/msr, to pick up fixes and to resolve conflictsIngo Molnar2-4/+7
Conflicts: drivers/cpufreq/intel_pstate.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06Merge tag 'v6.15-rc5' into x86/cpu, to resolve conflictsIngo Molnar2-4/+7
Conflicts: tools/arch/x86/include/asm/cpufeatures.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-02x86/msr: Add explicit includes of <asm/msr.h>Xin Li (Intel)2-0/+2
For historic reasons there are some TSC-related functions in the <asm/msr.h> header, even though there's an <asm/tsc.h> header. To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and to eventually eliminate the inclusion of <asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency to the source files that reference definitions from <asm/msr.h>. [ mingo: Clarified the changelog. ] Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-04-28EDAC/altera: Set DDR and SDMMC interrupt mask before registrationNiravkumar L Rabara2-3/+6
Mask DDR and SDMMC in probe function to avoid spurious interrupts before registration. Removed invalid register write to system manager. Fixes: 1166fde93d5b ("EDAC, altera: Add Arria10 ECC memory init functions") Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com> Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: stable@kernel.org Link: https://lore.kernel.org/20250425142640.33125-3-matthew.gerlach@altera.com
2025-04-28EDAC/altera: Test the correct error reg offsetNiravkumar L Rabara1-1/+1
Test correct structure member, ecc_cecnt_offset, before using it. [ bp: Massage commit message. ] Fixes: 73bcc942f427 ("EDAC, altera: Add Arria10 EDAC support") Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com> Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Cc: stable@kernel.org Link: https://lore.kernel.org/20250425142640.33125-2-matthew.gerlach@altera.com
2025-04-24EDAC/i10nm: Fix the bitwise operation between variables of different sizesQiuxu Zhuo1-2/+2
The tool of Smatch static checker reported the following warning: drivers/edac/i10nm_base.c:364 show_retry_rd_err_log() warn: should bitwise negate be 'ullong'? This warning was due to the bitwise NOT/AND operations between 'status_mask' (a u32 type) and 'log' (a u64 type), which resulted in the high 32 bits of 'log' were cleared. This was a false positive warning, as only the low 32 bits of 'log' was written to the first RRL memory controller register (a u32 type). To improve code sanity, fix this warning by changing 'status_mask' to a u64 type, ensuring it matches the size of 'log' for bitwise operations. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/aAih0KmEVq7ch6v2@stanley.mountain/ Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250424081454.2952632-1-qiuxu.zhuo@intel.com
2025-04-22EDAC/ie31200: Add two Intel SoCs for EDAC supportQiuxu Zhuo1-0/+6
Add two compute die IDs for Raptor Lake-S and Alder Lake-S for EDAC support. Note that because Alder Lake-S shares the same memory controller registers as Raptor Lake-S, it can reuse the configuration data of Raptor Lake-S for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: James Jernigan <jameswestonjernigan@gmail.com> Link: https://lore.kernel.org/r/20250422134450.1880648-1-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids serverQiuxu Zhuo2-4/+37
Compared to previous generations, Granite Rapids defines the RRL control bits {en_patspr, noover, en} in different positions, adds an extra RRL set for the new mode of the first patrol-scrub read error, and extends the number of CORRERRCNT registers from 4 to 8, encoding one counter per CORRERRCNT register. Add a Granite Rapids reg_rrl configuration table and adjust the code to accommodate the differences mentioned above for RRL support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-8-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log()Qiuxu Zhuo2-92/+81
Make the {valid bit, overwritten status, number} of RRL registers and the {number, offsets, widths} of per-channel CORRERRCNT registers configurable. Refactor show_retry_rd_err_log() to use the configurable fields of struct reg_rrl, making the code more scalable and simpler. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-7-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log()Qiuxu Zhuo2-101/+156
Refactor enable_retry_rd_err_log() using helper functions for both DDR and HBM, making the RRL control bits configurable instead of hard-coded. Additionally, explicitly define the four RRL modes for better readability. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-6-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Structure the per-channel RRL registersQiuxu Zhuo2-44/+69
As the number of RRL (retry_rd_err_log) registers per memory channel increases, the positions of the RRL control bits and the widths of the RRL registers vary across different CPU generations. Adding RRL support for a new CPU requires handling these differences throughout the RRL-related code. Structure the offsets, widths, control bit positions, set numbers, modes, etc., of the per-channel RRL registers and make them configurable to facilitate easier RRL support for new CPUs. No functional changes are intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-5-qiuxu.zhuo@intel.com
2025-04-17EDAC/i10nm: Explicitly set the modes of the RRL register setsQiuxu Zhuo1-0/+10
The i10nm_edac driver uses the default modes (either patrol scrub read or on-demand read) of the RRL register sets configured by the BIOS. Explicitly set the modes during the loading of the i10nm_edac driver with the module parameter retry_rd_err_log=2. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-4-qiuxu.zhuo@intel.com
2025-04-17EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0Qiuxu Zhuo2-19/+27
When enabling the retry_rd_err_log (RRL) feature during the loading of the i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL control mode), the default values of the control bits of RRL are saved so that they can be restored during the unloading of the driver. In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo channel 0 during the loading of the driver, resulting in the loss of saved RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to be wrongly restored with the values from pseudo channel 1 when unloading the driver. Fix this issue by creating two separate groups of RRL control registers per channel to save default RRL settings of two {sub-,pseudo-}channels. Fixes: acd4cf68fefe ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com
2025-04-17EDAC/skx_common: Fix general protection faultQiuxu Zhuo1-0/+1
After loading i10nm_edac (which automatically loads skx_edac_common), if unload only i10nm_edac, then reload it and perform error injection testing, a general protection fault may occur: mce: [Hardware Error]: Machine check events logged Oops: general protection fault ... ... Workqueue: events mce_gen_pool_process RIP: 0010:string+0x53/0xe0 ... Call Trace: <TASK> ? die_addr+0x37/0x90 ? exc_general_protection+0x1e7/0x3f0 ? asm_exc_general_protection+0x26/0x30 ? string+0x53/0xe0 vsnprintf+0x23e/0x4c0 snprintf+0x4d/0x70 skx_adxl_decode+0x16a/0x330 [skx_edac_common] skx_mce_check_error.part.0+0xf8/0x220 [skx_edac_common] skx_mce_check_error+0x17/0x20 [skx_edac_common] ... The issue arose was because the variable 'adxl_component_count' (inside skx_edac_common), which counts the ADXL components, was not reset. During the reloading of i10nm_edac, the count was incremented by the actual number of ADXL components again, resulting in a count that was double the real number of ADXL components. This led to an out-of-bounds reference to the ADXL component array, causing the general protection fault above. Fix this issue by resetting the 'adxl_component_count' in adxl_put(), which is called during the unloading of {skx,i10nm}_edac. Fixes: 123b15863550 ("EDAC, i10nm: make skx_common.o a separate module") Reported-by: Feng Xu <feng.f.xu@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Feng Xu <feng.f.xu@intel.com> Link: https://lore.kernel.org/r/20250417150724.1170168-2-qiuxu.zhuo@intel.com
2025-04-17EDAC/igen6: Add Intel Amston Lake SoCs supportQiuxu Zhuo1-0/+4
Intel Amston Lake is a series of SoCs tailored for edge computing needs. The Amston Lake SoCs, equipped with IBECC(In-Band ECC) capability, share the same IBECC registers with Alder Lake-N SoCs. Add the Intel Amston Lake SoC compute die ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250408132455.489046-4-qiuxu.zhuo@intel.com
2025-04-17EDAC/igen6: Add Intel Arizona Beach SoCs supportQiuxu Zhuo1-0/+4
The Intel Arizona Beach SoC series is oriented toward network computing. Some types of these SoCs are equipped with IBECC(In-Band ECC) and share the same IBECC registers with Alder Lake-N SoCs. Add a die ID for Arizona Beach SoC for EDAC support. [Tony: s/Arizona Lake/Arizona Beach/ in commit message] Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250408132455.489046-3-qiuxu.zhuo@intel.com
2025-04-17EDAC/igen6: Skip absent memory controllersQiuxu Zhuo1-16/+62
Some BIOS versions may fuse off certain memory controllers and set the registers of these absent memory controllers to ~0. The current igen6_edac mistakenly enumerates these absent memory controllers and registers them with the EDAC core. Skip the absent memory controllers to avoid mistakenly enumerating them. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250408132455.489046-2-qiuxu.zhuo@intel.com
2025-04-14x86/platform/amd: Move the <asm/amd_node.h> header to <asm/amd/node.h>Ingo Molnar1-1/+1
Collect AMD specific platform header files in <asm/amd/*.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <superm1@kernel.org> Link: https://lore.kernel.org/r/20250413084144.3746608-7-mingo@kernel.org
2025-04-14x86/platform/amd: Move the <asm/amd_nb.h> header to <asm/amd/nb.h>Ingo Molnar1-1/+1
Collect AMD specific platform header files in <asm/amd/*.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mario Limonciello <superm1@kernel.org> Link: https://lore.kernel.org/r/20250413084144.3746608-4-mingo@kernel.org
2025-04-10x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'Ingo Molnar1-3/+3
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-25Merge tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/rasLinus Torvalds17-314/+1493
Pull EDAC updates from Borislav Petkov: - Add infrastructure support to EDAC in order to be able to register memory scrubbing RAS functionality with the kernel and expose sysfs nodes to control such scrubbing functionality. The main use case is CXL devices which provide different scrubbers for their built-in memories so that tools like rasdaemon can configure and control memory scrubbing and other, more advanced RAS functionality (Shiju Jose and Jonathan Cameron) - Add support to ie31200_edac for client SoCs like Raptor Lake-S which have multiple memory controllers and out-of-band ECC capability (Qiuxu Zhuo) - The usual round of cleanups, simplifications and fixlets * tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits) MAINTAINERS: Add a secondary maintainer for bluefield_edac EDAC/ie31200: Switch Raptor Lake-S to interrupt mode EDAC/ie31200: Add Intel Raptor Lake-S SoCs support EDAC/ie31200: Break up ie31200_probe1() EDAC/ie31200: Fold the two channel loops into one loop EDAC/ie31200: Make struct dimm_data contain decoded information EDAC/ie31200: Make the memory controller resources configurable EDAC/ie31200: Simplify the pci_device_id table EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info() EDAC/ie31200: Fix the error path order of ie31200_init() EDAC/ie31200: Fix the DIMM size mask for several SoCs EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer EDAC/device: Fix dev_set_name() format string EDAC/pnd2: Make read-only const array intlv static EDAC/igen6: Constify struct res_config EDAC/amd64: Simplify return statement in dct_ecc_enabled() EDAC: Update memory repair control interface for memory sparing feature EDAC: Add a memory repair control feature EDAC: Use string choice helper functions EDAC: Add a Error Check Scrub control feature ...
2025-03-25Merge remote-tracking branches 'ras/edac-cxl', 'ras/edac-drivers' and 'ras/edac-misc' into edac-updatesBorislav Petkov (AMD)12-314/+504
* ras/edac-cxl: EDAC/device: Fix dev_set_name() format string EDAC: Update memory repair control interface for memory sparing feature EDAC: Add a memory repair control feature EDAC: Add a Error Check Scrub control feature EDAC: Add scrub control feature EDAC: Add support for EDAC device features control * ras/edac-drivers: EDAC/ie31200: Switch Raptor Lake-S to interrupt mode EDAC/ie31200: Add Intel Raptor Lake-S SoCs support EDAC/ie31200: Break up ie31200_probe1() EDAC/ie31200: Fold the two channel loops into one loop EDAC/ie31200: Make struct dimm_data contain decoded information EDAC/ie31200: Make the memory controller resources configurable EDAC/ie31200: Simplify the pci_device_id table EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info() EDAC/ie31200: Fix the error path order of ie31200_init() EDAC/ie31200: Fix the DIMM size mask for several SoCs EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids EDAC/igen6: Fix the flood of invalid error reports EDAC/ie31200: work around false positive build warning * ras/edac-misc: MAINTAINERS: Add a secondary maintainer for bluefield_edac EDAC/pnd2: Make read-only const array intlv static EDAC/igen6: Constify struct res_config EDAC/amd64: Simplify return statement in dct_ecc_enabled() EDAC: Use string choice helper functions Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-03-10EDAC/ie31200: Switch Raptor Lake-S to interrupt modeQiuxu Zhuo2-7/+78
Raptor Lake-S SoCs notify correctable memory errors via CMCI (Corrected Machine Check Interrupt). Switch Raptor Lake-S EDAC support from polling to interrupt mode by registering the callback to the MCE decode notifier chain. Note that as Raptor Lake-S SoCs may not recover from uncorrectable memory errors, the system will hang as soon as this type of error occurs, and the registered callback on the MCE decode chain will not be executed. This is the expected behavior. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-12-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Add Intel Raptor Lake-S SoCs supportQiuxu Zhuo1-33/+149
The Intel Raptor Lake-S SoC contains two memory controllers with DDR5 memory type and out-of-band ECC capability. The resource definitions of the memory controller are different from previous generations. One notable difference is that the PCI ERRSTS register is deprecated and is not used to indicate the presence of errors or to clear the MMIO-mapped ECC error log regsiters. Extend the ie31200_edac driver to support multiple memory controllers, add a resource configuration table and use an MSR register to clear the ECC error log registers to provide EDAC support for Raptor Lake-S SoCs. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-11-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Break up ie31200_probe1()Qiuxu Zhuo1-47/+61
Split ie31200_probe1() into two helper functions to easily extend support for multiple memory controllers. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-10-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fold the two channel loops into one loopQiuxu Zhuo1-15/+7
Fold the two channel loops to simplify the code and improve readability. Also, delete the comments related to the DRB register, as this register is not used here. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-9-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Make struct dimm_data contain decoded informationQiuxu Zhuo1-43/+19
The current dimm_data structure contains encoded DIMM information, which needs to be decoded for a given SoC when it is used. Make it contain decoded information when it's initialized so that the places where it is used do not need to decode it again, thereby simplifying the code. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-8-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Make the memory controller resources configurableQiuxu Zhuo1-146/+111
The resources such as MMIO, register offset, register mask, memory DIMM information, ECC error log location, etc., of the memory controller, and the number of memory controllers can be device-ID-specific. It requires adding numerous 'if (device_id == new_id)' special handling cases to the code to support a new SoC. Make these kinds of resources configurable and separate them from the code to facilitate the addition of new SoC support. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-7-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Simplify the pci_device_id tableQiuxu Zhuo1-22/+22
Use PCI_VDEVICE() to simplify the pci_device_id table. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-6-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()Qiuxu Zhuo1-12/+12
The 3rd parameter of *populate_dimm_info() pertains to the DIMM index within a channel, not the channel index. Fix the parameter name to dimm to reflect its actual purpose. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-5-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the error path order of ie31200_init()Qiuxu Zhuo1-5/+7
The error path order of ie31200_init() is incorrect, fix it. Fixes: 709ed1bcef12 ("EDAC/ie31200: Fallback if host bridge device is already initialized") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-4-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the DIMM size mask for several SoCsQiuxu Zhuo1-1/+2
The DIMM size mask for {Sky, Kaby, Coffee} Lake is not bits{7:0}, but bits{5:0}. Fix it. Fixes: 953dee9bbd24 ("EDAC, ie31200_edac: Add Skylake support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-3-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layerQiuxu Zhuo1-3/+1
The EDAC_MC_LAYER_CHIP_SELECT layer pertains to the rank, not the DIMM. Fix its size to reflect the number of ranks instead of the number of DIMMs. Also delete the unused macros IE31200_{DIMMS,RANKS}. Fixes: 7ee40b897d18 ("ie31200_edac: Introduce the driver") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-2-qiuxu.zhuo@intel.com
2025-03-05EDAC/device: Fix dev_set_name() format stringArnd Bergmann1-1/+1
Passing a variable string as the format to dev_set_name() causes a W=1 warning: drivers/edac/edac_device.c:736:9: error: format not a string literal and no format arguments [-Werror=format-security] 736 | ret = dev_set_name(&ctx->dev, name); | ^~~ Use a literal "%s" instead so the name can be the argument. Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250304143603.995820-1-arnd@kernel.org
2025-03-03EDAC/pnd2: Make read-only const array intlv staticColin Ian King1-2/+2
Don't populate the const read-only array intlv on the stack at run time, instead make it static. This also shrinks the object size: $ size pnd2_edac.o.* text data bss dec hex filename 15632 264 1384 17280 4380 pnd2_edac.o.new 15644 264 1384 17292 438c pnd2_edac.o.old Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20240919170427.497429-1-colin.i.king@gmail.com
2025-03-03EDAC/igen6: Constify struct res_configChristophe JAILLET1-10/+10
The res_config structs are not modified in this driver. Constifying these structures moves some data to a read-only section, so increase overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 36777 2479 4304 43560 aa28 drivers/edac/igen6_edac.o After: ===== text data bss dec hex filename 37297 1959 4304 43560 aa28 drivers/edac/igen6_edac.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/a06153870951a64b438e76adf97d440e02c1a1fc.1738355198.git.christophe.jaillet@wanadoo.fr
2025-02-28EDAC/amd64: Simplify return statement in dct_ecc_enabled()Thorsten Blum1-4/+1
Simplify the return statement to improve the code's readability. No functional changes. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20250201130953.1377-2-thorsten.blum@linux.dev
2025-02-26EDAC: Update memory repair control interface for memory sparing featureShiju Jose1-0/+84
Update memory repair control interface for memory sparing feature. CXL memory devices can support soft and hard memory sparing at cacheline, row, bank and rank granularities. Memory sparing is defined as a repair function that replaces a portion of memory with a portion of functional memory at that same granularity. When a CXL device detects an error in memory, it will report to the host that there's need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other attributes of the memory to repair such as bank group, bank, rank, row, column, channel etc. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control. [ bp: Massage. ] Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com
2025-02-26EDAC: Add a memory repair control featureShiju Jose4-0/+319
Add a generic EDAC memory repair control driver to manage memory repairs in the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR features. For example, a CXL device with DRAM components that support PPR features may implement PPR maintenance operations. DRAM components may support two types of PPR: - hard PPR, for a permanent row repair, and - soft PPR, for a temporary row repair. Soft PPR is much faster than hard PPR, but the repair is lost with a power cycle. When a CXL device detects an error in a memory, it may report the need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other optional attributes of the memory to repair. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control. Device with memory repair features registers with EDAC device driver, which retrieves a memory repair descriptor from EDAC memory repair driver and exposes the sysfs repair control attributes to userspace in /sys/bus/edac/devices/<dev-name>/mem_repairX/. The common memory repair control interface abstracts the control of arbitrary memory repair functionality into a standardized set of functions. The sysfs memory repair attribute nodes are only available if the client driver has implemented the corresponding attribute callback function and provided operations to the EDAC device driver during registration. [ bp: Massage, fixup edac_dev_register() retvals, merge write_overflow fix to mem_repair_create_desc() ] Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
2025-02-25EDAC: Use string choice helper functionsThorsten Blum5-37/+42
Remove hard-coded strings by using the str_enabled_disabled(), str_yes_no(), str_write_read(), and str_plural() helper functions. Add a space in "All DIMMs support ECC: yes/no" to improve readability. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20250223212429.3466-2-thorsten.blum@linux.dev
2025-02-25EDAC: Add a Error Check Scrub control featureShiju Jose4-0/+234
Add an Error Check Scrub (ECS) control to manage a memory device's ECS feature. The ECS is a feature defined in JEDEC DDR5 SDRAM Specification (JESD79-5) and allows the DRAM to internally read, correct single-bit errors, and write back corrected data bits to the DRAM array while providing transparency to error counts. The DDR5 device contains a number of memory media Field Replaceable Units (FRU) per device. The DDR5 ECS feature and thus the ECS control driver supports configuring the ECS parameters per FRU. Memory devices support the ECS feature register with the EDAC device driver, which retrieves the ECS descriptor from the EDAC ECS driver. This driver exposes sysfs ECS control attributes to userspace via /sys/bus/edac/devices/<dev-name>/ecs_fruX/. The common sysfs ECS control interface abstracts the control of an arbitrary ECS functionality to a common set of functions. Support for the ECS feature is added separately because the control attributes of the DDR5 ECS feature differ from those of the scrub feature. The sysfs ECS attribute nodes are only present if the client driver has implemented the corresponding attribute callback function and passed the necessary operations to the EDAC RAS feature driver during registration. [ bp: Massage, fixup edac_dev_register() retvals. ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com
2025-02-25EDAC: Add scrub control featureShiju Jose4-4/+255
Add a scrub control to manage memory scrubbers in the system. Devices with a scrub feature register with the EDAC device driver which retrieves the scrub descriptor from the scrub driver and exposes the control attributes for a instance to userspace at /sys/bus/edac/devices/<dev-name>/scrubX/. The common sysfs scrub control interface abstracts the control of arbitrary scrubbing functionality into a common set of functions. The attribute nodes are only present if the client driver has implemented the corresponding attribute callback function and passed the operations to the device driver during registration. [ bp: Massage commit message, docs and code, simplify text a bit. Integrate fixup for: https://lore.kernel.org/r/202502251009.0sGkolEJ-lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-3-shiju.jose@huawei.com
2025-02-25EDAC: Add support for EDAC device features controlShiju Jose1-0/+101
Add generic EDAC device feature controls supporting the registration of RAS features available in the system. The driver exposes control attributes for these features to userspace in /sys/bus/edac/devices/<dev-name>/<ras-feature> [ bp: Touch-up documentation, simplify, make edac_dev_type static, fixup edac_dev_register() retvals. ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-2-shiju.jose@huawei.com
2025-02-20EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald RapidsQiuxu Zhuo3-0/+46
When doing error injection to some memory DIMMs on certain Intel Emerald Rapids servers, the i10nm_edac missed error reports for some memory DIMMs. Certain BIOS configurations may hide some memory controllers, and the i10nm_edac doesn't enumerate these hidden memory controllers. However, the ADXL decodes memory errors using memory controller physical indices even if there are hidden memory controllers. Therefore, the memory controller physical indices reported by the ADXL may mismatch the logical indices enumerated by the i10nm_edac, resulting in missed error reports for some memory DIMMs. Fix this issue by creating a mapping table from memory controller physical indices (used by the ADXL) to logical indices (used by the i10nm_edac) and using it to convert the physical indices to the logical indices during the error handling process. Fixes: c545f5e41225 ("EDAC/i10nm: Skip the absent memory controllers") Reported-by: Kevin Chang <kevin1.chang@intel.com> Tested-by: Kevin Chang <kevin1.chang@intel.com> Reported-by: Thomas Chen <Thomas.Chen@intel.com> Tested-by: Thomas Chen <Thomas.Chen@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
2025-02-20EDAC/igen6: Fix the flood of invalid error reportsQiuxu Zhuo1-6/+15
The ECC_ERROR_LOG register of certain SoCs may contain the invalid value ~0, which results in a flood of invalid error reports in polling mode. Fix the flood of invalid error reports by skipping the invalid ECC error log value ~0. Fixes: e14232afa944 ("EDAC/igen6: Add polling support") Reported-by: Ramses <ramses@well-founded.dev> Closes: https://lore.kernel.org/all/OISL8Rv--F-9@well-founded.dev/ Tested-by: Ramses <ramses@well-founded.dev> Reported-by: John <therealgraysky@proton.me> Closes: https://lore.kernel.org/all/p5YcxOE6M3Ncxpn2-Ia_wCt61EM4LwIiN3LroQvT_-G2jMrFDSOW5k2A9D8UUzD2toGpQBN1eI0sL5dSKnkO8iteZegLoQEj-DwQaMhGx4A=@proton.me/ Tested-by: John <therealgraysky@proton.me> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250212083354.31919-1-qiuxu.zhuo@intel.com