Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull EDAC updates from Borislav Petkov:
- ie31200: Add support for Raptor Lake-S and Alder Lake-S compute dies
- Rework how RRL registers per channel tracking is done in order to
support newer hardware with different RRL configurations and refactor
that code. Add support for Granite Rapids server
- i10nm: explicitly set RRL modes to fix any wrong BIOS programming
- Properly save and restore Retry Read error Log channel configuration
info on Intel drivers
- igen6: Handle correctly the case of fused off memory controllers on
Arizona Beach and Amston Lake SoCs before adding support for them
- the usual set of fixes and cleanups
* tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/bluefield: Don't use bluefield_edac_readl() result on error
EDAC/i10nm: Fix the bitwise operation between variables of different sizes
EDAC/ie31200: Add two Intel SoCs for EDAC support
EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server
EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log()
EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log()
EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers
EDAC/i10nm: Explicitly set the modes of the RRL register sets
EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0
EDAC/skx_common: Fix general protection fault
EDAC/igen6: Add Intel Amston Lake SoCs support
EDAC/igen6: Add Intel Arizona Beach SoCs support
EDAC/igen6: Skip absent memory controllers
|
|
Pull irq cleanups from Thomas Gleixner:
"A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code
to get rid of pointlessly duplicated code with only marginal
different semantics.
- Update the documentation accordingly and consolidate the coding
style of the irqdomain header"
* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
irqdomain: Consolidate coding style
irqdomain: Fix kernel-doc and add it to Documentation
Documentation: irqdomain: Update it
Documentation: irq-domain.rst: Simple improvements
Documentation: irq/concepts: Minor improvements
Documentation: irq/concepts: Add commas and reflow
irqdomain: Improve kernel-docs of functions
irqdomain: Make struct irq_domain_info variables const
irqdomain: Use irq_domain_instantiate()'s return value as initializers
irqdomain: Drop irq_linear_revmap()
pinctrl: keembay: Switch to irq_find_mapping()
irqchip/armada-370-xp: Switch to irq_find_mapping()
gpu: ipu-v3: Switch to irq_find_mapping()
gpio: idt3243x: Switch to irq_find_mapping()
sh: Switch to irq_find_mapping()
powerpc: Switch to irq_find_mapping()
irqdomain: Drop irq_domain_add_*() functions
powerpc: Switch irq_domain_add_nomap() to use fwnode
thermal: Switch to irq_domain_create_linear()
soc: Switch to irq_domain_create_*()
...
|
|
The bluefield_edac_readl() routine returns an uninitialized result on error
paths. In those cases the calling routine should not use the uninitialized
result. The driver should simply log the error, and then return early.
Fixes: e41967575474 ("EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2")
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com>
Link: https://lore.kernel.org/20250318214747.12271-1-davthompson@nvidia.com
|
|
irq_domain_add_linear() is going away as being obsolete now. Switch to
the preferred irq_domain_create_linear(). That differs in the first
parameter: It takes more generic struct fwnode_handle instead of struct
device_node. Therefore, of_fwnode_handle() is added around the
parameter.
Note some of the users can likely use dev->fwnode directly instead of
indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all, so this has to be investigated on case to
case basis (by people who can actually test with the HW).
[ tglx: Fix up subject prefix ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-17-jirislaby@kernel.org
|
|
Conflicts:
arch/x86/boot/startup/sme.c
arch/x86/coco/sev/core.c
arch/x86/kernel/fpu/core.c
arch/x86/kernel/fpu/xstate.c
Semantic conflict:
arch/x86/include/asm/sev-internal.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Conflicts:
drivers/cpufreq/intel_pstate.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Conflicts:
tools/arch/x86/include/asm/cpufeatures.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.
To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.
[ mingo: Clarified the changelog. ]
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
|
|
Mask DDR and SDMMC in probe function to avoid spurious interrupts before
registration. Removed invalid register write to system manager.
Fixes: 1166fde93d5b ("EDAC, altera: Add Arria10 ECC memory init functions")
Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com>
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250425142640.33125-3-matthew.gerlach@altera.com
|
|
Test correct structure member, ecc_cecnt_offset, before using it.
[ bp: Massage commit message. ]
Fixes: 73bcc942f427 ("EDAC, altera: Add Arria10 EDAC support")
Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@altera.com>
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250425142640.33125-2-matthew.gerlach@altera.com
|
|
The tool of Smatch static checker reported the following warning:
drivers/edac/i10nm_base.c:364 show_retry_rd_err_log()
warn: should bitwise negate be 'ullong'?
This warning was due to the bitwise NOT/AND operations between
'status_mask' (a u32 type) and 'log' (a u64 type), which resulted in
the high 32 bits of 'log' were cleared.
This was a false positive warning, as only the low 32 bits of 'log' was
written to the first RRL memory controller register (a u32 type).
To improve code sanity, fix this warning by changing 'status_mask' to
a u64 type, ensuring it matches the size of 'log' for bitwise operations.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/aAih0KmEVq7ch6v2@stanley.mountain/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250424081454.2952632-1-qiuxu.zhuo@intel.com
|
|
Add two compute die IDs for Raptor Lake-S and Alder Lake-S for EDAC
support. Note that because Alder Lake-S shares the same memory controller
registers as Raptor Lake-S, it can reuse the configuration data of Raptor
Lake-S for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: James Jernigan <jameswestonjernigan@gmail.com>
Link: https://lore.kernel.org/r/20250422134450.1880648-1-qiuxu.zhuo@intel.com
|
|
Compared to previous generations, Granite Rapids defines the RRL control
bits {en_patspr, noover, en} in different positions, adds an extra RRL set
for the new mode of the first patrol-scrub read error, and extends the
number of CORRERRCNT registers from 4 to 8, encoding one counter per
CORRERRCNT register.
Add a Granite Rapids reg_rrl configuration table and adjust the code to
accommodate the differences mentioned above for RRL support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-8-qiuxu.zhuo@intel.com
|
|
Make the {valid bit, overwritten status, number} of RRL registers and the
{number, offsets, widths} of per-channel CORRERRCNT registers configurable.
Refactor show_retry_rd_err_log() to use the configurable fields of struct
reg_rrl, making the code more scalable and simpler.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-7-qiuxu.zhuo@intel.com
|
|
Refactor enable_retry_rd_err_log() using helper functions for both
DDR and HBM, making the RRL control bits configurable instead of
hard-coded. Additionally, explicitly define the four RRL modes for
better readability.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-6-qiuxu.zhuo@intel.com
|
|
As the number of RRL (retry_rd_err_log) registers per memory channel
increases, the positions of the RRL control bits and the widths of the
RRL registers vary across different CPU generations. Adding RRL support
for a new CPU requires handling these differences throughout the
RRL-related code.
Structure the offsets, widths, control bit positions, set numbers, modes,
etc., of the per-channel RRL registers and make them configurable to
facilitate easier RRL support for new CPUs.
No functional changes are intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-5-qiuxu.zhuo@intel.com
|
|
The i10nm_edac driver uses the default modes (either patrol scrub read
or on-demand read) of the RRL register sets configured by the BIOS.
Explicitly set the modes during the loading of the i10nm_edac driver with
the module parameter retry_rd_err_log=2.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-4-qiuxu.zhuo@intel.com
|
|
When enabling the retry_rd_err_log (RRL) feature during the loading of the
i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL
control mode), the default values of the control bits of RRL are saved so
that they can be restored during the unloading of the driver.
In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo
channel 0 during the loading of the driver, resulting in the loss of saved
RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to
be wrongly restored with the values from pseudo channel 1 when unloading
the driver.
Fix this issue by creating two separate groups of RRL control registers
per channel to save default RRL settings of two {sub-,pseudo-}channels.
Fixes: acd4cf68fefe ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com
|
|
After loading i10nm_edac (which automatically loads skx_edac_common), if
unload only i10nm_edac, then reload it and perform error injection testing,
a general protection fault may occur:
mce: [Hardware Error]: Machine check events logged
Oops: general protection fault ...
...
Workqueue: events mce_gen_pool_process
RIP: 0010:string+0x53/0xe0
...
Call Trace:
<TASK>
? die_addr+0x37/0x90
? exc_general_protection+0x1e7/0x3f0
? asm_exc_general_protection+0x26/0x30
? string+0x53/0xe0
vsnprintf+0x23e/0x4c0
snprintf+0x4d/0x70
skx_adxl_decode+0x16a/0x330 [skx_edac_common]
skx_mce_check_error.part.0+0xf8/0x220 [skx_edac_common]
skx_mce_check_error+0x17/0x20 [skx_edac_common]
...
The issue arose was because the variable 'adxl_component_count' (inside
skx_edac_common), which counts the ADXL components, was not reset. During
the reloading of i10nm_edac, the count was incremented by the actual number
of ADXL components again, resulting in a count that was double the real
number of ADXL components. This led to an out-of-bounds reference to the
ADXL component array, causing the general protection fault above.
Fix this issue by resetting the 'adxl_component_count' in adxl_put(),
which is called during the unloading of {skx,i10nm}_edac.
Fixes: 123b15863550 ("EDAC, i10nm: make skx_common.o a separate module")
Reported-by: Feng Xu <feng.f.xu@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-2-qiuxu.zhuo@intel.com
|
|
Intel Amston Lake is a series of SoCs tailored for edge computing needs.
The Amston Lake SoCs, equipped with IBECC(In-Band ECC) capability, share
the same IBECC registers with Alder Lake-N SoCs. Add the Intel Amston Lake
SoC compute die ID for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-4-qiuxu.zhuo@intel.com
|
|
The Intel Arizona Beach SoC series is oriented toward network computing.
Some types of these SoCs are equipped with IBECC(In-Band ECC) and share
the same IBECC registers with Alder Lake-N SoCs. Add a die ID for Arizona
Beach SoC for EDAC support.
[Tony: s/Arizona Lake/Arizona Beach/ in commit message]
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-3-qiuxu.zhuo@intel.com
|
|
Some BIOS versions may fuse off certain memory controllers and set the
registers of these absent memory controllers to ~0. The current igen6_edac
mistakenly enumerates these absent memory controllers and registers them
with the EDAC core.
Skip the absent memory controllers to avoid mistakenly enumerating them.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-2-qiuxu.zhuo@intel.com
|
|
Collect AMD specific platform header files in <asm/amd/*.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <superm1@kernel.org>
Link: https://lore.kernel.org/r/20250413084144.3746608-7-mingo@kernel.org
|
|
Collect AMD specific platform header files in <asm/amd/*.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <superm1@kernel.org>
Link: https://lore.kernel.org/r/20250413084144.3746608-4-mingo@kernel.org
|
|
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull EDAC updates from Borislav Petkov:
- Add infrastructure support to EDAC in order to be able to register
memory scrubbing RAS functionality with the kernel and expose sysfs
nodes to control such scrubbing functionality.
The main use case is CXL devices which provide different scrubbers
for their built-in memories so that tools like rasdaemon can
configure and control memory scrubbing and other, more advanced RAS
functionality (Shiju Jose and Jonathan Cameron)
- Add support to ie31200_edac for client SoCs like Raptor Lake-S which
have multiple memory controllers and out-of-band ECC capability
(Qiuxu Zhuo)
- The usual round of cleanups, simplifications and fixlets
* tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits)
MAINTAINERS: Add a secondary maintainer for bluefield_edac
EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
EDAC/ie31200: Break up ie31200_probe1()
EDAC/ie31200: Fold the two channel loops into one loop
EDAC/ie31200: Make struct dimm_data contain decoded information
EDAC/ie31200: Make the memory controller resources configurable
EDAC/ie31200: Simplify the pci_device_id table
EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
EDAC/ie31200: Fix the error path order of ie31200_init()
EDAC/ie31200: Fix the DIMM size mask for several SoCs
EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
EDAC/device: Fix dev_set_name() format string
EDAC/pnd2: Make read-only const array intlv static
EDAC/igen6: Constify struct res_config
EDAC/amd64: Simplify return statement in dct_ecc_enabled()
EDAC: Update memory repair control interface for memory sparing feature
EDAC: Add a memory repair control feature
EDAC: Use string choice helper functions
EDAC: Add a Error Check Scrub control feature
...
|
|
* ras/edac-cxl:
EDAC/device: Fix dev_set_name() format string
EDAC: Update memory repair control interface for memory sparing feature
EDAC: Add a memory repair control feature
EDAC: Add a Error Check Scrub control feature
EDAC: Add scrub control feature
EDAC: Add support for EDAC device features control
* ras/edac-drivers:
EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
EDAC/ie31200: Break up ie31200_probe1()
EDAC/ie31200: Fold the two channel loops into one loop
EDAC/ie31200: Make struct dimm_data contain decoded information
EDAC/ie31200: Make the memory controller resources configurable
EDAC/ie31200: Simplify the pci_device_id table
EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
EDAC/ie31200: Fix the error path order of ie31200_init()
EDAC/ie31200: Fix the DIMM size mask for several SoCs
EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids
EDAC/igen6: Fix the flood of invalid error reports
EDAC/ie31200: work around false positive build warning
* ras/edac-misc:
MAINTAINERS: Add a secondary maintainer for bluefield_edac
EDAC/pnd2: Make read-only const array intlv static
EDAC/igen6: Constify struct res_config
EDAC/amd64: Simplify return statement in dct_ecc_enabled()
EDAC: Use string choice helper functions
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Raptor Lake-S SoCs notify correctable memory errors via CMCI (Corrected
Machine Check Interrupt). Switch Raptor Lake-S EDAC support from polling
to interrupt mode by registering the callback to the MCE decode notifier
chain.
Note that as Raptor Lake-S SoCs may not recover from uncorrectable memory
errors, the system will hang as soon as this type of error occurs, and the
registered callback on the MCE decode chain will not be executed. This is
the expected behavior.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-12-qiuxu.zhuo@intel.com
|
|
The Intel Raptor Lake-S SoC contains two memory controllers with DDR5
memory type and out-of-band ECC capability. The resource definitions of
the memory controller are different from previous generations. One notable
difference is that the PCI ERRSTS register is deprecated and is not used
to indicate the presence of errors or to clear the MMIO-mapped ECC error
log regsiters.
Extend the ie31200_edac driver to support multiple memory controllers,
add a resource configuration table and use an MSR register to clear the
ECC error log registers to provide EDAC support for Raptor Lake-S SoCs.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-11-qiuxu.zhuo@intel.com
|
|
Split ie31200_probe1() into two helper functions to easily extend support
for multiple memory controllers.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-10-qiuxu.zhuo@intel.com
|
|
Fold the two channel loops to simplify the code and improve readability.
Also, delete the comments related to the DRB register, as this register
is not used here.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-9-qiuxu.zhuo@intel.com
|
|
The current dimm_data structure contains encoded DIMM information,
which needs to be decoded for a given SoC when it is used. Make it
contain decoded information when it's initialized so that the places
where it is used do not need to decode it again, thereby simplifying
the code.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-8-qiuxu.zhuo@intel.com
|
|
The resources such as MMIO, register offset, register mask, memory DIMM
information, ECC error log location, etc., of the memory controller, and
the number of memory controllers can be device-ID-specific. It requires
adding numerous 'if (device_id == new_id)' special handling cases to the
code to support a new SoC.
Make these kinds of resources configurable and separate them from the code
to facilitate the addition of new SoC support.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-7-qiuxu.zhuo@intel.com
|
|
Use PCI_VDEVICE() to simplify the pci_device_id table.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-6-qiuxu.zhuo@intel.com
|
|
The 3rd parameter of *populate_dimm_info() pertains to the DIMM index
within a channel, not the channel index. Fix the parameter name to dimm
to reflect its actual purpose.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-5-qiuxu.zhuo@intel.com
|
|
The error path order of ie31200_init() is incorrect, fix it.
Fixes: 709ed1bcef12 ("EDAC/ie31200: Fallback if host bridge device is already initialized")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-4-qiuxu.zhuo@intel.com
|
|
The DIMM size mask for {Sky, Kaby, Coffee} Lake is not bits{7:0},
but bits{5:0}. Fix it.
Fixes: 953dee9bbd24 ("EDAC, ie31200_edac: Add Skylake support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-3-qiuxu.zhuo@intel.com
|
|
The EDAC_MC_LAYER_CHIP_SELECT layer pertains to the rank, not the DIMM.
Fix its size to reflect the number of ranks instead of the number of DIMMs.
Also delete the unused macros IE31200_{DIMMS,RANKS}.
Fixes: 7ee40b897d18 ("ie31200_edac: Introduce the driver")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-2-qiuxu.zhuo@intel.com
|
|
Passing a variable string as the format to dev_set_name() causes a W=1 warning:
drivers/edac/edac_device.c:736:9: error: format not a string literal and no format arguments [-Werror=format-security]
736 | ret = dev_set_name(&ctx->dev, name);
| ^~~
Use a literal "%s" instead so the name can be the argument.
Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250304143603.995820-1-arnd@kernel.org
|
|
Don't populate the const read-only array intlv on the stack at run time,
instead make it static. This also shrinks the object size:
$ size pnd2_edac.o.*
text data bss dec hex filename
15632 264 1384 17280 4380 pnd2_edac.o.new
15644 264 1384 17292 438c pnd2_edac.o.old
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20240919170427.497429-1-colin.i.king@gmail.com
|
|
The res_config structs are not modified in this driver.
Constifying these structures moves some data to a read-only section, so
increase overall security, especially when the structure holds some function
pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
36777 2479 4304 43560 aa28 drivers/edac/igen6_edac.o
After:
=====
text data bss dec hex filename
37297 1959 4304 43560 aa28 drivers/edac/igen6_edac.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/a06153870951a64b438e76adf97d440e02c1a1fc.1738355198.git.christophe.jaillet@wanadoo.fr
|
|
Simplify the return statement to improve the code's readability.
No functional changes.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20250201130953.1377-2-thorsten.blum@linux.dev
|
|
Update memory repair control interface for memory sparing feature.
CXL memory devices can support soft and hard memory sparing at cacheline,
row, bank and rank granularities. Memory sparing is defined as a repair
function that replaces a portion of memory with a portion of functional
memory at that same granularity.
When a CXL device detects an error in memory, it will report to the host
that there's need for a repair maintenance operation by using an event
record where the "maintenance needed" flag is set.
The event records contain the device physical address (DPA) and other
attributes of the memory to repair such as bank group, bank, rank, row,
column, channel etc.
The kernel will report the corresponding CXL general media or DRAM trace
event to userspace, and userspace tools (e.g. rasdaemon) will initiate
a repair operation in response to the device request via the sysfs
repair control.
[ bp: Massage. ]
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com
|
|
Add a generic EDAC memory repair control driver to manage memory repairs in
the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR
features.
For example, a CXL device with DRAM components that support PPR features may
implement PPR maintenance operations. DRAM components may support two types of
PPR:
- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.
Soft PPR is much faster than hard PPR, but the repair is lost with a power
cycle.
When a CXL device detects an error in a memory, it may report the need for
a repair maintenance operation by using an event record where the "maintenance
needed" flag is set. The event records contain the device physical
address (DPA) and other optional attributes of the memory to repair.
The kernel will report the corresponding CXL general media or DRAM trace event
to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.
Device with memory repair features registers with EDAC device driver, which
retrieves a memory repair descriptor from EDAC memory repair driver and exposes
the sysfs repair control attributes to userspace in
/sys/bus/edac/devices/<dev-name>/mem_repairX/.
The common memory repair control interface abstracts the control of arbitrary
memory repair functionality into a standardized set of functions. The sysfs
memory repair attribute nodes are only available if the client driver has
implemented the corresponding attribute callback function and provided
operations to the EDAC device driver during registration.
[ bp: Massage, fixup edac_dev_register() retvals, merge
write_overflow fix to mem_repair_create_desc() ]
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
|
|
Remove hard-coded strings by using the str_enabled_disabled(), str_yes_no(),
str_write_read(), and str_plural() helper functions.
Add a space in "All DIMMs support ECC: yes/no" to improve readability.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250223212429.3466-2-thorsten.blum@linux.dev
|
|
Add an Error Check Scrub (ECS) control to manage a memory device's ECS
feature.
The ECS is a feature defined in JEDEC DDR5 SDRAM Specification (JESD79-5) and
allows the DRAM to internally read, correct single-bit errors, and write back
corrected data bits to the DRAM array while providing transparency to error
counts.
The DDR5 device contains a number of memory media Field Replaceable Units
(FRU) per device. The DDR5 ECS feature and thus the ECS control driver
supports configuring the ECS parameters per FRU.
Memory devices support the ECS feature register with the EDAC device driver,
which retrieves the ECS descriptor from the EDAC ECS driver. This driver
exposes sysfs ECS control attributes to userspace via
/sys/bus/edac/devices/<dev-name>/ecs_fruX/.
The common sysfs ECS control interface abstracts the control of an arbitrary
ECS functionality to a common set of functions.
Support for the ECS feature is added separately because the control attributes
of the DDR5 ECS feature differ from those of the scrub feature.
The sysfs ECS attribute nodes are only present if the client driver has
implemented the corresponding attribute callback function and passed the
necessary operations to the EDAC RAS feature driver during registration.
[ bp: Massage, fixup edac_dev_register() retvals. ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com
|
|
Add a scrub control to manage memory scrubbers in the system.
Devices with a scrub feature register with the EDAC device driver which
retrieves the scrub descriptor from the scrub driver and exposes the
control attributes for a instance to userspace at
/sys/bus/edac/devices/<dev-name>/scrubX/.
The common sysfs scrub control interface abstracts the control of
arbitrary scrubbing functionality into a common set of functions. The
attribute nodes are only present if the client driver has implemented
the corresponding attribute callback function and passed the operations
to the device driver during registration.
[ bp: Massage commit message, docs and code, simplify text a bit.
Integrate fixup for: https://lore.kernel.org/r/202502251009.0sGkolEJ-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-3-shiju.jose@huawei.com
|
|
Add generic EDAC device feature controls supporting the registration of RAS
features available in the system. The driver exposes control attributes for
these features to userspace in
/sys/bus/edac/devices/<dev-name>/<ras-feature>
[ bp: Touch-up documentation, simplify, make edac_dev_type static,
fixup edac_dev_register() retvals. ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-2-shiju.jose@huawei.com
|
|
When doing error injection to some memory DIMMs on certain Intel Emerald
Rapids servers, the i10nm_edac missed error reports for some memory DIMMs.
Certain BIOS configurations may hide some memory controllers, and the
i10nm_edac doesn't enumerate these hidden memory controllers. However, the
ADXL decodes memory errors using memory controller physical indices even
if there are hidden memory controllers. Therefore, the memory controller
physical indices reported by the ADXL may mismatch the logical indices
enumerated by the i10nm_edac, resulting in missed error reports for some
memory DIMMs.
Fix this issue by creating a mapping table from memory controller physical
indices (used by the ADXL) to logical indices (used by the i10nm_edac) and
using it to convert the physical indices to the logical indices during the
error handling process.
Fixes: c545f5e41225 ("EDAC/i10nm: Skip the absent memory controllers")
Reported-by: Kevin Chang <kevin1.chang@intel.com>
Tested-by: Kevin Chang <kevin1.chang@intel.com>
Reported-by: Thomas Chen <Thomas.Chen@intel.com>
Tested-by: Thomas Chen <Thomas.Chen@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
|
|
The ECC_ERROR_LOG register of certain SoCs may contain the invalid value
~0, which results in a flood of invalid error reports in polling mode.
Fix the flood of invalid error reports by skipping the invalid ECC error
log value ~0.
Fixes: e14232afa944 ("EDAC/igen6: Add polling support")
Reported-by: Ramses <ramses@well-founded.dev>
Closes: https://lore.kernel.org/all/OISL8Rv--F-9@well-founded.dev/
Tested-by: Ramses <ramses@well-founded.dev>
Reported-by: John <therealgraysky@proton.me>
Closes: https://lore.kernel.org/all/p5YcxOE6M3Ncxpn2-Ia_wCt61EM4LwIiN3LroQvT_-G2jMrFDSOW5k2A9D8UUzD2toGpQBN1eI0sL5dSKnkO8iteZegLoQEj-DwQaMhGx4A=@proton.me/
Tested-by: John <therealgraysky@proton.me>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250212083354.31919-1-qiuxu.zhuo@intel.com
|