aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2019-11-22EDAC/altera: Use fast register IO for S10 IRQsMeng Li1-0/+1
When an IRQ occurs, regmap_{read,write,...}() is invoked in atomic context. Regmap must indicate register IO is fast so that a spinlock is used instead of a mutex to avoid sleeping in atomic context: lock_acquire __mutex_lock mutex_lock_nested regmap_lock_mutex regmap_write a10_eccmgr_irq_unmask unmask_irq.part.0 irq_enable __irq_startup irq_startup __setup_irq request_threaded_irq devm_request_threaded_irq altr_sdram_probe Mark it so. [ bp: Massage. ] Fixes: 3dab6bd52687 ("EDAC, altera: Add support for Stratix10 SDRAM EDAC") Reported-by: Meng Li <Meng.Li@windriver.com> Signed-off-by: Meng Li <Meng.Li@windriver.com> Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: stable <stable@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/1574361048-17572-2-git-send-email-thor.thayer@linux.intel.com
2019-11-22EDAC/ghes: Do not warn when incrementing refcount on 0Robert Richter1-2/+2
The following warning from the refcount framework is seen during ghes initialization: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT) ------------[ cut here ]------------ refcount_t: increment on 0; use-after-free. WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked [...] Call trace: refcount_inc_checked ghes_edac_register ghes_probe ... It warns if the refcount is incremented from zero. This warning is reasonable as a kernel object is typically created with a refcount of one and freed once the refcount is zero. Afterwards the object would be "used-after-free". For GHES, the refcount is initialized with zero, and that is why this message is seen when initializing the first instance. However, whenever the refcount is zero, the device will be allocated and registered. Since the ghes_reg_mutex protects the refcount and serializes allocation and freeing of ghes devices, a use-after-free cannot happen here. Instead of using refcount_inc() for the first instance, use refcount_set(). This can be used here because the refcount is zero at this point and can not change due to its protection by the mutex. Fixes: 23f61b9fc5cc ("EDAC/ghes: Fix locking and memory barrier issues") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <huangming23@huawei.com> Cc: James Morse <james.morse@arm.com> Cc: <linuxarm@huawei.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: <tanxiaofei@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <wanghuiqiang@huawei.com> Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com
2019-11-10EDAC/Documentation: Describe CPER module definition and DIMM ranksRobert Richter1-12/+19
Update on CPER DIMM naming convention and DIMM ranks. [ bp: Touchups. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-14-rrichter@marvell.com
2019-11-10EDAC: Unify the mc_event tracepoint callRobert Richter2-28/+15
The code in ghes_edac.c and edac_mc.c for grain_bits calculation and calling trace_mc_event() is now the same. Move it to a single location in edac_raw_mc_handle_error(). The only difference is the missing IS_ENABLED(CONFIG_RAS) switch, but this is needed for ghes too. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-13-rrichter@marvell.com
2019-11-10EDAC/ghes: Remove intermediate buffer pvt->detail_locationRobert Richter1-5/+4
detail_location[] is used to collect two location strings so they can be passed as one to trace_mc_event(). Instead of having an extra copy step, assemble the location string in other_detail[] from the beginning. Using other_detail[] to call trace_mc_event() is now the same as in edac_mc.c and code can be unified. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-12-rrichter@marvell.com
2019-11-10EDAC/ghes: Fix grain calculationRobert Richter1-2/+8
The current code to convert a physical address mask to a grain (defined as granularity in bytes) is: e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK); This is broken in several ways: 1) It calculates to wrong grain values. E.g., a physical address mask of ~0xfff should give a grain of 0x1000. Without considering PAGE_MASK, there is an off-by-one. Things are worse when also filtering it with ~PAGE_MASK. This will calculate to a grain with the upper bits set. In the example it even calculates to ~0. 2) The grain does not depend on and is unrelated to the kernel's page-size. The page-size only matters when unmapping memory in memory_failure(). Smaller grains are wrongly rounded up to the page-size, on architectures with a configurable page-size (e.g. arm64) this could round up to the even bigger page-size of the hypervisor. Fix this with: e->grain = ~mem_err->physical_addr_mask + 1; The grain_bits are defined as: grain = 1 << grain_bits; Change also the grain_bits calculation accordingly, it is the same formula as in edac_mc.c now and the code can be unified. The value in ->physical_addr_mask coming from firmware is assumed to be contiguous, but this is not sanity-checked. However, in case the mask is non-contiguous, a conversion to grain_bits effectively converts the grain bit mask to a power of 2 by rounding it up. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-11-rrichter@marvell.com
2019-11-10EDAC/ghes: Use standard kernel macros for page calculationsRobert Richter1-2/+2
Use standard macros for page calculations. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-10-rrichter@marvell.com
2019-11-10EDAC: Remove misleading comment in struct edac_raw_error_descRobert Richter1-5/+0
There never has been such function edac_raw_error_desc_clean() and in function ghes_edac_report_mem_error() the whole struct is zero'ed including the string arrays. Remove that comment. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-9-rrichter@marvell.com
2019-11-10EDAC/mc: Reduce indentation level in edac_mc_handle_error()Robert Richter1-28/+29
Reduce the indentation level in edac_mc_handle_error() a bit. No functional changes. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-7-rrichter@marvell.com
2019-11-10EDAC/mc: Remove needless zero string terminationRobert Richter1-1/+0
The e string to which this is pointing to has already been cleared earlier in the function so remove the needless zero string termination. [ bp: Correct the commit message. ] Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-6-rrichter@marvell.com
2019-11-10EDAC/mc: Do not BUG_ON() in edac_mc_alloc()Robert Richter1-1/+2
No need to crash the system in case edac_mc_alloc() is called with invalid arguments, just warn and return. This would cause a checkpatch warning when touching the code later, so just fix it. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-5-rrichter@marvell.com
2019-11-10EDAC: Introduce an mci_for_each_dimm() iteratorRobert Richter5-37/+40
Introduce an mci_for_each_dimm() iterator. It returns a pointer to a struct dimm_info. This makes the declaration and use of an index obsolete and avoids access to internal data of struct mci (direct array access etc). [ bp: push the struct dimm_info *dimm; declaration into the CONFIG_EDAC_DEBUG block. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-4-rrichter@marvell.com
2019-11-09EDAC: Remove EDAC_DIMM_OFF() macroRobert Richter3-72/+21
The EDAC_DIMM_OFF() macro takes 5 arguments to get the DIMM's index. Simplify this by storing the index in struct dimm_info to avoid its calculation and remove the EDAC_DIMM_OFF() macro. The index can be directly used then. Another advantage is that edac_mc_alloc() could be used even if the exact size of the layers is unknown. Only the number of DIMMs would be needed. Rename iterator variable to idx, while at it. The name is more handy, esp. when searching for it in the code. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-3-rrichter@marvell.com
2019-11-09EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() functionRobert Richter14-63/+74
The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info. Clean up this interface to only pass the mci struct and replace this macro with a new function edac_get_dimm(). Also introduce an edac_get_dimm_by_index() function for later use. This allows it to get a DIMM pointer only by a given index. This can be useful if the DIMM's position within the layers of the memory controller or the exact size of the layers are unknown. Small style changes made for some hunks after applying the semantic patch. Semantic patch used: @@ expression mci, a, b,c; @@ -EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c) +edac_get_dimm(mci, a, b, c) [ bp: Touchups. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com
2019-11-09EDAC/amd64: Get rid of the ECC disabled long messageBorislav Petkov1-16/+3
This message keeps flooding dmesg on boxes where ECC is disabled or the DIMMs do not support ECC but the module gets auto-probed. What's even worse is that autoprobing happens on every CPU due to the CPU-family matching the driver does and uevent being generated for each CPU device. What is more, this message is becoming even more useless on newer systems where forcing ECC is not recommended and it should be done in the BIOS so the BIOS can do all the necessary work, i.e., just setting a bit in an MSR is not enough anymore. So get rid of it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20191106160607.GC28380@zn.tnic
2019-11-08EDAC/ghes: Fix locking and memory barrier issuesRobert Richter1-24/+66
The ghes registration and refcount is broken in several ways: * ghes_edac_register() returns with success for a 2nd instance even if a first instance's registration is still running. This is not correct as the first instance may fail later. A subsequent registration may not finish before the first. Parallel registrations must be avoided. * The refcount was increased even if a registration failed. This leads to stale counters preventing the device from being released. * The ghes refcount may not be decremented properly on unregistration. Always decrement the refcount once ghes_edac_unregister() is called to keep the refcount sane. * The ghes_pvt pointer is handed to the irq handler before registration finished. * The mci structure could be freed while the irq handler is running. Fix this by adding a mutex to ghes_edac_register(). This mutex serializes instances to register and unregister. The refcount is only increased if the registration succeeded. This makes sure the refcount is in a consistent state after registering or unregistering a device. Note: A spinlock cannot be used here as the code section may sleep. The ghes_pvt is protected by ghes_lock now. This ensures the pointer is not updated before registration was finished or while the irq handler is running. It is unset before unregistering the device including necessary (implicit) memory barriers making the changes visible to other CPUs. Thus, the device can not be used anymore by an interrupt. Also, rename ghes_init to ghes_refcount for better readability and switch to refcount API. A refcount is needed because there can be multiple GHES structures being defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error Source, "Some platforms may describe multiple Generic Hardware Error Source structures with different notification types, ..."). Another approach to use the mci's device refcount (get_device()) and have a release function does not work here. A release function will be called only for device_release() with the last put_device() call. The device must be deleted *before* that with device_del(). This is only possible by maintaining an own refcount. [ bp: touchups. ] Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path") Co-developed-by: James Morse <james.morse@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com
2019-11-06EDAC/amd64: Check for memory before fully initializing an instanceYazen Ghannam1-3/+22
Return early before checking for ECC if the node does not have any populated memory. Free any cached hardware data before returning. Also, return 0 in this case since this is not a failure. Other nodes may have memory and the module should attempt to load an instance for them. Move printing of hardware information to after the instance is initialized, so that the information is only printed for nodes with memory. Return an error code when ECC is disabled. This check happens after checking for memory. The module should explicitly fail to load if memory is populated on a node and ECC is disabled. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-6-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Use cached data when checking for ECCYazen Ghannam1-12/+8
...now that the data is available earlier. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-5-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Save max number of controllers to family typeYazen Ghannam2-30/+16
The maximum number of memory controllers is fixed within a family/model group. In most cases, this has been fixed at 2, but some systems may have up to 8. The struct amd64_family_type already contains family/model-specific information, and this can be used rather than adding model checks to various functions. Create a new field in struct amd64_family_type for max_mcs. Set this when setting other family type information, and use this when needing the maximum number of memory controllers possible for a system. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-4-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Gather hardware information earlyYazen Ghannam1-50/+51
Split out gathering hardware information from init_one_instance() into a separate function hw_info_get(). This is necessary so that the information can be cached earlier and used to check if memory is populated and if ECC is enabled on a node. Also, define a function hw_info_put() to back out changes made in hw_info_get(). Check for an allocated PCI device (Function 0 for Family 17h or Function 1 for pre-Family 17h) before freeing, since hw_info_put() may be called before PCI siblings are reserved. Drop the family check when freeing pvt->umc. This will be NULL on pre-Family 17h systems. However, kfree() is safe and will check for a NULL pointer before freeing. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-3-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Make struct amd64_family_type globalYazen Ghannam1-7/+5
The struct amd64_family_type doesn't change between multiple nodes and instances of the module, so make it global. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-2-Yazen.Ghannam@amd.com
2019-10-25EDAC/amd64: Set grain per DIMMYazen Ghannam1-0/+2
The following commit introduced a warning on error reports without a non-zero grain value. 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation") The amd64_edac_mod module does not provide a value, so the warning will be given on the first reported memory error. Set the grain per DIMM to cacheline size (64 bytes). This is the current recommendation. Fixes: 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191022203448.13962-7-Yazen.Ghannam@amd.com
2019-10-24EDAC/aspeed: Use devm_platform_ioremap_resource() in aspeed_probe()Markus Elfring1-6/+1
Simplify this function implementation by using a known wrapper function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Joel Stanley <joel@jms.id.au> Cc: Andrew Jeffery <andrew@aj.id.au> Cc: James Morse <james.morse@arm.com> Cc: kernel-janitors@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-aspeed@lists.ozlabs.org Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Stefan Schaeckeler <sschaeck@cisco.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/baabb9e9-a1b2-3a04-9fb6-aa632de5f722@web.de
2019-10-18EDAC, skx: Retrieve and print retry_rd_err_log registersTony Luck3-7/+60
Skylake logs some additional useful information in per-channel registers in addition the the architectural status/addr/misc logged in the machine check bank. Pick up this information and add it to the EDAC log: retry_rd_err_[five 32-bit register values] Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems. correrrcnt[per rank corrected error counts] Note that if additional errors are logged while these registers are being read, you may see a jumble of values some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. With this change EDAC logs look like this: EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000]) Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-10-18EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode.Tony Luck1-25/+23
Simplifies the code a little. Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-10-17EDAC/ghes: Fix Use after free in ghes_edac remove pathJames Morse1-0/+4
ghes_edac models a single logical memory controller, and uses a global ghes_init variable to ensure only the first ghes_edac_register() will do anything. ghes_edac is registered the first time a GHES entry in the HEST is probed. There may be multiple entries, so subsequent attempts to register ghes_edac are silently ignored as the work has already been done. When a GHES entry is unregistered, it calls ghes_edac_unregister(), which free()s the memory behind the global variables in ghes_edac. But there may be multiple GHES entries, the next call to ghes_edac_unregister() will dereference the free()d memory, and attempt to free it a second time. This may also be triggered on a platform with one GHES entry, if the driver is unbound/re-bound and unbound. The re-bind step will do nothing because of ghes_init, the second unbind will then do the same work as the first. Doing the unregister work on the first call is unsafe, as another CPU may be processing a notification in ghes_edac_report_mem_error(), using the memory we are about to free. ghes_init is already half of the reference counting. We only need to do the register work for the first call, and the unregister work for the last. Add the unregister check. This means we no longer free ghes_edac's memory while there are GHES entries that may receive a notification. This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE. [ bp: merge into a single patch. ] Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com
2019-10-09EDAC/device: Rework error logging APIHanna Hawa2-32/+72
Make the main workhorse the "count" functions which can log a @count of errors. Have the current APIs edac_device_handle_{ce,ue}() call the _count() variants and this way keep the exported symbols number unchanged. [ bp: Rewrite. ] Signed-off-by: Hanna Hawa <hhhawa@amazon.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: benh@amazon.com Cc: dwmw@amazon.co.uk Cc: hanochu@amazon.com Cc: James Morse <james.morse@arm.com> Cc: jonnyc@amazon.com Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: ronenk@amazon.com Cc: talel@amazon.com Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190923191741.29322-2-hhhawa@amazon.com
2019-09-30EDAC: skx_common: get rid of unused type varMauro Carvalho Chehab1-4/+1
drivers/edac/skx_common.c: In function ‘skx_mce_output_error’: drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 478 | char *type, *optype; | ^~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: sb_edac: get rid of unused varsMauro Carvalho Chehab1-13/+8
There are several vars unused on this driver, probably because it was a modified copy of another driver. Get rid of them. drivers/edac/sb_edac.c: In function ‘knl_get_dimm_capacity’: drivers/edac/sb_edac.c:1343:16: warning: variable ‘sad_size’ set but not used [-Wunused-but-set-variable] 1343 | u64 sad_base, sad_size, sad_limit = 0; | ^~~~~~~~ drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’: drivers/edac/sb_edac.c:2955:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 2955 | char *type, *optype, msg[256]; | ^~~~ drivers/edac/sb_edac.c: In function ‘sbridge_unregister_mci’: drivers/edac/sb_edac.c:3203:22: warning: variable ‘pvt’ set but not used [-Wunused-but-set-variable] 3203 | struct sbridge_pvt *pvt; | ^~~ At top level: drivers/edac/sb_edac.c:266:18: warning: ‘correrrthrsld’ defined but not used [-Wunused-const-variable=] 266 | static const u32 correrrthrsld[] = { | ^~~~~~~~~~~~~ drivers/edac/sb_edac.c:257:18: warning: ‘correrrcnt’ defined but not used [-Wunused-const-variable=] 257 | static const u32 correrrcnt[] = { | ^~~~~~~~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: i5400_edac: get rid of some unused varsMauro Carvalho Chehab1-10/+1
There are several temporary unused vars: drivers/edac/i5400_edac.c: In function ‘i5400_get_mc_regs’: drivers/edac/i5400_edac.c:1058:6: warning: variable ‘maxdimmperch’ set but not used [-Wunused-but-set-variable] 1058 | int maxdimmperch; | ^~~~~~~~~~~~ drivers/edac/i5400_edac.c:1057:6: warning: variable ‘maxch’ set but not used [-Wunused-but-set-variable] 1057 | int maxch; | ^~~~~ drivers/edac/i5400_edac.c: In function ‘i5400_init_dimms’: drivers/edac/i5400_edac.c:1174:6: warning: variable ‘max_dimms’ set but not used [-Wunused-but-set-variable] 1174 | int max_dimms; | ^~~~~~~~~ drivers/edac/i5400_edac.c:1173:14: warning: variable ‘channel_count’ set but not used [-Wunused-but-set-variable] 1173 | int ndimms, channel_count; | ^~~~~~~~~~~~~ Get rid of them. Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: i5400_edac: print type at debug messageMauro Carvalho Chehab1-2/+2
There are 3 types of non-recoverable errors that the MC reports: - Fatal; - Non-fatal uncorrected - Non-fatal correctable While we don't add it to the log itself, it could be useful to have this at least for debug messages. This shuts up this warning: drivers/edac/i5400_edac.c: In function ‘i5400_proccess_non_recoverable_info’: drivers/edac/i5400_edac.c:524:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 524 | char *type = NULL; | ^~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: i7300_edac: fix a kernel-doc syntaxMauro Carvalho Chehab1-1/+1
The declaration of the kerneldoc entry is wrong, causing this warning: drivers/edac/i7300_edac.c:824: warning: Function parameter or member 'mir_no' not described in 'decode_mir' Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: i7300_edac: rename a kernel-doc var descriptionMauro Carvalho Chehab1-1/+1
One var was renamed, but the associated kernel-doc markup still points to the old name. Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30EDAC: i5100_edac: get rid of an unused varMauro Carvalho Chehab1-2/+0
As reported by GCC with W=1: drivers/edac/i5100_edac.c:714:16: warning: variable ‘et’ set but not used [-Wunused-but-set-variable] 714 | unsigned long et; | ^~ It sounds some left over from some code before the addition of an udelay(). Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30Linux 5.4-rc1Linus Torvalds1-2/+2
2019-09-30csky: Move static keyword to the front of declarationKrzysztof Wilczynski1-1/+1
Move the static keyword to the front of declaration of csky_pmu_of_device_ids, and resolve the following compiler warning that can be seen when building with warnings enabled (W=1): arch/csky/kernel/perf_event.c:1340:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Guo Ren <guoren@kernel.org>
2019-09-30csky: entry: Remove unneeded need_resched() loopValentin Schneider1-4/+0
Since the enabling and disabling of IRQs within preempt_schedule_irq() is contained in a need_resched() loop, we don't need the outer arch code loop. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Guo Ren <guoren@kernel.org>
2019-09-30csky: Fixup csky_pmu.max_period assignmentMao Han1-1/+1
The csky_pmu.max_period has type u64, and BIT() can only return 32 bits unsigned long on C-SKY. The initialization for max_period will be incorrect when count_width is bigger than 32. Use BIT_ULL() Signed-off-by: Mao Han <han_mao@c-sky.com> Signed-off-by: Guo Ren <ren_guo@c-sky.com>
2019-09-30csky: Fixup add zero_fp fixup perf backtrace panicGuo Ren2-21/+31
We need set fp zero to let backtrace know the end. The patch fixup perf callchain panic problem, because backtrace didn't know what is the end of fp. Signed-off-by: Guo Ren <ren_guo@c-sky.com> Reported-by: Mao Han <han_mao@c-sky.com>
2019-09-30csky: Use generic free_initrd_mem()Mike Rapoport1-16/+0
The csky implementation of free_initrd_mem() is an open-coded version of free_reserved_area() without poisoning. Remove it and make csky use the generic version of free_initrd_mem(). Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Guo Ren <guoren@kernel.org>
2019-09-29Revert "Revert "ext4: make __ext4_get_inode_loc plug""Linus Torvalds1-0/+3
This reverts commit 72dbcf72156641fde4d8ea401e977341bfd35a05. Instead of waiting forever for entropy that may just not happen, we now try to actively generate entropy when required, and are thus hopefully avoiding the problem that caused the nice ext4 IO pattern fix to be reverted. So revert the revert. Cc: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-29random: try to actively add entropy rather than passively wait for itLinus Torvalds1-1/+61
For 5.3 we had to revert a nice ext4 IO pattern improvement, because it caused a bootup regression due to lack of entropy at bootup together with arguably broken user space that was asking for secure random numbers when it really didn't need to. See commit 72dbcf721566 (Revert "ext4: make __ext4_get_inode_loc plug"). This aims to solve the issue by actively generating entropy noise using the CPU cycle counter when waiting for the random number generator to initialize. This only works when you have a high-frequency time stamp counter available, but that's the case on all modern x86 CPU's, and on most other modern CPU's too. What we do is to generate jitter entropy from the CPU cycle counter under a somewhat complex load: calling the scheduler while also guaranteeing a certain amount of timing noise by also triggering a timer. I'm sure we can tweak this, and that people will want to look at other alternatives, but there's been a number of papers written on jitter entropy, and this should really be fairly conservative by crediting one bit of entropy for every timer-induced jump in the cycle counter. Not because the timer itself would be all that unpredictable, but because the interaction between the timer and the loop is going to be. Even if (and perhaps particularly if) the timer actually happens on another CPU, the cacheline interaction between the loop that reads the cycle counter and the timer itself firing is going to add perturbations to the cycle counter values that get mixed into the entropy pool. As Thomas pointed out, with a modern out-of-order CPU, even quite simple loops show a fair amount of hard-to-predict timing variability even in the absense of external interrupts. But this tries to take that further by actually having a fairly complex interaction. This is not going to solve the entropy issue for architectures that have no CPU cycle counter, but it's not clear how (and if) that is solvable, and the hardware in question is largely starting to be irrelevant. And by doing this we can at least avoid some of the even more contentious approaches (like making the entropy waiting time out in order to avoid the possibly unbounded waiting). Cc: Ahmed Darwish <darwish.07@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Nicholas Mc Guire <hofrat@opentech.at> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-29Documentation/process: Clarify disclosure rulesThomas Gleixner1-7/+33
The role of the contact list provided by the disclosing party and how it affects the disclosure process and the ability to include experts into the development process is not really well explained. Neither is it entirely clear when the disclosing party will be informed about the fact that a developer who is not covered by an employer NDA needs to be brought in and disclosed. Explain the role of the contact list and the information policy along with an eventual conflict resolution better. Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/alpine.DEB.2.21.1909251028390.10825@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-28selftests/ftrace: Fix same probe error testSteven Rostedt (VMware)1-1/+1
The "same probe" selftest that tests that adding the same probe fails doesn't add the same probe and passes, which fails the test. Fixes: b78b94b82122 ("selftests/ftrace: Update kprobe event error testcase") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28mm, tracing: Print symbol name for call_site in trace eventsChangbin Du1-3/+4
To improve the readability of raw slab trace points, print the call_site ip using '%pS'. Then we can grep events with function names. [002] .... 808.188897: kmem_cache_free: call_site=putname+0x47/0x50 ptr=00000000cef40c80 [002] .... 808.188898: kfree: call_site=security_cred_free+0x42/0x50 ptr=0000000062400820 [002] .... 808.188904: kmem_cache_free: call_site=put_cred_rcu+0x88/0xa0 ptr=0000000058d74ef8 [002] .... 808.188913: kmem_cache_alloc: call_site=prepare_creds+0x26/0x100 ptr=0000000058d74ef8 bytes_req=168 bytes_alloc=576 gfp_flags=GFP_KERNEL [002] .... 808.188917: kmalloc: call_site=security_prepare_creds+0x77/0xa0 ptr=0000000062400820 bytes_req=8 bytes_alloc=336 gfp_flags=GFP_KERNEL|__GFP_ZERO [002] .... 808.188920: kmem_cache_alloc: call_site=getname_flags+0x4f/0x1e0 ptr=00000000cef40c80 bytes_req=4096 bytes_alloc=4480 gfp_flags=GFP_KERNEL [002] .... 808.188925: kmem_cache_free: call_site=putname+0x47/0x50 ptr=00000000cef40c80 [002] .... 808.188926: kfree: call_site=security_cred_free+0x42/0x50 ptr=0000000062400820 [002] .... 808.188931: kmem_cache_free: call_site=put_cred_rcu+0x88/0xa0 ptr=0000000058d74ef8 Link: http://lkml.kernel.org/r/20190914103215.23301-1-changbin.du@gmail.com Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28tracing: Have error path in predicate_parse() free its allocated memoryNavid Emamdoost1-2/+4
In predicate_parse, there is an error path that is not going to out_free instead it returns directly which leads to a memory leak. Link: http://lkml.kernel.org/r/20190920225800.3870-1-navid.emamdoost@gmail.com Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macroNathan Chancellor1-5/+5
After r372664 in clang, the IF_ASSIGN macro causes a couple hundred warnings along the lines of: kernel/trace/trace_output.c:1331:2: warning: converting the enum constant to a boolean [-Wint-in-bool-context] kernel/trace/trace.h:409:3: note: expanded from macro 'trace_assign_type' IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry, ^ kernel/trace/trace.h:371:14: note: expanded from macro 'IF_ASSIGN' WARN_ON(id && (entry)->type != id); \ ^ 264 warnings generated. This warning can catch issues with constructs like: if (state == A || B) where the developer really meant: if (state == A || state == B) This is currently the only occurrence of the warning in the kernel tree across defconfig, allyesconfig, allmodconfig for arm32, arm64, and x86_64. Add the implicit '!= 0' to the WARN_ON statement to fix the warnings and find potential issues in the future. Link: https://github.com/llvm/llvm-project/commit/28b38c277a2941e9e891b2db30652cfd962f070b Link: https://github.com/ClangBuiltLinux/linux/issues/686 Link: http://lkml.kernel.org/r/20190926162258.466321-1-natechancellor@gmail.com Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28tracing/probe: Fix to check the difference of nr_args before adding probeMasami Hiramatsu1-0/+16
Steven reported that a test triggered: ================================================================== BUG: KASAN: slab-out-of-bounds in trace_kprobe_create+0xa9e/0xe40 Read of size 8 at addr ffff8880c4f25a48 by task ftracetest/4798 CPU: 2 PID: 4798 Comm: ftracetest Not tainted 5.3.0-rc6-test+ #30 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: dump_stack+0x7c/0xc0 ? trace_kprobe_create+0xa9e/0xe40 print_address_description+0x6c/0x332 ? trace_kprobe_create+0xa9e/0xe40 ? trace_kprobe_create+0xa9e/0xe40 __kasan_report.cold.6+0x1a/0x3b ? trace_kprobe_create+0xa9e/0xe40 kasan_report+0xe/0x12 trace_kprobe_create+0xa9e/0xe40 ? print_kprobe_event+0x280/0x280 ? match_held_lock+0x1b/0x240 ? find_held_lock+0xac/0xd0 ? fs_reclaim_release.part.112+0x5/0x20 ? lock_downgrade+0x350/0x350 ? kasan_unpoison_shadow+0x30/0x40 ? __kasan_kmalloc.constprop.6+0xc1/0xd0 ? trace_kprobe_create+0xe40/0xe40 ? trace_kprobe_create+0xe40/0xe40 create_or_delete_trace_kprobe+0x2e/0x60 trace_run_command+0xc3/0xe0 ? trace_panic_handler+0x20/0x20 ? kasan_unpoison_shadow+0x30/0x40 trace_parse_run_command+0xdc/0x163 vfs_write+0xe1/0x240 ksys_write+0xba/0x150 ? __ia32_sys_read+0x50/0x50 ? tracer_hardirqs_on+0x61/0x180 ? trace_hardirqs_off_caller+0x43/0x110 ? mark_held_locks+0x29/0xa0 ? do_syscall_64+0x14/0x260 do_syscall_64+0x68/0x260 Fix to check the difference of nr_args before adding probe on existing probes. This also may set the error log index bigger than the number of command parameters. In that case it sets the error position is next to the last parameter. Link: http://lkml.kernel.org/r/156966474783.3478.13217501608215769150.stgit@devnote2 Fixes: ca89bc071d5e ("tracing/kprobe: Add multi-probe per event support") Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28mm, page_alloc: allow hugepage fallback to remote nodes when madvisedDavid Rientjes1-0/+11
For systems configured to always try hard to allocate transparent hugepages (thp defrag setting of "always") or for memory that has been explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to remote memory to allocate the hugepage if the local allocation fails first. The point is to allow the initial call to __alloc_pages_node() to attempt to defragment local memory to make a hugepage available, if possible, rather than immediately fallback to remote memory. Local hugepages will always have a better access latency than remote (huge)pages, so an attempt to make a hugepage available locally is always preferred. If memory compaction cannot be successful locally, however, it is likely better to fallback to remote memory. This could take on two forms: either allow immediate fallback to remote memory or do per-zone watermark checks. It would be possible to fallback only when per-zone watermarks fail for order-0 memory, since that would require local reclaim for all subsequent faults so remote huge allocation is likely better than thrashing the local zone for large workloads. In this case, it is assumed that because the system is configured to try hard to allocate hugepages or the vma is advised to explicitly want to try hard for hugepages that remote allocation is better when local allocation and memory compaction have both failed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-28mm, page_alloc: avoid expensive reclaim when compaction may not succeedDavid Rientjes1-0/+22
Memory compaction has a couple significant drawbacks as the allocation order increases, specifically: - isolate_freepages() is responsible for finding free pages to use as migration targets and is implemented as a linear scan of memory starting at the end of a zone, - failing order-0 watermark checks in memory compaction does not account for how far below the watermarks the zone actually is: to enable migration, there must be *some* free memory available. Per the above, watermarks are not always suffficient if isolate_freepages() cannot find the free memory but it could require hundreds of MBs of reclaim to even reach this threshold (read: potentially very expensive reclaim with no indication compaction can be successful), and - if compaction at this order has failed recently so that it does not even run as a result of deferred compaction, looping through reclaim can often be pointless. For hugepage allocations, these are quite substantial drawbacks because these are very high order allocations (order-9 on x86) and falling back to doing reclaim can potentially be *very* expensive without any indication that compaction would even be successful. Reclaim itself is unlikely to free entire pageblocks and certainly no reliance should be put on it to do so in isolation (recall lumpy reclaim). This means we should avoid reclaim and simply fail hugepage allocation if compaction is deferred. It is also not helpful to thrash a zone by doing excessive reclaim if compaction may not be able to access that memory. If order-0 watermarks fail and the allocation order is sufficiently large, it is likely better to fail the allocation rather than thrashing the zone. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>