aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/base (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-10-24net: Convert more users of mdiobus_* to mdiodev_*Sean Anderson1-3/+3
This converts users of mdiobus to mdiodev using the following semantic patch: @@ identifier mdiodev; expression regnum; @@ - mdiobus_read(mdiodev->bus, mdiodev->addr, regnum) + mdiodev_read(mdiodev, regnum) @@ identifier mdiodev; expression regnum, val; @@ - mdiobus_write(mdiodev->bus, mdiodev->addr, regnum, val) + mdiodev_write(mdiodev, regnum, val) Signed-off-by: Sean Anderson <sean.anderson@seco.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-23regmap: spi: Set regmap max raw r/w from max_transfer_sizeLucas Tanure1-4/+32
Set regmap raw read/write from spi max_transfer_size so regmap_raw_read/write can split the access into chunks Signed-off-by: Lucas Tanure <tanureal@opensource.cirrus.com> Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com> [André: fix build warning] Signed-off-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20211021132721.13669-1-andrealmeid@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-10-22PM: sleep: Do not let "syscore" devices runtime-suspend during system transitionsRafael J. Wysocki1-4/+5
There is no reason to allow "syscore" devices to runtime-suspend during system-wide PM transitions, because they are subject to the same possible failure modes as any other devices in that respect. Accordingly, change device_prepare() and device_complete() to call pm_runtime_get_noresume() and pm_runtime_put(), respectively, for "syscore" devices too. Fixes: 057d51a1268f ("Merge branch 'pm-sleep'") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 3.10+ <stable@vger.kernel.org> # 3.10+ Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-10-22firmware_loader: move struct builtin_fw to the only place usedLuis Chamberlain1-0/+6
Now that x86 doesn't abuse picking at internals to the firmware loader move out the built-in firmware struct to its only user. Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20211021155843.1969401-5-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-22firmware_loader: formalize built-in firmware APILuis Chamberlain4-79/+122
Formalize the built-in firmware with a proper API. This can later be used by other callers where all they need is built-in firmware. We export the firmware_request_builtin() call for now only under the TEST_FIRMWARE symbol namespace as there are no direct modular users for it. If they pop up they are free to export it generally. Built-in code always gets access to the callers and we'll demonstrate a hidden user which has been lurking in the kernel for a while and the reason why using a proper API was better long term. Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20211021155843.1969401-2-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+2
Lots of simnple overlapping additions. With a build fix from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-21component: do not leave master devres group open after bindKai Vehmanen1-2/+3
In current code, the devres group for aggregate master is left open after call to component_master_add_*(). This leads to problems when the master does further managed allocations on its own. When any participating driver calls component_del(), this leads to immediate release of resources. This came up when investigating a page fault occurring with i915 DRM driver unbind with 5.15-rc1 kernel. The following sequence occurs: i915_pci_remove() -> intel_display_driver_unregister() -> i915_audio_component_cleanup() -> component_del() -> component.c:take_down_master() -> hdac_component_master_unbind() [via master->ops->unbind()] -> devres_release_group(master->parent, NULL) With older kernels this has not caused issues, but with audio driver moving to use managed interfaces for more of its allocations, this no longer works. Devres log shows following to occur: component_master_add_with_match() [ 126.886032] snd_hda_intel 0000:00:1f.3: DEVRES ADD 00000000323ccdc5 devm_component_match_release (24 bytes) [ 126.886045] snd_hda_intel 0000:00:1f.3: DEVRES ADD 00000000865cdb29 grp< (0 bytes) [ 126.886049] snd_hda_intel 0000:00:1f.3: DEVRES ADD 000000001b480725 grp< (0 bytes) audio driver completes its PCI probe() [ 126.892238] snd_hda_intel 0000:00:1f.3: DEVRES ADD 000000001b480725 pcim_iomap_release (48 bytes) component_del() called() at DRM/i915 unbind() [ 137.579422] i915 0000:00:02.0: DEVRES REL 00000000ef44c293 grp< (0 bytes) [ 137.579445] snd_hda_intel 0000:00:1f.3: DEVRES REL 00000000865cdb29 grp< (0 bytes) [ 137.579458] snd_hda_intel 0000:00:1f.3: DEVRES REL 000000001b480725 pcim_iomap_release (48 bytes) So the "devres_release_group(master->parent, NULL)" ends up freeing the pcim_iomap allocation. Upon next runtime resume, the audio driver will cause a page fault as the iomap alloc was released without the driver knowing about it. Fix this issue by using the "struct master" pointer as identifier for the devres group, and by closing the devres group after the master->ops->bind() call is done. This allows devres allocations done by the driver acting as master to be isolated from the binding state of the aggregate driver. This modifies the logic originally introduced in commit 9e1ccb4a7700 ("drivers/base: fix devres handling for master device") Fixes: 9e1ccb4a7700 ("drivers/base: fix devres handling for master device") Cc: stable@vger.kernel.org Acked-by: Imre Deak <imre.deak@intel.com> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> BugLink: https://gitlab.freedesktop.org/drm/intel/-/issues/4136 Link: https://lore.kernel.org/r/20211013161345.3755341-1-kai.vehmanen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-20driver core: Provide device_match_acpi_handle() helperAndy Shevchenko1-0/+6
We have a couple of users of this helper, make it available for them. The prototype for the helper is specifically crafted in order to be easily used with bus_find_device() call. That's why its location is in the driver core rather than ACPI. Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20211014134756.39092-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-18Merge 5.15-rc6 into driver-core-nextGreg Kroah-Hartman2-2/+3
We need the driver-core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-17Merge tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-coreLinus Torvalds1-1/+2
Pull driver core fixes from Greg KH: "Here are some small driver core fixes for 5.15-rc6, all of which have been in linux-next for a while with no reported issues. They include: - kernfs negative dentry bugfix - simple pm bus fixes to resolve reported issues" * tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: drivers: bus: Delete CONFIG_SIMPLE_PM_BUS drivers: bus: simple-pm-bus: Add support for probing simple bus only devices driver core: Reject pointless SYNC_STATE_ONLY device links kernfs: don't create a negative dentry if inactive node exists
2021-10-15topology: Represent clusters of CPUs within a dieJonathan Cameron2-0/+25
Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
2021-10-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
tools/testing/selftests/net/ioam6.sh 7b1700e009cc ("selftests: net: modify IOAM tests for undef bits") bf77b1400a56 ("selftests: net: Test for the IOAM encapsulation with IPv6") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12regmap: Fix possible double-free in regcache_rbtree_exit()Yang Yingliang1-4/+3
In regcache_rbtree_insert_to_block(), when 'present' realloc failed, the 'blk' which is supposed to assign to 'rbnode->block' will be freed, so 'rbnode->block' points a freed memory, in the error handling path of regcache_rbtree_init(), 'rbnode->block' will be freed again in regcache_rbtree_exit(), KASAN will report double-free as follows: BUG: KASAN: double-free or invalid-free in kfree+0xce/0x390 Call Trace: slab_free_freelist_hook+0x10d/0x240 kfree+0xce/0x390 regcache_rbtree_exit+0x15d/0x1a0 regcache_rbtree_init+0x224/0x2c0 regcache_init+0x88d/0x1310 __regmap_init+0x3151/0x4a80 __devm_regmap_init+0x7d/0x100 madera_spi_probe+0x10f/0x333 [madera_spi] spi_probe+0x183/0x210 really_probe+0x285/0xc30 To fix this, moving up the assignment of rbnode->block to immediately after the reallocation has succeeded so that the data structure stays valid even if the second reallocation fails. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 3f4ff561bc88b ("regmap: rbtree: Make cache_present bitmap per node") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20211012023735.1632786-1-yangyingliang@huawei.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-10-11Merge tag 'linux-kselftest-kunit-fixes-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftestLinus Torvalds1-1/+1
Pull Kunit fixes from Shuah Khan: - Fixes to address the structleak plugin causing the stack frame size to grow immensely when used with KUnit. Fixes include adding a new makefile to disable structleak and using it from KUnit iio, device property, thunderbolt, and bitfield tests to disable it. - KUnit framework reference count leak in kfree_at_end - KUnit tool fix to resolve conflict between --json and --raw_output and generate correct test output in either case. - kernel-doc warnings due to mismatched arg names * tag 'linux-kselftest-kunit-fixes-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: fix kernel-doc warnings due to mismatched arg names bitfield: build kunit tests without structleak plugin thunderbolt: build kunit tests without structleak plugin device property: build kunit tests without structleak plugin iio/test-format: build kunit tests without structleak plugin gcc-plugins/structleak: add makefile var for disabling structleak kunit: fix reference count leak in kfree_at_end kunit: tool: better handling of quasi-bool args (--json, --raw_output)
2021-10-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-27/+63
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07device property: move mac addr helpers to eth.cJakub Kicinski1-63/+0
Move the mac address helpers out, eth.c already contains a bunch of similar helpers. Suggested-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-06device property: build kunit tests without structleak pluginBrendan Higgins1-1/+1
The structleak plugin causes the stack frame size to grow immensely when used with KUnit: ../drivers/base/test/property-entry-test.c:492:1: warning: the frame size of 2832 bytes is larger than 2048 bytes [-Wframe-larger-than=] ../drivers/base/test/property-entry-test.c:322:1: warning: the frame size of 2080 bytes is larger than 2048 bytes [-Wframe-larger-than=] ../drivers/base/test/property-entry-test.c:250:1: warning: the frame size of 4976 bytes is larger than 2048 bytes [-Wframe-larger-than=] ../drivers/base/test/property-entry-test.c:115:1: warning: the frame size of 3280 bytes is larger than 2048 bytes [-Wframe-larger-than=] Turn it off in this file. Signed-off-by: Brendan Higgins <brendanhiggins@google.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2021-10-05driver core: Reject pointless SYNC_STATE_ONLY device linksSaravana Kannan1-1/+2
SYNC_STATE_ONLY device links intentionally allow cycles because cyclic sync_state() dependencies are valid and necessary. However a SYNC_STATE_ONLY device link where the consumer and the supplier are the same device is pointless because the device link would be deleted as soon as the device probes (because it's also the consumer) and won't affect when the sync_state() callback is called. It's a waste of CPU cycles and memory to create this device link. So reject any attempts to create such a device link. Fixes: 05ef983e0d65 ("driver core: Add device link support for SYNC_STATE_ONLY flag") Cc: stable <stable@vger.kernel.org> Reported-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210929190549.860541-1-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05firmware_loader: add a sanity check for firmware_request_builtin()Luis Chamberlain1-0/+3
Right now firmware_request_builtin() is used internally only and so we have control over the callers. But if we want to expose that API more broadly we should ensure the firmware pointer is valid. This doesn't fix any known issue, it just prepares us to later expose this API to other users. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210917182226.3532898-4-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05firmware_loader: split built-in firmware callLuis Chamberlain1-8/+21
There are two ways the firmware_loader can use the built-in firmware: with or without the pre-allocated buffer. We already have one explicit use case for each of these, and so split them up so that it is clear what the intention is on the caller side. This also paves the way so that eventually other callers outside of the firmware loader can uses these if and when needed. While at it, adopt the firmware prefix for the routine names. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210917182226.3532898-3-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05firmware_loader: fix pre-allocated buf built-in firmware useLuis Chamberlain1-6/+7
The firmware_loader can be used with a pre-allocated buffer through the use of the API calls: o request_firmware_into_buf() o request_partial_firmware_into_buf() If the firmware was built-in and present, our current check for if the built-in firmware fits into the pre-allocated buffer does not return any errors, and we proceed to tell the caller that everything worked fine. It's a lie and no firmware would end up being copied into the pre-allocated buffer. So if the caller trust the result it may end up writing a bunch of 0's to a device! Fix this by making the function that checks for the pre-allocated buffer return non-void. Since the typical use case is when no pre-allocated buffer is provided make this return successfully for that case. If the built-in firmware does *not* fit into the pre-allocated buffer size return a failure as we should have been doing before. I'm not aware of users of the built-in firmware using the API calls with a pre-allocated buffer, as such I doubt this fixes any real life issue. But you never know... perhaps some oddball private tree might use it. In so far as upstream is concerned this just fixes our code for correctness. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210917182226.3532898-2-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05drivers/base/component.c: remove superfluous header files from component.cMianhan Liu1-1/+0
component.c hasn't use any macro or function declared in linux/kref.h. Thus, these files can be removed from component.c safely without affecting the compilation of the drivers/base/ module Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Link: https://lore.kernel.org/r/20210928193849.28717-1-liumh1@shanghaitech.edu.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05drivers/base/arch_topology.c: remove superfluous headerMianhan Liu1-3/+0
arch_topology.c hasn't use any macro or function declared in linux/percpu.h, linux/smp.h and linux/string.h. Thus, these files can be removed from arch_topology.c safely without affecting the compilation of the drivers/base/ module Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Link: https://lore.kernel.org/r/20210928193138.24192-1-liumh1@shanghaitech.edu.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05driver core: use NUMA_NO_NODE during device_initializeMax Gurtovoy1-1/+1
Don't use (-1) constant for setting initial device node. Instead, use the generic NUMA_NO_NODE definition to indicate that "no node id specified". Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Link: https://lore.kernel.org/r/20211004133453.18881-1-mgurtovoy@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-05driver core: Fix possible memory leak in device_link_add()Yang Yingliang1-3/+1
I got memory leak as follows: unreferenced object 0xffff88801f0b2200 (size 64): comm "i2c-lis2hh12-21", pid 5455, jiffies 4294944606 (age 15.224s) hex dump (first 32 bytes): 72 65 67 75 6c 61 74 6f 72 3a 72 65 67 75 6c 61 regulator:regula 74 6f 72 2e 30 2d 2d 69 32 63 3a 31 2d 30 30 31 tor.0--i2c:1-001 backtrace: [<00000000bf5b0c3b>] __kmalloc_track_caller+0x19f/0x3a0 [<0000000050da42d9>] kvasprintf+0xb5/0x150 [<000000004bbbed13>] kvasprintf_const+0x60/0x190 [<00000000cdac7480>] kobject_set_name_vargs+0x56/0x150 [<00000000bf83f8e8>] dev_set_name+0xc0/0x100 [<00000000cc1cf7e3>] device_link_add+0x6b4/0x17c0 [<000000009db9faed>] _regulator_get+0x297/0x680 [<00000000845e7f2b>] _devm_regulator_get+0x5b/0xe0 [<000000003958ee25>] st_sensors_power_enable+0x71/0x1b0 [st_sensors] [<000000005f450f52>] st_accel_i2c_probe+0xd9/0x150 [st_accel_i2c] [<00000000b5f2ab33>] i2c_device_probe+0x4d8/0xbe0 [<0000000070fb977b>] really_probe+0x299/0xc30 [<0000000088e226ce>] __driver_probe_device+0x357/0x500 [<00000000c21dda32>] driver_probe_device+0x4e/0x140 [<000000004e650441>] __device_attach_driver+0x257/0x340 [<00000000cf1891b8>] bus_for_each_drv+0x166/0x1e0 When device_register() returns an error, the name allocated in dev_set_name() will be leaked, the put_device() should be used instead of kfree() to give up the device reference, then the name will be freed in kobject_cleanup() and the references of consumer and supplier will be decreased in device_link_release_fn(). Fixes: 287905e68dd2 ("driver core: Expose device link details in sysfs") Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20210930085714.2057460-1-yangyingliang@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-04Merge 5.15-rc4 into driver-core-nextGreg Kroah-Hartman4-28/+77
We need the driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-03Merge tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-coreLinus Torvalds1-27/+63
Pull driver core fixes from Greg KH: "Here are some driver core and kernfs fixes for reported issues for 5.15-rc4. These fixes include: - kernfs positive dentry bugfix - debugfs_create_file_size error path fix - cpumask sysfs file bugfix to preserve the user/kernel abi (has been reported multiple times.) - devlink fixes for mdiobus devices as reported by the subsystem maintainers. Also included in here are some devlink debugging changes to make it easier for people to report problems when asked. They have already helped with the mdiobus and other subsystems reporting issues. All of these have been linux-next for a while with no reported issues" * tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: also call kernfs_set_rev() for positive dentry driver core: Add debug logs when fwnode links are added/deleted driver core: Create __fwnode_link_del() helper function driver core: Set deferred probe reason when deferred by driver core net: mdiobus: Set FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD for mdiobus parents driver core: fw_devlink: Add support for FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD driver core: fw_devlink: Improve handling of cyclic dependencies cpumask: Omit terminating null byte in cpumap_print_{list,bitmask}_to_buf debugfs: debugfs_create_file_size(): use IS_ERR to check for error
2021-09-28driver core: Add debug logs when fwnode links are added/deletedSaravana Kannan1-0/+4
This will help with debugging fw_devlink issues. Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210915172808.620546-4-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-28driver core: Create __fwnode_link_del() helper functionSaravana Kannan1-16/+19
The same code is repeated in multiple locations. Create a helper function for it. Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210915172808.620546-3-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-28driver core: Set deferred probe reason when deferred by driver coreSaravana Kannan1-6/+9
When the driver core defers the probe of a device, set the deferred probe reason so that it's easier to debug. The deferred probe reason is available in debugfs under devices_deferred. Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210915172808.620546-2-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-24Merge tag 'devprop-5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-0/+3
Pull device properties framework fix from Rafael Wysocki: "Fix software node refcount imbalance on device removal (Laurentiu Tudor)" * tag 'devprop-5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: software node: balance refcount for managed software nodes
2021-09-23driver core: fw_devlink: Add support for FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADDSaravana Kannan1-0/+19
If a parent device is also a supplier to a child device, fw_devlink=on by design delays the probe() of the child device until the probe() of the parent finishes successfully. However, some drivers of such parent devices (where parent is also a supplier) expect the child device to finish probing successfully as soon as they are added using device_add() and before the probe() of the parent device has completed successfully. One example of such a case is discussed in the link mentioned below. Add a flag to make fw_devlink=on not enforce these supplier-consumer relationships, so these drivers can continue working. Link: https://lore.kernel.org/netdev/CAGETcx_uj0V4DChME-gy5HGKTYnxLBX=TH2rag29f_p=UcG+Tg@mail.gmail.com/ Fixes: ea718c699055 ("Revert "Revert "driver core: Set fw_devlink=on by default""") Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210915170940.617415-3-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-21driver core: Clarify that dev_err_probe() is OK even w/out -EPROBE_DEFERDouglas Anderson1-0/+5
There is some debate about whether it's deemed acceptable to call dev_err_probe() if you know that the error code can never be -EPROBE_DEFER. Clarify in the function comments that this is OK. Specifically this makes us able to transform code like this: ret = do_something_that_cant_defer(); if (ret < 0) { dev_err(dev, "The foo failed to bar (%pe)\n", ERR_PTR(ret)); return ret; } to code like this: ret = do_something_that_cant_defer(); if (ret < 0) return dev_err_probe(dev, ret, "The foo failed to bar\n"); It is also possible that in the future folks might want a CONFIG option to strip out all probe error strings to save space (keeping non-probe errors) with the argument that probe errors rarely happen after bringup. Having probe errors reported with a consistent function would allow that. Cc: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20210916161931.1.I32bea713bd6c6fb419a24da73686145742b6c117@changeid Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-21driver core: fw_devlink: Improve handling of cyclic dependenciesSaravana Kannan1-5/+12
When we have a dependency of the form: Device-A -> Device-C Device-B Device-C -> Device-B Where, * Indentation denotes "child of" parent in previous line. * X -> Y denotes X is consumer of Y based on firmware (Eg: DT). We have cyclic dependency: device-A -> device-C -> device-B -> device-A fw_devlink current treats device-C -> device-B dependency as an invalid dependency and doesn't enforce it but leaves the rest of the dependencies as is. While the current behavior is necessary, it is not sufficient if the false dependency in this example is actually device-A -> device-C. When this is the case, device-C will correctly probe defer waiting for device-B to be added, but device-A will be incorrectly probe deferred by fw_devlink waiting on device-C to probe successfully. Due to this, none of the devices in the cycle will end up probing. To fix this, we need to go relax all the dependencies in the cycle like we already do in the other instances where fw_devlink detects cycles. A real world example of this was reported[1] and analyzed[2]. [1] - https://lore.kernel.org/lkml/0a2c4106-7f48-2bb5-048e-8c001a7c3fda@samsung.com/ [2] - https://lore.kernel.org/lkml/CAGETcx8peaew90SWiux=TyvuGgvTQOmO4BFALz7aj0Za5QdNFQ@mail.gmail.com/ Fixes: f9aa460672c9 ("driver core: Refactor fw_devlink feature") Cc: stable <stable@vger.kernel.org> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210915170940.617415-2-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-17Merge tag 'for-linus-5.15b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tipLinus Torvalds1-0/+10
Pull xen fixes from Juergen Gross: - The first hunk of a Xen swiotlb fixup series fixing multiple minor issues and doing some small cleanups - Some further Xen related fixes avoiding WARN() splats when running as Xen guests or dom0 - A Kconfig fix allowing the pvcalls frontend to be built as a module * tag 'for-linus-5.15b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: swiotlb-xen: drop DEFAULT_NSLABS swiotlb-xen: arrange to have buffer info logged swiotlb-xen: drop leftover __ref swiotlb-xen: limit init retries swiotlb-xen: suppress certain init retries swiotlb-xen: maintain slab count properly swiotlb-xen: fix late init retry swiotlb-xen: avoid double free xen/pvcalls: backend can be a module xen: fix usage of pmd_populate in mremap for pv guests xen: reset legacy rtc flag for PV domU PM: base: power: don't try to use non-existing RTC for storing data xen/balloon: use a kernel thread instead a workqueue
2021-09-16software node: balance refcount for managed software nodesLaurentiu Tudor1-0/+3
software_node_notify(), on KOBJ_REMOVE drops the refcount twice on managed software nodes, thus leading to underflow errors. Balance the refcount by bumping it in the device_create_managed_software_node() function. The error [1] was encountered after adding a .shutdown() op to our fsl-mc-bus driver. [1] pc : refcount_warn_saturate+0xf8/0x150 lr : refcount_warn_saturate+0xf8/0x150 sp : ffff80001009b920 x29: ffff80001009b920 x28: ffff1a2420318000 x27: 0000000000000000 x26: ffffccac15e7a038 x25: 0000000000000008 x24: ffffccac168e0030 x23: ffff1a2428a82000 x22: 0000000000080000 x21: ffff1a24287b5000 x20: 0000000000000001 x19: ffff1a24261f4400 x18: ffffffffffffffff x17: 6f72645f726f7272 x16: 0000000000000000 x15: ffff80009009b607 x14: 0000000000000000 x13: ffffccac16602670 x12: 0000000000000a17 x11: 000000000000035d x10: ffffccac16602670 x9 : ffffccac16602670 x8 : 00000000ffffefff x7 : ffffccac1665a670 x6 : ffffccac1665a670 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000ffffffff x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff1a2420318000 Call trace: refcount_warn_saturate+0xf8/0x150 kobject_put+0x10c/0x120 software_node_notify+0xd8/0x140 device_platform_notify+0x4c/0xb4 device_del+0x188/0x424 fsl_mc_device_remove+0x2c/0x4c rebofind sp.c__fsl_mc_device_remove+0x14/0x2c device_for_each_child+0x5c/0xac dprc_remove+0x9c/0xc0 fsl_mc_driver_remove+0x28/0x64 __device_release_driver+0x188/0x22c device_release_driver+0x30/0x50 bus_remove_device+0x128/0x134 device_del+0x16c/0x424 fsl_mc_bus_remove+0x8c/0x114 fsl_mc_bus_shutdown+0x14/0x20 platform_shutdown+0x28/0x40 device_shutdown+0x15c/0x330 __do_sys_reboot+0x218/0x2a0 __arm64_sys_reboot+0x28/0x34 invoke_syscall+0x48/0x114 el0_svc_common+0x40/0xdc do_el0_svc+0x2c/0x94 el0_svc+0x2c/0x54 el0t_64_sync_handler+0xa8/0x12c el0t_64_sync+0x198/0x19c ---[ end trace 32eb1c71c7d86821 ]--- Fixes: 151f6ff78cdf ("software node: Provide replacement for device_add_properties()") Reported-by: Jon Nettleton <jon@solid-run.com> Suggested-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com> Cc: 5.12+ <stable@vger.kernel.org> # 5.12+ [ rjw: Fix up the software_node_notify() invocation ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-09-14memblock: introduce saner 'memblock_free_ptr()' interfaceLinus Torvalds1-1/+1
The boot-time allocation interface for memblock is a mess, with 'memblock_alloc()' returning a virtual pointer, but then you are supposed to free it with 'memblock_free()' that takes a _physical_ address. Not only is that all kinds of strange and illogical, but it actually causes bugs, when people then use it like a normal allocation function, and it fails spectacularly on a NULL pointer: https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/ or just random memory corruption if the debug checks don't catch it: https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/ I really don't want to apply patches that treat the symptoms, when the fundamental cause is this horribly confusing interface. I started out looking at just automating a sane replacement sequence, but because of this mix or virtual and physical addresses, and because people have used the "__pa()" macro that can take either a regular kernel pointer, or just the raw "unsigned long" address, it's all quite messy. So this just introduces a new saner interface for freeing a virtual address that was allocated using 'memblock_alloc()', and that was kept as a regular kernel pointer. And then it converts a couple of users that are obvious and easy to test, including the 'xbc_nodes' case in lib/bootconfig.c that caused problems. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-14driver core: platform: Make use of the helper macro SET_RUNTIME_PM_OPS()Cai Huoqing1-2/+1
Use the helper macro SET_RUNTIME_PM_OPS() instead of the verbose operators ".runtime_suspend/.runtime_resume", because the SET_RUNTIME_PM_OPS() is a nice helper macro that could be brought in to make code a little clearer, a little more concise. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Link: https://lore.kernel.org/r/20210828090219.1177-1-caihuoqing@baidu.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-14PM: base: power: don't try to use non-existing RTC for storing dataJuergen Gross1-0/+10
If there is no legacy RTC device, don't try to use it for storing trace data across suspend/resume. Cc: <stable@vger.kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20210903084937.19392-2-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2021-09-10Merge branches 'pm-cpufreq', 'pm-sleep' and 'pm-em'Rafael J. Wysocki2-8/+5
* pm-cpufreq: cpufreq: intel_pstate: hybrid: Rework HWP calibration ACPI: CPPC: Introduce cppc_get_nominal_perf() * pm-sleep: PM: sleep: core: Avoid setting power.must_resume to false PM: sleep: wakeirq: drop useless parameter from dev_pm_attach_wake_irq() * pm-em: Documentation: power: include kernel-doc in Energy Model doc PM: EM: fix kernel-doc comments
2021-09-08Merge tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-0/+2
Pull more power management updates from Rafael Wysocki: "These are mostly ARM cpufreq driver updates, including one new MediaTek driver that has just passed all of the reviews, with the addition of a revert of a recent intel_pstate commit, some core cpufreq changes and a DT-related update of the operating performance points (OPP) support code. Specifics: - Add new cpufreq driver for the MediaTek MT6779 platform called mediatek-hw along with corresponding DT bindings (Hector.Yuan). - Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara Gopinath). - Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu policy flag (Taniya Das). - Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn Andersson). - Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV flag (Viresh Kumar). - Add new cpufreq driver callback to allow drivers to register with the Energy Model in a consistent way and make several drivers use it (Viresh Kumar). - Change the remaining users of the .ready() cpufreq driver callback to move the code from it elsewhere and drop it from the cpufreq core (Viresh Kumar). - Revert recent intel_pstate change adding HWP guaranteed performance change notification support to it that led to problems, because the notification in question is triggered prematurely on some systems (Rafael Wysocki). - Convert the OPP DT bindings to DT schema and clean them up while at it (Rob Herring)" * tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" cpufreq: mediatek-hw: Add support for CPUFREQ HW cpufreq: Add of_perf_domain_get_sharing_cpumask dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW cpufreq: Remove ready() callback cpufreq: sh: Remove sh_cpufreq_cpu_ready() cpufreq: acpi: Remove acpi_cpufreq_cpu_ready() cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support cpufreq: scmi: Use .register_em() to register with energy model cpufreq: vexpress: Use .register_em() to register with energy model cpufreq: scpi: Use .register_em() to register with energy model dt-bindings: opp: Convert to DT schema dt-bindings: Clean-up OPP binding node names in examples ARM: dts: omap: Drop references to opp.txt cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model cpufreq: omap: Use .register_em() to register with energy model cpufreq: mediatek: Use .register_em() to register with energy model cpufreq: imx6q: Use .register_em() to register with energy model ...
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-23/+204
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policyDavid Hildenbrand1-0/+30
Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordiantion across memory devices and we don't want to create zone-imbalances accidentially when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidential) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like: [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]... ^ movable memory due to early kernel memory ^ allows for more movable memory ... ^-----^ ... here ^ allows for more movable memory ... ^-----^ ... here While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with meta data more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand1-8/+10
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size with 512 MiB, all 128 MiB memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand1-5/+5
Let's track all present pages in each memory group. Especially, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decision for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand1-4/+155
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: track present early pages per zoneDavid Hildenbrand1-7/+7
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3. I. Goal The goal of this series is improving in-kernel auto-online support. It tackles the fundamental problems that: 1) We can create zone imbalances when onlining all memory blindly to ZONE_MOVABLE, in the worst case crashing the system. We have to know upfront how much memory we are going to hotplug such that we can safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE via "online_movable". This is far from practical and only applicable in limited setups -- like inside VMs under the RHV/oVirt hypervisor which will never hotplug more than 3 times the boot memory (and the limitation is only in place due to the Linux limitation). 2) We see more setups that implement dynamic VM resizing, hot(un)plugging memory to resize VM memory. In these setups, we might hotplug a lot of memory, but it might happen in various small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the primary driver of this upstream right now, performing such dynamic resizing NUMA-aware via multiple virtio-mem devices. Onlining all hotplugged memory to ZONE_NORMAL means we basically have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can easily run into zone imbalances when growing a VM. We want a mixture, and we want as much memory as reasonable/configured in ZONE_MOVABLE. Details regarding zone imbalances can be found at [1]. 3) Memory devices consist of 1..X memory block devices, however, the kernel doesn't really track the relationship. Consequently, also user space has no idea. We want to make per-device decisions. As one example, for memory hotunplug it doesn't make sense to use a mixture of zones within a single DIMM: we want all MOVABLE if possible, otherwise all !MOVABLE, because any !MOVABLE part will easily block the whole DIMM from getting hotunplugged. As another example, virtio-mem operates on individual units that span 1..X memory blocks. Similar to a DIMM, we want a unit to either be all MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however, all units of a virtio-mem device logically belong together and are managed (added/removed) by a single driver. We want as much memory of a virtio-mem device to be MOVABLE as possible. 4) We want memory onlining to be done right from the kernel while adding memory, not triggered by user space via udev rules; for example, this is reqired for fast memory hotplug for drivers that add individual memory blocks, like virito-mem. We want a way to configure a policy in the kernel and avoid implementing advanced policies in user space. The auto-onlining support we have in the kernel is not sufficient. All we have is a) online everything MOVABLE (online_movable) b) online everything !MOVABLE (online_kernel) c) keep zones contiguous (online). This series allows configuring c) to mean instead "online movable if possible according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy. II. Approach This series does 3 things: 1) Introduces the "auto-movable" online policy that initially operates on individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio to make a decision whether a memory block will be onlined to ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL memory does not allow for more MOVABLE memory (details in the patches). CMA memory is treated like MOVABLE memory. 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory groups and uses group information to make decisions in the "auto-movable" online policy across memory blocks of a single memory device (modeled as memory group). More details can be found in patch #3 or in the DIMM example below. 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by allowing ZONE_NORMAL memory within a dynamic memory group to allow for more ZONE_MOVABLE memory within the same memory group. The target use case is dynamic VM resizing using virtio-mem. See the virtio-mem example below. I remember that the basic idea of using a ratio to implement a policy in the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I lost the pointer to that discussion). For me, the main use case is using it along with virtio-mem (and DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the amount of memory we can hotunplug reliably again if we might eventually hotplug a lot of memory to a VM. III. Target Usage The target usage will be: 1) Linux boots with "mhp_default_online_type=offline" 2) User space (e.g., systemd unit) configures memory onlining (according to a config file and system properties), for example: * Setting memory_hotplug.online_policy=auto-movable * Setting memory_hotplug.auto_movable_ratio=301 * Setting memory_hotplug.auto_movable_numa_aware=true 3) User space enabled auto onlining via "echo online > /sys/devices/system/memory/auto_online_blocks" 4) User space triggers manual onlining of all already-offline memory blocks (go over offline memory blocks and set them to "online") IV. Example For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of 301% results in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-79: Movable (DIMM 0) Memory block 80-111: Movable (DIMM 1) Memory block 112-143: Movable (DIMM 2) Memory block 144-275: Normal (DIMM 3) Memory block 176-207: Normal (DIMM 4) ... all Normal (-> hotplugged Normal memory does not allow for more Movable memory) For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM will result in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-143: Movable (virtio-mem, first 12 GiB) Memory block 144: Normal (virtio-mem, next 128 MiB) Memory block 145-147: Movable (virtio-mem, next 384 MiB) Memory block 148: Normal (virtio-mem, next 128 MiB) Memory block 149-151: Movable (virtio-mem, next 384 MiB) ... Normal/Movable mixture as above (-> hotplugged Normal memory allows for more Movable memory within the same device) Which gives us maximum flexibility when dynamically growing/shrinking a VM in smaller steps. V. Doc Update I'll update the memory-hotplug.rst documentation, once the overhaul [1] is usptream. Until then, details can be found in patch #2. VI. Future Work 1) Use memory groups for ppc64 dlpar 2) Being able to specify a portion of (early) kernel memory that will be excluded from the ratio. Like "128 MiB globally/per node" are excluded. This might be helpful when starting VMs with extremely small memory footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting the first hotplugged units getting onlined to ZONE_MOVABLE. One alternative would be a trigger to not consider ZONE_DMA memory in the ratio. We'll have to see if this is really rrequired. 3) Indicate to user space that MOVABLE might be a bad idea -- especially relevant when memory ballooning without support for balloon compaction is active. This patch (of 9): For implementing a new memory onlining policy, which determines when to online memory blocks to ZONE_MOVABLE semi-automatically, we need the number of present early (boot) pages -- present pages excluding hotplugged pages. Let's track these pages per zone. Pass a page instead of the zone to adjust_present_page_count(), similar as adjust_managed_page_count() and derive the zone from the page. It's worth noting that a memory block to be offlined/onlined is either completely "early" or "not early". add_memory() and friends can only add complete memory blocks and we only online/offline complete (individual) memory blocks. Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: Hui Zhu <teawater@gmail.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport1-2/+0
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08Merge branch 'pm-cpufreq'Rafael J. Wysocki1-0/+2
* pm-cpufreq: Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" cpufreq: mediatek-hw: Add support for CPUFREQ HW cpufreq: Add of_perf_domain_get_sharing_cpumask dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW cpufreq: Remove ready() callback cpufreq: sh: Remove sh_cpufreq_cpu_ready() cpufreq: acpi: Remove acpi_cpufreq_cpu_ready() cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support cpufreq: scmi: Use .register_em() to register with energy model cpufreq: vexpress: Use .register_em() to register with energy model cpufreq: scpi: Use .register_em() to register with energy model cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model cpufreq: omap: Use .register_em() to register with energy model cpufreq: mediatek: Use .register_em() to register with energy model cpufreq: imx6q: Use .register_em() to register with energy model cpufreq: dt: Use .register_em() to register with energy model cpufreq: Add callback to register with energy model cpufreq: vexpress: Set CPUFREQ_IS_COOLING_DEV flag
2021-09-07PM: sleep: core: Avoid setting power.must_resume to falsePrasad Sodagudi1-1/+1
There are variables(power.may_skip_resume and dev->power.must_resume) and DPM_FLAG_MAY_SKIP_RESUME flags to control the resume of devices after a system wide suspend transition. Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its "noirq" and "early" resume callbacks to be skipped if the device can be left in suspend after a system-wide transition into the working state. PM core determines that the driver's "noirq" and "early" resume callbacks should be skipped or not with dev_pm_skip_resume() function by checking power.may_skip_resume variable. power.must_resume variable is getting set to false in __device_suspend() function without checking device's DPM_FLAG_MAY_SKIP_RESUME settings. In problematic scenario, where all the devices in the suspend_late stage are successful and some device can fail to suspend in suspend_noirq phase. So some devices successfully suspended in suspend_late stage are not getting chance to execute __device_suspend_noirq() to set dev->power.must_resume variable to true and not getting resumed in early_resume phase. Add a check for device's DPM_FLAG_MAY_SKIP_RESUME flag before setting power.must_resume variable in __device_suspend function. Fixes: 6e176bf8d461 ("PM: sleep: core: Do not skip callbacks in the resume phase") Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>