wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2024-04-30	spi: sun4i: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-4/+5
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Fix to the proper variable type 'unsigned long' while here. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/20240430114142.28551-7-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-30	spi: pic32: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-3/+3
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20240430114142.28551-6-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-30	spi: pic32-sqi: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-3/+3
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20240430114142.28551-5-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-30	spi: imx: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-10/+10
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20240430114142.28551-4-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-30	spi: fsl-lpspi: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-7/+7
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20240430114142.28551-3-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-30	spi: armada-3700: use 'time_left' variable with wait_for_completion_timeout()	Wolfram Sang	1	-4/+4
	There is a confusing pattern in the kernel to use a variable named 'timeout' to store the result of wait_for_completion_timeout() causing patterns like: timeout = wait_for_completion_timeout(...) if (!timeout) return -ETIMEDOUT; with all kinds of permutations. Use 'time_left' as a variable to make the code self explaining. Fix to the proper variable type 'unsigned long' while here. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20240430114142.28551-2-wsa+renesas@sang-engineering.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-26	spi: cadence-qspi: add mobileye,eyeq5-ospi compatible	Théo Lebrun	1	-0/+10
	Declare a new mobileye,eyeq5-ospi compatible. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240423-cdns-qspi-mbly-v4-4-3d2a7b535ad0@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-26	spi: cadence-qspi: add early busywait to cqspi_wait_for_bit()	Théo Lebrun	1	-8/+23
	Call readl_relaxed_poll_timeout() with no sleep at the start of cqspi_wait_for_bit(). If its short timeout expires, a sleeping readl_relaxed_poll_timeout() call takes the relay. The reason is to avoid hrtimer interrupts on the system. All read operations are expected to take less than 100µs. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240423-cdns-qspi-mbly-v4-3-3d2a7b535ad0@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-26	spi: cadence-qspi: add no-IRQ mode to indirect reads	Théo Lebrun	1	-4/+9
	Support reads through polling, without any IRQ. The main reason is performance; profiling shows that the first IRQ comes quickly on our specific hardware. Once this IRQ arrives, we poll until all data is retrieved. Avoid initial sleep to reduce IRQ count. Hide this behavior behind a quirk flag. This is confirmed through micro-benchmarks, but also end-to-end performance tests. Mobileye EyeQ5, octal flash, reading 235M on a UBIFS filesystem: - No optimizations, ~10.34s, ~22.7 MB/s, 199230 IRQs - CQSPI_SLOW_SRAM, ~10.34s, ~22.7 MB/s, 70284 IRQs - CQSPI_RD_NO_IRQ, ~9.37s, ~25.1 MB/s, 521 IRQs Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240423-cdns-qspi-mbly-v4-2-3d2a7b535ad0@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-26	spi: cadence-qspi: allow FIFO depth detection	Théo Lebrun	1	-7/+30
	If FIFO depth DT property is provided, check it matches what hardware reports and warn otherwise. Else, use hardware provided value. Hardware exposes FIFO depth indirectly because CQSPI_REG_SRAMPARTITION is partially read-only. Move probe cqspi->ddata assignment prior to cqspi_of_get_pdata() call. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240423-cdns-qspi-mbly-v4-1-3d2a7b535ad0@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-24	spi: spi-s3c64xx.c: Remove of_node_put for auto cleanup	Shivani Gupta	1	-3/+3
	Use the scope based of_node_put() cleanup in s3c64xx_spi_csinfo to automatically release the device node with the __free() cleanup handler Initialize data_np at the point of declaration for clarity of scope. This change reduces the risk of memory leaks and simplifies the code by removing manual node put call. Suggested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Shivani Gupta <shivani07g@gmail.com> Link: https://lore.kernel.org/r/20240418000505.731724-1-shivani07g@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-23	spi: mux: Fix master controller settings after mux select	Heikki Keranen	1	-0/+2
	In some cases SPI child devices behind spi-mux require different settings like: max_speed_hz, mode and bits_per_word. Typically the slave device driver puts the settings in place and calls spi_setup() once during probe and assumes they stay in place for all following spi transfers. However spi-mux forwarded spi_setup() -call to SPI master driver only when slave driver calls spi_setup(). If second slave device was accessed meanwhile and that driver called spi_setup(), the settings did not change back to the first spi device. In case of wrong max_speed_hz this caused spi trasfers to fail. This commit adds spi_setup() call after mux is changed. This way the right device specific parameters are set to the master driver. The fix has been tested by using custom hardware and debugging spi master driver speed settings. Co-authored-by: Petri Tauriainen <petri.tauriainen@bittium.com> Signed-off-by: Heikki Keranen <heikki.keranen@bittium.com> Link: https://lore.kernel.org/r/20240422114150.84426-1-heikki.keranen@bittium.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-21	spi: dt-bindings: armada-3700: convert to dtschema	Kousik Sanagavarapu	2	-25/+55
	Convert txt binding of marvell armada 3700 SoC spi controller to dtschema to allow for validation. Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240417052729.6612-1-five231003@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-19	spi: cs42l43: Correct name of ACPI property	Maciej Strozek	1	-1/+1
	Fixes: 439fbc97502a ("spi: cs42l43: Add bridged cs35l56 amplifiers") Signed-off-by: Maciej Strozek <mstrozek@opensource.cirrus.com> Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20240418103315.1487267-1-mstrozek@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-17	spi: oc-tiny: Remove unused of_gpio.h	Andy Shevchenko	1	-2/+0
	of_gpio.h is deprecated and subject to remove. The driver doesn't use it, simply remove the unused header. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240417104730.2510856-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-17	spi: cs42l43: Use devm_add_action_or_reset()	Charles Keepax	1	-8/+4
	Use devm_add_action_or_reset() rather than manually cleaning up on the error path. Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20240417093026.79396-1-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-17	spi: renesas,sh-msiof: Add r8a779h0 support	Geert Uytterhoeven	1	-0/+1
	Document support for the Clock-Synchronized Serial Interface with FIFO (MSIOF) in the Renesas R-Car V4M (R8A779H0) SoC. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/68a4d8ad8638c1133e21d0eef87e8982ddea3dd8.1713279687.git.geert+renesas@glider.be Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: cs42l43: Add bridged cs35l56 amplifiers	Maciej Strozek	2	-4/+120
	On some cs42l43 systems a couple of cs35l56 amplifiers are attached to the cs42l43's SPI and I2S. On Windows the cs42l43 is controlled by a SDCA class driver and these two amplifiers are controlled by firmware running on the cs42l43. However, under Linux the decision was made to interact with the cs42l43 directly, affording the user greater control over the audio system. However, this has resulted in an issue where these two bridged cs35l56 amplifiers are not populated in ACPI and must be added manually. Check for the presence of the "01fa-cirrus-sidecar-instances" property in the SDCA extension unit's ACPI properties to confirm the presence of these two amplifiers and if they exist add them manually onto the SPI bus. Reviewed-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Maciej Strozek <mstrozek@opensource.cirrus.com> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20240416100904.3738093-5-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: Update swnode based SPI devices to use the fwnode name	Charles Keepax	1	-0/+5
	Update the name for software node based SPI devices to use the fwnode name as the device name. This is helpful since swnode devices are usually added within the kernel, and the kernel often then requires a predictable name such that it can refer back to the device. Reviewed-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20240416100904.3738093-4-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: Switch to using is_acpi_device_node() in spi_dev_set_name()	Charles Keepax	1	-3/+4
	Use is_acpi_device_node() rather than checking ACPI_COMPANION(), such that when checking for other types of firmware node, the code can consistently do checks against the fwnode. Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20240416100904.3738093-3-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	gpio: swnode: Add ability to specify native chip selects for SPI	Charles Keepax	3	-0/+57
	SPI devices can specify a cs-gpios property to enumerate their chip selects. Under device tree, a zero entry in this property can be used to specify that a particular chip select is using the SPI controllers native chip select, for example: cs-gpios = <&gpio1 0 0>, <0>; Here, the second chip select is native. However, when using swnodes there is currently no way to specify a native chip select. The proposal here is to register a swnode_gpio_undefined software node, that can be specified to allow the indication of a native chip select. For example: static const struct software_node_ref_args device_cs_refs[] = { { .node = &device_gpiochip_swnode, .nargs = 2, .args = { 0, GPIO_ACTIVE_LOW }, }, { .node = &swnode_gpio_undefined, .nargs = 0, }, }; Register the swnode as the gpiolib is initialised and check in swnode_get_gpio_device() if the returned node matches swnode_gpio_undefined and return -ENOENT, which matches the behaviour of the device tree system when it encounters a 0 phandle. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20240416100904.3738093-2-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: Consistently use BIT for cs_index_mask (part 2)	Andy Shevchenko	1	-7/+3
	For some reason the commit 1209c5566f9b ("spi: Consistently use BIT for cs_index_mask") missed one place to change, do it here to finish the job. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240415184757.1198149-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: Introduce spi_for_each_valid_cs() in order of deduplication	Andy Shevchenko	1	-7/+9
	In order of deduplication and better maintenance introduce a new spi_for_each_valid_cs() helper macro. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240415193340.1279360-3-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-16	spi: Extract spi_toggle_csgpiod() helper for better maintanance	Andy Shevchenko	1	-24/+25
	The multi-CS support splits the comment and the code in the spi_set_cs(). To avoid this in the future extract spi_toggle_csgpiod() helper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240415193340.1279360-2-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-15	spi: pxa2xx: Move number of CS pins validation out of condition	Andy Shevchenko	1	-8/+7
	There is no need to allocate chip_data and then validate number of CS pins as it will have the same effect. Hence move number of CS pins validation out of condition in setup(). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240403171550.1074644-2-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-15	spi: altera: Drop unneeded MODULE_ALIAS	Krzysztof Kozlowski	1	-1/+0
	The ID table already has respective entry and MODULE_DEVICE_TABLE and creates proper alias for platform driver. Having another MODULE_ALIAS causes the alias to be duplicated. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20240414154859.126931-1-krzk@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-11	MAINTAINERS: adjust file entry in TEXAS INSTRUMENTS AUDIO (ASoC/HDA) DRIVERS	Lukas Bulwahn	1	-1/+1
	Commit 8167bd1c8a45 ("ASoC: dt-bindings: ti,pcm1681: Convert to dtschema") converts ti,pcm1681.txt to ti,pcm1681.yaml, but misses to adjust the file entry in TEXAS INSTRUMENTS AUDIO (ASoC/HDA) DRIVERS. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a broken reference. Adjust the file entry in TEXAS INSTRUMENTS AUDIO (ASoC/HDA) DRIVERS after this conversion. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Link: https://msgid.link/r/20240411075803.53657-1-lukas.bulwahn@redhat.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-10	spi: dt-bindings: cdns,qspi-nor: make cdns,fifo-depth optional	Théo Lebrun	1	-1/+0
	Make cdns,fifo-depth devicetree property optional. Value can be detected at runtime. Upper SRAMPARTITION register bits are read-only. Procedure to find FIFO depth is therefore to write 0xFFFFFFFF and read back to get amount of writeable bits. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240410-cdns-qspi-mbly-v3-3-7b7053449cf7@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-10	spi: dt-bindings: cdns,qspi-nor: add mobileye,eyeq5-ospi compatible	Théo Lebrun	1	-0/+1
	Add Mobileye EyeQ5 compatible. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240410-cdns-qspi-mbly-v3-2-7b7053449cf7@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-10	spi: dt-bindings: cdns,qspi-nor: sort compatibles alphabetically	Théo Lebrun	1	-3/+3
	Compatibles are ordered by date of addition. Switch to (deterministic) alphabetical ordering. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240410-cdns-qspi-mbly-v3-1-7b7053449cf7@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-10	spi: cadence-xspi: use for_each_available_child_of_node_scoped()	Kousik Sanagavarapu	1	-7/+1
	Refactor code for "is the node's child available?" check by using the corresponding macro instead, which reads more clearly. While at it, use scope-based cleanup instead of manual of_node_put() calls when getting platform data through cdns_xspi_of_get_plat_data(). This removes the unnecessary "node_child" declaration out of the loop's scope and auto cleans up "node_child" when it goes out of scope, even when we return early due to error. Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com> Link: https://msgid.link/r/20240410130205.179069-1-five231003@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-08	spi: cadence-qspi: minimise register accesses on each op if !DTR	Théo Lebrun	1	-2/+5
	cqspi_enable_dtr() is called for each operation, commands or not, reads or writes. It writes CQSPI_REG_CONFIG then waits for idle (three successful reads). Skip that in the no-DTR case if DTR is already disabled. It cannot be skipped in the DTR case as cqspi_setup_opcode_ext() writes to a register and we must wait for idle state. According to ftrace, the average cqspi_exec_mem_op() call goes from 85.4µs to 83.6µs when reading 235M over UBIFS on an octal flash. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240405-cdns-qspi-mbly-v2-6-956679866d6d@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-08	spi: cadence-qspi: store device data pointer in private struct	Théo Lebrun	1	-9/+6
	Avoid of_device_get_match_data() call on each IRQ and each read operation. Store pointer in `struct cqspi_st` device instance. End-to-end performance measurements improve with this patch. On a given octal flash, reading 235M over UBIFS is ~3.4% faster. During that read, the average cqspi_exec_mem_op() call goes from 85.4µs to 80.7µs according to ftrace. The worst case goes from 622.4µs to 615.2µs. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240405-cdns-qspi-mbly-v2-4-956679866d6d@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-08	spi: cadence-qspi: allow building for MIPS	Théo Lebrun	1	-1/+1
	The Cadence QSPI Controller driver is used on Mobileye EyeQ5 platform. Allow building on MIPS. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://msgid.link/r/20240405-cdns-qspi-mbly-v2-3-956679866d6d@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-07	Linux 6.9-rc3	Linus Torvalds	1	-1/+1

2024-04-06	x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk	Borislav Petkov (AMD)	1	-0/+1
	srso_alias_untrain_ret() is special code, even if it is a dummy which is called in the !SRSO case, so annotate it like its real counterpart, to address the following objtool splat: vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0 Fixes: 4535e1a4174c ("x86/bugs: Fix the SRSO mitigation on Zen3/4") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
2024-04-06	firewire: ohci: mask bus reset interrupts between ISR and bottom half	Adam Goldman	1	-1/+5
	In the FireWire OHCI interrupt handler, if a bus reset interrupt has occurred, mask bus reset interrupts until bus_reset_work has serviced and cleared the interrupt. Normally, we always leave bus reset interrupts masked. We infer the bus reset from the self-ID interrupt that happens shortly thereafter. A scenario where we unmask bus reset interrupts was introduced in 2008 in a007bb857e0b26f5d8b73c2ff90782d9c0972620: If OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we will unmask bus reset interrupts so we can log them. irq_handler logs the bus reset interrupt. However, we can't clear the bus reset event flag in irq_handler, because we won't service the event until later. irq_handler exits with the event flag still set. If the corresponding interrupt is still unmasked, the first bus reset will usually freeze the system due to irq_handler being called again each time it exits. This freeze can be reproduced by loading firewire_ohci with "modprobe firewire_ohci debug=-1" (to enable all debugging output). Apparently there are also some cases where bus_reset_work will get called soon enough to clear the event, and operation will continue normally. This freeze was first reported a few months after a007bb85 was committed, but until now it was never fixed. The debug level could safely be set to -1 through sysfs after the module was loaded, but this would be ineffectual in logging bus reset interrupts since they were only unmasked during initialization. irq_handler will now leave the event flag set but mask bus reset interrupts, so irq_handler won't be called again and there will be no freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will unmask the interrupt after servicing the event, so future interrupts will be caught as desired. As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be enabled through sysfs in addition to during initial module loading. However, when enabled through sysfs, logging of bus reset interrupts will be effective only starting with the second bus reset, after bus_reset_work has executed. Signed-off-by: Adam Goldman <adamg@pobox.com> Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
2024-04-05	stackdepot: rename pool_index to pool_index_plus_1	Peter Collingbourne	2	-6/+5
	Commit 3ee34eabac2a ("lib/stackdepot: fix first entry having a 0-handle") changed the meaning of the pool_index field to mean "the pool index plus 1". This made the code accessing this field less self-documenting, as well as causing debuggers such as drgn to not be able to easily remain compatible with both old and new kernels, because they typically do that by testing for presence of the new field. Because stackdepot is a debugging tool, we should make sure that it is debugger friendly. Therefore, give the field a different name to improve readability as well as enabling debugger backwards compatibility. This is needed in 6.9, which would otherwise become an odd release with the new semantics and old name so debuggers wouldn't recognize the new semantics there. Fixes: 3ee34eabac2a ("lib/stackdepot: fix first entry having a 0-handle") Link: https://lkml.kernel.org/r/20240402001500.53533-1-pcc@google.com Link: https://linux-review.googlesource.com/id/Ib3e70c36c1d230dd0a118dc22649b33e768b9f88 Signed-off-by: Peter Collingbourne <pcc@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	x86/mm/pat: fix VM_PAT handling in COW mappings	David Hildenbrand	2	-14/+39
	PAT handling won't do the right thing in COW mappings: the first PTE (or, in fact, all PTEs) can be replaced during write faults to point at anon folios. Reliably recovering the correct PFN and cachemode using follow_phys() from PTEs will not work in COW mappings. Using follow_phys(), we might just get the address+protection of the anon folio (which is very wrong), or fail on swap/nonswap entries, failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and track_pfn_copy(), not properly calling free_pfn_range(). In free_pfn_range(), we either wouldn't call memtype_free() or would call it with the wrong range, possibly leaking memory. To fix that, let's update follow_phys() to refuse returning anon folios, and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings if we run into that. We will now properly handle untrack_pfn() with COW mappings, where we don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if the first page was replaced by an anon folio, though: we'd have to store the cachemode in the VMA to make this work, likely growing the VMA size. For now, lets keep it simple and let track_pfn_copy() just fail in that case: it would have failed in the past with swap/nonswap entries already, and it would have done the wrong thing with anon folios. Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn(): <--- C reproducer ---> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <liburing.h> int main(void) { struct io_uring_params p = {}; int ring_fd; size_t size; char map; ring_fd = io_uring_setup(1, &p); if (ring_fd < 0) { perror("io_uring_setup"); return 1; } size = p.sq_off.array + p.sq_entries sizeof(unsigned); /* Map the submission queue ring MAP_PRIVATE / map = mmap(0, size, PROT_READ \| PROT_WRITE, MAP_PRIVATE, ring_fd, IORING_OFF_SQ_RING); if (map == MAP_FAILED) { perror("mmap"); return 1; } / We have at least one page. Let's COW it. / map = 0; pause(); return 0; } <--- C reproducer ---> On a system with 16 GiB RAM and swap configured: # ./iouring & # memhog 16G # killall iouring [ 301.552930] ------------[ cut here ]------------ [ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100 [ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g [ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1 [ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4 [ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100 [ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000 [ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282 [ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047 [ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200 [ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000 [ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000 [ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000 [ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000 [ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0 [ 301.565725] PKRU: 55555554 [ 301.565944] Call Trace: [ 301.566148] <TASK> [ 301.566325] ? untrack_pfn+0xf4/0x100 [ 301.566618] ? __warn+0x81/0x130 [ 301.566876] ? untrack_pfn+0xf4/0x100 [ 301.567163] ? report_bug+0x171/0x1a0 [ 301.567466] ? handle_bug+0x3c/0x80 [ 301.567743] ? exc_invalid_op+0x17/0x70 [ 301.568038] ? asm_exc_invalid_op+0x1a/0x20 [ 301.568363] ? untrack_pfn+0xf4/0x100 [ 301.568660] ? untrack_pfn+0x65/0x100 [ 301.568947] unmap_single_vma+0xa6/0xe0 [ 301.569247] unmap_vmas+0xb5/0x190 [ 301.569532] exit_mmap+0xec/0x340 [ 301.569801] __mmput+0x3e/0x130 [ 301.570051] do_exit+0x305/0xaf0 ... Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Wupeng Ma <mawupeng1@huawei.com> Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines") Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3") Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	MAINTAINERS: change vmware.com addresses to broadcom.com	Alexey Makhalov	2	-23/+28
	Update all remaining vmware.com email addresses to actual broadcom.com. Add corresponding .mailmap entries for maintainers who contributed in the past as the vmware.com address will start bouncing soon. Maintainership update. Jeff Sipek has left VMware, Nick Shi will be maintaining VMware PTP. Link: https://lkml.kernel.org/r/20240402232334.33167-1-alexey.makhalov@broadcom.com Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Acked-by: Ajay Kaher <ajay.kaher@broadcom.com> Acked-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Nick Shi <nick.shi@broadcom.com> Acked-by: Bryan Tan <bryan-bt.tan@broadcom.com> Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com> Acked-by: Vishal Bhakta <vishal.bhakta@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	selftests/mm: include strings.h for ffsl	Edward Liaw	1	-1/+1
	Got a compilation error on Android for ffsl after 91b80cc5b39f ("selftests: mm: fix map_hugetlb failure on 64K page size systems") included vm_util.h. Link: https://lkml.kernel.org/r/20240329185814.16304-1-edliaw@google.com Fixes: af605d26a8f2 ("selftests/mm: merge util.h into vm_util.h") Signed-off-by: Edward Liaw <edliaw@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	mm: vmalloc: fix lockdep warning	Uladzislau Rezki (Sony)	1	-30/+43
	A lockdep reports a possible deadlock in the find_vmap_area_exceed_addr_lock() function: ============================================ WARNING: possible recursive locking detected 6.9.0-rc1-00060-ged3ccc57b108-dirty #6140 Not tainted -------------------------------------------- drgn/455 is trying to acquire lock: ffff0000c00131d0 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124 but task is already holding lock: ffff0000c0011878 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&vn->busy.lock/1); lock(&vn->busy.lock/1); * DEADLOCK * indeed it can happen if the find_vmap_area_exceed_addr_lock() gets called concurrently because it tries to acquire two nodes locks. It was done to prevent removing a lowest VA found on a previous step. To address this a lowest VA is found first without holding a node lock where it resides. As a last step we check if a VA still there because it can go away, if removed, proceed with next lowest. [akpm@linux-foundation.org: fix comment typos, per Baoquan] Link: https://lkml.kernel.org/r/20240328140330.4747-1-urezki@gmail.com Fixes: 53becf32aec1 ("mm: vmalloc: support multiple nodes in vread_iter") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jens Axboe <axboe@kernel.dk> Tested-by: Omar Sandoval <osandov@fb.com> Reported-by: Jens Axboe <axboe@kernel.dk> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	mm: vmalloc: bail out early in find_vmap_area() if vmap is not init	Uladzislau Rezki (Sony)	1	-0/+3
	During the boot the s390 system triggers "spinlock bad magic" messages if the spinlock debugging is enabled: [ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0 [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8 [ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108 [ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108 [ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40 [ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150 [ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118 [ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68 [ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708 [ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8 [ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40 it happens because such system tries to access some vmap areas whereas the vmalloc initialization is not even yet done: [ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1 [ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 0.466270] Call Trace: [ 0.466470] dump_stack_lvl (lib/dump_stack.c:117) [ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115) [ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364) [ 0.466572] find_vm_area (mm/vmalloc.c:3150) [ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393) [ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660) [ 0.466651] paging_init (arch/s390/mm/init.c:97) [ 0.466677] setup_arch (arch/s390/kernel/setup.c:972) [ 0.466702] start_kernel (init/main.c:899) [ 0.466727] startup_continue (arch/s390/kernel/head64.S:35) [ 0.466811] INFO: lockdep is turned off. ... [ 0.718250] vmalloc init - busy lock init 0000000002871860 [ 0.718328] vmalloc init - busy lock init 00000000028731b8 Some background. It worked before because the lock that is in question was statically defined and initialized. As of now, the locks and data structures are initialized in the vmalloc_init() function. To address that issue add the check whether the "vmap_initialized" variable is set, if not find_vmap_area() bails out on entry returning NULL. Link: https://lkml.kernel.org/r/20240323141544.4150-1-urezki@gmail.com Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	init: open output files from cpio unpacking with O_LARGEFILE	John Sperbeck	1	-1/+1
	If a member of a cpio archive for an initrd or initrams is larger than 2Gb, we'll eventually fail to write to that file when we get to that limit, unless O_LARGEFILE is set. The problem can be seen with this recipe, assuming that BLK_DEV_RAM is not configured: cd /tmp dd if=/dev/zero of=BIGFILE bs=1048576 count=2200 echo BIGFILE \| cpio -o -H newc -R root:root > initrd.img kexec -l /boot/vmlinuz-$(uname -r) --initrd=initrd.img --reuse-cmdline kexec -e The console will show 'Initramfs unpacking failed: write error'. With the patch, the error is gone. Link: https://lkml.kernel.org/r/20240323152934.3307391-1-jsperbeck@google.com Signed-off-by: John Sperbeck <jsperbeck@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	mm/secretmem: fix GUP-fast succeeding on secretmem folios	David Hildenbrand	1	-2/+2
	folio_is_secretmem() currently relies on secretmem folios being LRU folios, to save some cycles. However, folios might reside in a folio batch without the LRU flag set, or temporarily have their LRU flag cleared. Consequently, the LRU flag is unreliable for this purpose. In particular, this is the case when secretmem_fault() allocates a fresh page and calls filemap_add_folio()->folio_add_lru(). The folio might be added to the per-cpu folio batch and won't get the LRU flag set until the batch was drained using e.g., lru_add_drain(). Consequently, folio_is_secretmem() might not detect secretmem folios and GUP-fast can succeed in grabbing a secretmem folio, crashing the kernel when we would later try reading/writing to the folio, because the folio has been unmapped from the directmap. Fix it by removing that unreliable check. Link: https://lkml.kernel.org/r/20240326143210.291116-2-david@redhat.com Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CABOYnLyevJeravW=QrH0JUPYEcDN160aZFb7kwndm-J2rmz0HQ@mail.gmail.com/ Debugged-by: Miklos Szeredi <miklos@szeredi.hu> Tested-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-05	nfsd: hold a lighter-weight client reference over CB_RECALL_ANY	Jeff Layton	1	-5/+2
	Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the client. While a callback job is technically an RPC that counter is really more for client-driven RPCs, and this has the effect of preventing the client from being unhashed until the callback completes. If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we can end up in a situation where the callback can't complete on the (now dead) callback channel, but the new client can't connect because the old client can't be unhashed. This usually manifests as a NFS4ERR_DELAY return on the CREATE_SESSION operation. The job is only holding a reference to the client so it can clear a flag after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of reference when dealing with the nfsdfs info files, but it should work appropriately here to ensure that the nfs4_client doesn't disappear. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-05	aio: Fix null ptr deref in aio_complete() wakeup	Kent Overstreet	1	-1/+1
	list_del_init_careful() needs to be the last access to the wait queue entry - it effectively unlocks access. Previously, finish_wait() would see the empty list head and skip taking the lock, and then we'd return - but the completion path would still attempt to do the wakeup after the task_struct pointer had been overwritten. Fixes: 71eb6b6b0ba9 ("fs/aio: obey min_nr when doing wakeups") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-fsdevel/CAHTA-ubfwwB51A5Wg5M6H_rPEQK9pNf8FkAGH=vr=FEkyRrtqw@mail.gmail.com/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/stable/20240331215212.522544-1-kent.overstreet%40linux.dev Link: https://lore.kernel.org/r/20240331215212.522544-1-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05	timers/migration: Return early on deactivation	Anna-Maria Behnsen	1	-0/+27
	Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") removed the logic to return early in tmigr_update_events() on deactivation. With this the problem with a not properly updated first global event in a hierarchy containing only a single group was fixed. But when having a look at this code path with a hierarchy with more than a single level, now unnecessary work is done (example is partially copied from the message of the commit mentioned above): [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = 0 migrator = NONE active = 0 active = NONE nextevt = T0i, T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 active idle idle idle 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2 [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". Without this early return the following steps happen in tmigr_update_events() when child = null and group = GRP0:0 : lock(GRP0:0->lock); timerqueue_del(GRP0:0, T0i); unlock(GRP0:0->lock); [GRP1:0] migrator = NONE active = NONE nextevt = T0:0, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 2) The change now propagates up to the top. Then tmigr_update_events() updates the group event of GRP0:0 and executes the following steps (child = GRP0:0 and group = GRP0:0): lock(GRP0:0->lock); lock(GRP1:0->lock); evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events in GRP0:0 ... update GRP1:0 group event and timerqueue ... unlock(GRP1:0->lock); unlock(GRP0:0->lock); So the dance in 1) with locking the GRP0:0->lock and removing the T0i from the timerqueue is redundand as this is done nevertheless in 2) when tmigr_next_groupevt(GRP0:0) is executed. Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") and add a condition into return path to skip the return only, when hierarchy contains a single group. Adapt comments accordingly. Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
2024-04-05	timers/migration: Fix ignored event due to missing CPU update	Frederic Weisbecker	1	-1/+4
	When a group event is updated with its expiry unchanged but a different CPU, that target change may go unnoticed and the event may be propagated up with a stale CPU value. The following depicts a scenario that has been actually observed: [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = TGRP1:0 (T0) / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T0 / \ 0 (T0) 1 (T1) idle idle 0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle, including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have the same expiry value. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T0 / \ 0 (T0) 1 (T1) idle idle 1) The migrator in GRP1:1 handles remotely T0. The event is dequeued from the top and T0 executed. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds none. But it updates the events from its groups, starting with GRP0:0 which now has T1 as its next event. So far so good. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 3) The migrator in GRP1:1 proceeds upward and updates the events in GRP1:0. The child event TGRP0:0 is found queued with the same expiry as before. And therefore it is left unchanged. However the target CPU is not the same but that fact is ignored so TGRP0:0 still points to CPU 0 when it should point to CPU 1. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = TGRP1:0 (T0) / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 4) The propagation has reached the top level and TGRP1:0, having TGRP0:0 as its first event, also wrongly points to CPU 0. TGRP1:0 is added to the top level group. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 5) The migrator in GRP1:1 dequeues the next event in top level pointing to CPU 0. But since it actually doesn't see any real event in CPU 0, it early returns. 6) T1 is left unhandled until either CPU 0 or CPU 1 wake up. Some other bad scenario may involve trees with just two levels. Fix this with unconditionally updating the CPU of the child event before considering to early return while updating a queued event with an unchanged expiry value. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain
2024-04-04	x86/cpufeatures: Add CPUID_LNX_5 to track recently added Linux-defined word	Sean Christopherson	2	-0/+4
	Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate compile-time assert in KVM to prevent direct lookups on the features in CPUID_LNX_5. KVM uses X86_FEATURE_* flags to manage guest CPUID, and so must translate features that are scattered by Linux from the Linux-defined bit to the hardware-defined bit, i.e. should never try to directly access scattered features in guest CPUID. Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a compile-time assert in KVM's CPUID infrastructure to ensure that future additions update cpuid_leafs along with NCAPINTS. No functional change intended. Fixes: 7f274e609f3d ("x86/cpufeatures: Add new word for scattered features") Cc: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>