wireguard-linux - WireGuard for the Linux kernel

Age	Commit message (Collapse)	Author	Files	Lines
2025-01-29	lib/crc: simplify the kconfig options for CRC implementations	Eric Biggers	1	-102/+14
	Make the following simplifications to the kconfig options for choosing CRC implementations for CRC32 and CRC_T10DIF: 1. Make the option to disable the arch-optimized code be visible only when CONFIG_EXPERT=y. 2. Make a single option control the inclusion of the arch-optimized code for all enabled CRC variants. 3. Make CRC32_SARWATE (a.k.a. slice-by-1 or byte-by-byte) be the only generic CRC32 implementation. The result is there is now just one option, CRC_OPTIMIZATIONS, which is default y and can be disabled only when CONFIG_EXPERT=y. Rationale: 1. Enabling the arch-optimized code is nearly always the right choice. However, people trying to build the tiniest kernel possible would find some use in disabling it. Anything we add to CRC32 is de facto unconditional, given that CRC32 gets selected by something in nearly all kernels. And unfortunately enabling the arch CRC code does not eliminate the need to build the generic CRC code into the kernel too, due to CPU feature dependencies. The size of the arch CRC code will also increase slightly over time as more CRC variants get added and more implementations targeting different instruction set extensions get added. Thus, it seems worthwhile to still provide an option to disable it, but it should be considered an expert-level tweak. 2. Considering the use case described in (1), there doesn't seem to be sufficient value in making the arch-optimized CRC code be independently configurable for different CRC variants. Note also that multiple variants were already grouped together, e.g. CONFIG_CRC32 actually enables three different variants of CRC32. 3. The bit-by-bit implementation is uselessly slow, whereas slice-by-n for n=4 and n=8 use tables that are inconveniently large: 4096 bytes and 8192 bytes respectively, compared to 1024 bytes for n=1. Higher n gives higher instruction-level parallelism, so higher n easily wins on traditional microbenchmarks on most CPUs. However, the larger tables, which are accessed randomly, can be harmful in real-world situations where the dcache may be cold or useful data may need be evicted from the dcache. Meanwhile, today most architectures have much faster CRC32 implementations using dedicated CRC32 instructions or carryless multiplication instructions anyway, which make the generic code obsolete in most cases especially on long messages. Another reason for going with n=1 is that this is already what is used by all the other CRC variants in the kernel. CRC32 was unique in having support for larger tables. But as per the above this can be considered an outdated optimization. The standardization on slice-by-1 a.k.a. CRC32_SARWATE makes much of the code in lib/crc32.c unused. A later patch will clean that up. Link: https://lore.kernel.org/r/20250123212904.118683-2-ebiggers@kernel.org Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-01-29	fs: pack struct kstat better	Christoph Hellwig	1	-2/+2
	Move the change_cookie and subvol up to avoid two 4 byte holes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-28	x86/sev: Disable jump tables in SEV startup code	Ard Biesheuvel	1	-0/+4
	When retpolines and IBT are both disabled, the compiler is free to use jump tables to optimize switch instructions. However, these are emitted by Clang as absolute references into .rodata: jmp *-0x7dfffe90(,%r9,8) R_X86_64_32S .rodata+0x170 Given that this code will execute before that address in .rodata has even been mapped, it is guaranteed to crash a SEV-SNP guest in a way that is difficult to diagnose. So disable jump tables when building this code. It would be better if we could attach this annotation to the __head macro but this appears to be impossible. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250127114334.1045857-6-ardb+git@google.com
2025-01-28	tools/bootconfig: Fix the wrong format specifier	Luo Yifan	1	-2/+2
	Use '%u' instead of '%d' for unsigned int. Link: https://lore.kernel.org/all/20241105011048.201629-1-luoyifan@cmss.chinamobile.com/ Fixes: 973780011106 ("tools/bootconfig: Suppress non-error messages") Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-28	treewide: const qualify ctl_tables where applicable	Joel Granados	106	-114/+114
	Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-27	fuse: prevent disabling io-uring on active connections	Bernd Schubert	1	-5/+6
	The enable_uring module parameter allows administrators to enable/disable io-uring support for FUSE at runtime. However, disabling io-uring while connections already have it enabled can lead to an inconsistent state. Fix this by keeping io-uring enabled on connections that were already using it, even if the module parameter is later disabled. This ensures active FUSE mounts continue to function correctly. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: enable fuse-over-io-uring	Bernd Schubert	2	-2/+4
	All required parts are handled now, fuse-io-uring can be enabled. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: block request allocation until io-uring init is complete	Bernd Schubert	4	-1/+10
	Avoid races and block request allocation until io-uring queues are ready. This is a especially important for background requests, as bg request completion might cause lock order inversion of the typical queue->lock and then fc->bg_lock fuse_request_end spin_lock(&fc->bg_lock); flush_bg_queue fuse_send_one fuse_uring_queue_fuse_req spin_lock(&queue->lock); Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: {io-uring} Prevent mount point hang on fuse-server termination	Bernd Schubert	2	-3/+75
	When the fuse-server terminates while the fuse-client or kernel still has queued URING_CMDs, these commands retain references to the struct file used by the fuse connection. This prevents fuse_dev_release() from being invoked, resulting in a hung mount point. This patch addresses the issue by making queued URING_CMDs cancelable, allowing fuse_dev_release() to proceed as expected and preventing the mount point from hanging. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: Allow to queue bg requests through io-uring	Bernd Schubert	3	-1/+136
	This prepares queueing and sending background requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: Allow to queue fg requests through io-uring	Bernd Schubert	2	-0/+188
	This prepares queueing and sending foreground requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: {io-uring} Make fuse_dev_queue_{interrupt,forget} non-static	Bernd Schubert	2	-2/+8
	These functions are also needed by fuse-over-io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	fuse: {io-uring} Handle teardown of ring entries	Bernd Schubert	3	-0/+267
	On teardown struct file_operations::uring_cmd requests need to be completed by calling io_uring_cmd_done(). Not completing all ring entries would result in busy io-uring tasks giving warning messages in intervals and unreleased struct file. Additionally the fuse connection and with that the ring can only get released when all io-uring commands are completed. Completion is done with ring entries that are a) in waiting state for new fuse requests - io_uring_cmd_done is needed b) already in userspace - io_uring_cmd_done through teardown is not needed, the request can just get released. If fuse server is still active and commits such a ring entry, fuse_uring_cmd() already checks if the connection is active and then complete the io-uring itself with -ENOTCONN. I.e. special handling is not needed. This scheme is basically represented by the ring entry state FRRS_WAIT and FRRS_USERSPACE. Entries in state: - FRRS_INIT: No action needed, do not contribute to ring->queue_refs yet - All other states: Are currently processed by other tasks, async teardown is needed and it has to wait for the two states above. It could be also solved without an async teardown task, but would require additional if conditions in hot code paths. Also in my personal opinion the code looks cleaner with async teardown. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	vfio/nvgrace-gpu: Add GB200 SKU to the devid table	Ankit Agrawal	1	-0/+2
	NVIDIA is productizing the new Grace Blackwell superchip SKU bearing device ID 0x2941. Add the SKU devid to nvgrace_gpu_vfio_pci_table. CC: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27	vfio/nvgrace-gpu: Check the HBM training and C2C link status	Ankit Agrawal	1	-0/+72
	In contrast to Grace Hopper systems, the HBM training has been moved out of the UEFI on the Grace Blackwell systems. This reduces the system bootup time significantly. The onus of checking whether the HBM training has completed thus falls on the module. The HBM training status can be determined from a BAR0 register. Similarly, another BAR0 register exposes the status of the CPU-GPU chip-to-chip (C2C) cache coherent interconnect. Based on testing, 30s is determined to be sufficient to ensure initialization completion on all the Grace based systems. Thus poll these register and check for 30s. If the HBM training is not complete or if the C2C link is not ready, fail the probe. While the time is not required on Grace Hopper systems, it is beneficial to make the check to ensure the device is in an expected state. Hence keeping it generalized to both the generations. Ensure that the BAR0 is enabled before accessing the registers. CC: Alex Williamson <alex.williamson@redhat.com> CC: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27	vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM	Ankit Agrawal	1	-22/+45
	There is a HW defect on Grace Hopper (GH) to support the Multi-Instance GPU (MIG) feature [1] that necessiated the presence of a 1G region carved out from the device memory and mapped as uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. The Grace Blackwell systems (GB) differ from GH systems in the following aspects: 1. The aforementioned HW defect is fixed on GB systems. 2. There is a usable BAR1 (region 2 and 3) on GB systems for the GPUdirect RDMA feature [2]. This patch accommodate those GB changes by showing the 64b physical device BAR1 (region2 and 3) to the VM instead of the fake one. This takes care of both the differences. Moreover, the entire device memory is exposed on GB as cacheable to the VM as there is no carveout required. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2] Cc: Kevin Tian <kevin.tian@intel.com> CC: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27	vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem	Ankit Agrawal	1	-0/+28
	NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a continuation with the Grace Hopper (GH) superchip that provides a cache coherent access to CPU and GPU to each other's memory with an internal proprietary chip-to-chip cache coherent interconnect. There is a HW defect on GH systems to support the Multi-Instance GPU (MIG) feature [1] that necessiated the presence of a 1G region with uncached mapping carved out from the device memory. The 1G region is shown as a fake BAR (comprising region 2 and 3) to workaround the issue. This is fixed on the GB systems. The presence of the fix for the HW defect is communicated by the device firmware through the DVSEC PCI config register with ID 3. The module reads this to take a different codepath on GB vs GH. Scan through the DVSEC registers to identify the correct one and use it to determine the presence of the fix. Save the value in the device's nvgrace_gpu_pci_core_device structure. Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1] CC: Jason Gunthorpe <jgg@nvidia.com> CC: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27	fuse: Add io-uring sqe commit and fetch support	Bernd Schubert	3	-0/+457
	This adds support for fuse request completion through ring SQEs (FUSE_URING_CMD_COMMIT_AND_FETCH handling). After committing the ring entry it becomes available for new fuse requests. Handling of requests through the ring (SQE/CQE handling) is complete now. Fuse request data are copied through the mmaped ring buffer, there is no support for any zero copy yet. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27	virtio_blk: Add support for transport error recovery	Israel Rukshin	1	-3/+25
	Add support for proper cleanup and re-initialization of virtio-blk devices during transport reset error recovery flow. This enhancement includes: - Pre-reset handler (reset_prepare) to perform device-specific cleanup - Post-reset handler (reset_done) to re-initialize the device These changes allow the device to recover from various reset scenarios, ensuring proper functionality after a reset event occurs. Without this implementation, the device cannot properly recover from resets, potentially leading to undefined behavior or device malfunction. This feature has been tested using PCI transport with Function Level Reset (FLR) as an example reset mechanism. The reset can be triggered manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset). Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	virtio_pci: Add support for PCIe Function Level Reset	Israel Rukshin	3	-25/+118
	Implement support for Function Level Reset (FLR) in virtio_pci devices. This change adds reset_prepare and reset_done callbacks, allowing drivers to properly handle FLR operations. Without this patch, performing and recovering from an FLR is not possible for virtio_pci devices. This implementation ensures proper FLR handling and recovery for both physical and virtual functions. The device reset can be triggered in case of error or manually via sysfs: echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-2-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	vhost/net: Set num_buffers for virtio 1.0	Akihiko Odaki	1	-1/+4
	The specification says the device MUST set num_buffers to 1 if VIRTIO_NET_F_MRG_RXBUF has not been negotiated. Fixes: 41e3e42108bc ("vhost/net: enable virtio 1.0") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240915-v1-v1-1-f10d2cb5e759@daynix.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	vdpa/octeon_ep: read vendor-specific PCI capability	Shijith Thotton	3	-2/+58
	Added support to read the vendor-specific PCI capability to identify the type of device being emulated. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-4-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	virtio-pci: define type and header for PCI vendor data	Shijith Thotton	1	-0/+14
	Added macro definition for VIRTIO_PCI_CAP_VENDOR_CFG to identify the PCI vendor data type in the virtio_pci_cap structure. Defined a new struct virtio_pci_vndr_data for the vendor data capability header as per the specification. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-3-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	vdpa/octeon_ep: handle device config change events	Satha Rao	1	-0/+8
	The first interrupt of the device is used to notify the host about device configuration changes, such as link status updates. The ISR configuration area is updated to indicate a config change event when triggered. Signed-off-by: Satha Rao <skoteshwar@marvell.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-2-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	vdpa/octeon_ep: enable support for multiple interrupts per device	Shijith Thotton	3	-39/+62
	Updated the driver to utilize all the MSI-X interrupt vectors supported by each OCTEON endpoint VF, instead of relying on a single vector. Enabling more interrupts allows packets from multiple rings to be distributed across multiple cores, improving parallelism and performance. Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-1-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	vdpa: solidrun: Replace deprecated PCI functions	Philipp Stanner	1	-29/+28
	The PCI functions pcim_iomap_regions() pcim_iounmap_regions() pcim_iomap_table() have been deprecated by the PCI subsystem. Replace these functions with their successors pcim_iomap_region() and pcim_iounmap_region(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> Message-Id: <20241219094428.21511-2-phasta@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
2025-01-27	s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM)	David Hildenbrand	2	-8/+32
	Let's add support for including virtio-mem device RAM in the crash dump, setting NEED_PROC_VMCORE_DEVICE_RAM, and implementing elfcorehdr_fill_device_ram_ptload_elf64(). To avoid code duplication, factor out the code to fill a PT_LOAD entry. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-13-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM	David Hildenbrand	2	-0/+89
	Let's implement the get_device_ram() vmcore callback, so architectures that select NEED_PROC_VMCORE_NEED_DEVICE_RAM, like s390 soon, can include that memory in a crash dump. Merge ranges, and process ranges that might contain a mixture of plugged and unplugged, to reduce the total number of ranges. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-12-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	virtio-mem: remember usable region size	David Hildenbrand	1	-3/+7
	Let's remember the usable region size, which will be helpful in kdump mode next. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-11-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	virtio-mem: mark device ready before registering callbacks in kdump mode	David Hildenbrand	1	-2/+3
	After the callbacks are registered we may immediately get a callback. So mark the device ready before registering the callbacks. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-10-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel	David Hildenbrand	3	-0/+183
	s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in the crashed kernel. RAM provided by memory devices such as virtio-mem can only be detected using the device driver; when vmcore_init() is called, these device drivers are usually not loaded yet, or the devices did not get probed yet. Consequently, on s390 these RAM ranges will not be included in the crash dump, which makes the dump partially corrupt and is unfortunate. Instead of deferring the vmcore_init() call, to an (unclear?) later point, let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as the device drivers probe the device and get access to this information. Then, we'll add these ranges to the vmcore, adding more PT_LOAD entries and updating the offsets+vmcore size. Use a separate Kconfig option to be set by an architecture to include this code only if the arch really needs it. Further, we'll make the config depend on the relevant drivers (i.e., virtio_mem) once they implement support (next). The alternative of having a PROVIDE_PROC_VMCORE_DEVICE_RAM config option was dropped for now for simplicity. The current target use case is s390, which only creates an elf64 elfcore, so focusing on elf64 is sufficient. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-9-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: factor out freeing a list of vmcore ranges	David Hildenbrand	2	-8/+12
	Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-8-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list	David Hildenbrand	2	-19/+16
	Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-7-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: move vmcore definitions out of kcore.h	David Hildenbrand	3	-23/+23
	These vmcore defines are not related to /proc/kcore, move them out. We'll move "struct vmcoredd_node" to vmcore.c, because it is only used internally. While "struct vmcore" is only used internally for now, we're planning on using it from inline functions in crash_dump.h next, so move it to crash_dump.h. While at it, rename "struct vmcore" to "struct vmcore_range", which is a more suitable name and will make the usage of it outside of vmcore.c clearer. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-6-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: prefix all pr_* with "vmcore:"	David Hildenbrand	1	-1/+3
	Let's use "vmcore: " as a prefix, converting the single "Kdump: vmcore not initialized" one to effectively be "vmcore: not initialized". Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-5-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: disallow vmcore modifications while the vmcore is open	David Hildenbrand	1	-23/+34
	The vmcoredd_update_size() call and its effects (size/offset changes) are currently completely unsynchronized, and will cause trouble when performed concurrently, or when done while someone is already reading the vmcore. Let's protect all vmcore modifications by the vmcore_mutex, disallow vmcore modifications while the vmcore is open, and warn on vmcore modifications after the vmcore was already opened once: modifications while the vmcore is open are unsafe, and modifications after the vmcore was opened indicates trouble. Properly synchronize against concurrent opening of the vmcore. No need to grab the mutex during mmap()/read(): after we opened the vmcore, modifications are impossible. It's worth noting that modifications after the vmcore was opened are completely unexpected, so failing if open, and warning if already opened (+closed again) is good enough. This change not only handles concurrent adding of device dumps + concurrent reading of the vmcore properly, it also prepares for other mechanisms that will modify the vmcore. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-4-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex	David Hildenbrand	1	-9/+8
	Now that we have a mutex that synchronizes against opening of the vmcore, let's use that one to replace vmcoredd_mutex: there is no need to have two separate ones. This is a preparation for properly preventing vmcore modifications after the vmcore was opened. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-3-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27	fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex	David Hildenbrand	1	-7/+8
	We want to protect vmcore modifications from concurrent opening of the vmcore, and also serialize vmcore modification. (a) We can currently modify the vmcore after it was opened. This can happen if a vmcoredd is added after the vmcore module was initialized and already opened by user space. We want to fix that and prepare for new code wanting to serialize against concurrent opening. (b) To handle it cleanly we need to protect the modifications against concurrent opening. As the modifications end up allocating memory and can sleep, we cannot rely on the spinlock. Let's convert the spinlock into a mutex to prepare for further changes. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-2-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-26	LoongArch: Extend the maximum number of watchpoints	Tiezhu Yang	2	-3/+13
	The maximum number of load/store watchpoints and fetch instruction watchpoints is 14 each according to LoongArch Reference Manual, so extend the maximum number of watchpoints from 8 to 14 for ptrace. By the way, just simply change 8 to 14 for the definition in struct user_watch_state at the beginning, but it may corrupt uapi, then add a new struct user_watch_state_v2 directly. As far as I can tell, the only users for this struct in the userspace are GDB and LLDB, there are no any problems of software compatibility between the application and kernel according to the analysis. The compatibility problem has been considered while developing and testing. When the applications in the userspace get watchpoint state, the length will be specified which is no bigger than the sizeof struct user_watch_state or user_watch_state_v2, the actual length is assigned as the minimal value of the application and kernel in the generic code of ptrace: kernel/ptrace.c: ptrace_regset(): kiov->iov_len = min(kiov->iov_len, (__kernel_size_t) (regset->n * regset->size)); if (req == PTRACE_GETREGSET) return copy_regset_to_user(task, view, regset_no, 0, kiov->iov_len, kiov->iov_base); else return copy_regset_from_user(task, view, regset_no, 0, kiov->iov_len, kiov->iov_base); For example, there are four kind of combinations, all of them work well. (1) "older kernel + older gdb", the actual length is 8+(8+8+4+4)8=200; (2) "newer kernel + newer gdb", the actual length is 8+(8+8+4+4)14=344; (3) "older kernel + newer gdb", the actual length is 8+(8+8+4+4)8=200; (4) "newer kernel + older gdb", the actual length is 8+(8+8+4+4)8=200. Link: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#control-and-status-registers-related-to-watchpoints Cc: stable@vger.kernel.org Fixes: 1a69f7a161a7 ("LoongArch: ptrace: Expose hardware breakpoints to debuggers") Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-01-26	LoongArch: Change 8 to 14 for LOONGARCH_MAX_{BRP,WRP}	Tiezhu Yang	3	-4/+76
	The maximum number of load/store watchpoints and fetch instruction watchpoints is 14 each according to LoongArch Reference Manual, so change 8 to 14 for the related code. Link: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#control-and-status-registers-related-to-watchpoints Cc: stable@vger.kernel.org Fixes: edffa33c7bb5 ("LoongArch: Add hardware breakpoints/watchpoints support") Reviewed-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-01-26	LoongArch: Add debugfs entries to switch SFB/TSO state	Huacai Chen	4	-7/+186
	We need to switch SFB (Store Fill Buffer) and TSO (Total Store Order) state at runtime to debug memory management and KVM virtualization, so add two debugfs entries "sfb_state" and "tso_state" under the directory /sys/kernel/debug/loongarch. Query SFB: cat /sys/kernel/debug/loongarch/sfb_state Enable SFB: echo 1 > /sys/kernel/debug/loongarch/sfb_state Disable SFB: echo 0 > /sys/kernel/debug/loongarch/sfb_state Query TSO: cat /sys/kernel/debug/loongarch/tso_state Switch TSO: echo [TSO] > /sys/kernel/debug/loongarch/tso_state Available [TSO] states: 0 (No Load No Store) 1 (All Load No Store) 3 (Same Load No Store) 4 (No Load All Store) 5 (All Load All Store) 7 (Same Load All Store) Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-01-26	LoongArch: Fix warnings during S3 suspend	Huacai Chen	3	-3/+2
	The enable_gpe_wakeup() function calls acpi_enable_all_wakeup_gpes(), and the later one may call the preempt_schedule_common() function, resulting in a thread switch and causing the CPU to be in an interrupt enabled state after the enable_gpe_wakeup() function returns, leading to the warnings as follow. [ C0] WARNING: ... at kernel/time/timekeeping.c:845 ktime_get+0xbc/0xc8 [ C0] ... [ C0] Call Trace: [ C0] [<90000000002243b4>] show_stack+0x64/0x188 [ C0] [<900000000164673c>] dump_stack_lvl+0x60/0x88 [ C0] [<90000000002687e4>] __warn+0x8c/0x148 [ C0] [<90000000015e9978>] report_bug+0x1c0/0x2b0 [ C0] [<90000000016478e4>] do_bp+0x204/0x3b8 [ C0] [<90000000025b1924>] exception_handlers+0x1924/0x10000 [ C0] [<9000000000343bbc>] ktime_get+0xbc/0xc8 [ C0] [<9000000000354c08>] tick_sched_timer+0x30/0xb0 [ C0] [<90000000003408e0>] __hrtimer_run_queues+0x160/0x378 [ C0] [<9000000000341f14>] hrtimer_interrupt+0x144/0x388 [ C0] [<9000000000228348>] constant_timer_interrupt+0x38/0x48 [ C0] [<90000000002feba4>] __handle_irq_event_percpu+0x64/0x1e8 [ C0] [<90000000002fed48>] handle_irq_event_percpu+0x20/0x80 [ C0] [<9000000000306b9c>] handle_percpu_irq+0x5c/0x98 [ C0] [<90000000002fd4a0>] generic_handle_domain_irq+0x30/0x48 [ C0] [<9000000000d0c7b0>] handle_cpu_irq+0x70/0xa8 [ C0] [<9000000001646b30>] handle_loongarch_irq+0x30/0x48 [ C0] [<9000000001646bc8>] do_vint+0x80/0xe0 [ C0] [<90000000002aea1c>] finish_task_switch.isra.0+0x8c/0x2a8 [ C0] [<900000000164e34c>] __schedule+0x314/0xa48 [ C0] [<900000000164ead8>] schedule+0x58/0xf0 [ C0] [<9000000000294a2c>] worker_thread+0x224/0x498 [ C0] [<900000000029d2f0>] kthread+0xf8/0x108 [ C0] [<9000000000221f28>] ret_from_kernel_thread+0xc/0xa4 [ C0] [ C0] ---[ end trace 0000000000000000 ]--- The root cause is acpi_enable_all_wakeup_gpes() uses a mutex to protect acpi_hw_enable_all_wakeup_gpes(), and acpi_ut_acquire_mutex() may cause a thread switch. Since there is no longer concurrent execution during loongarch_acpi_suspend(), we can call acpi_hw_enable_all_wakeup_gpes() directly in enable_gpe_wakeup(). The solution is similar to commit 22db06337f590d01 ("ACPI: sleep: Avoid breaking S3 wakeup due to might_sleep()"). Fixes: 366bb35a8e48 ("LoongArch: Add suspend (ACPI S3) support") Signed-off-by: Qunqin Zhao <zhaoqunqin@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-01-26	module: sign with sha512 instead of sha1 by default	Thorsten Leemhuis	1	-0/+1
	Switch away from using sha1 for module signing by default and use the more modern sha512 instead, which is what among others Arch, Fedora, RHEL, and Ubuntu are currently using for their kernels. Sha1 has not been considered secure against well-funded opponents since 2005[1]; since 2011 the NIST and other organizations furthermore recommended its replacement[2]. This is why OpenSSL on RHEL9, Fedora Linux 41+[3], and likely some other current and future distributions reject the creation of sha1 signatures, which leads to a build error of allmodconfig configurations: 80A20474797F0000:error:03000098:digital envelope routines:do_sigver_init:invalid digest:crypto/evp/m_sigver.c:342: make[4]: * [.../certs/Makefile:53: certs/signing_key.pem] Error 1 make[4]: * Deleting file 'certs/signing_key.pem' make[4]: * Waiting for unfinished jobs.... make[3]: * [.../scripts/Makefile.build:478: certs] Error 2 make[2]: * [.../Makefile:1936: .] Error 2 make[1]: * [.../Makefile:224: __sub-make] Error 2 make[1]: Leaving directory '...' make: *** [Makefile:224: __sub-make] Error 2 This change makes allmodconfig work again and sets a default that is more appropriate for current and future users, too. Link: https://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html [1] Link: https://csrc.nist.gov/projects/hash-functions [2] Link: https://fedoraproject.org/wiki/Changes/OpenSSLDistrustsha1SigVer [3] Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Tested-by: kdevops <kdevops@lists.linux.dev> [0] Link: https://github.com/linux-kdevops/linux-modules-kpd/actions/runs/11420092929/job/31775404330 [0] Link: https://lore.kernel.org/r/52ee32c0c92afc4d3263cea1f8a1cdc809728aff.1729088288.git.linux@leemhuis.info Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: Don't fail module loading when setting ro_after_init section RO failed	Christophe Leroy	1	-3/+4
	Once module init has succeded it is too late to cancel loading. If setting ro_after_init data section to read-only fails, all we can do is to inform the user through a warning. Reported-by: Thomas Gleixner <tglx@linutronix.de> Closes: https://lore.kernel.org/all/20230915082126.4187913-1-ruanjinjie@huawei.com/ Fixes: d1909c022173 ("module: Don't ignore errors from set_memory_XX()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/d6c81f38da76092de8aacc8c93c4c65cb0fe48b8.1733427536.git.christophe.leroy@csgroup.eu Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: Split module_enable_rodata_ro()	Christophe Leroy	3	-7/+13
	module_enable_rodata_ro() is called twice, once before module init to set rodata sections readonly and once after module init to set rodata_after_init section readonly. The second time, only the rodata_after_init section needs to be set to read-only, no need to re-apply it to already set rodata. Split module_enable_rodata_ro() in two. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/e3b6ff0df7eac281c58bb02cecaeb377215daff3.1733427536.git.christophe.leroy@csgroup.eu Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: sysfs: Use const 'struct bin_attribute'	Thomas Weißschuh	1	-10/+10
	The sysfs core is switching to 'const struct bin_attribute's. Prepare for that. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-6-e267275f0f37@weissschuh.net Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: sysfs: Add notes attributes through attribute_group	Thomas Weißschuh	1	-26/+28
	A kobject is meant to manage the lifecycle of some resource. However the module sysfs code only creates a kobject to get a "notes" subdirectory in sysfs. This can be achieved easier and cheaper by using a sysfs group. Switch the notes attribute code to such a group, similar to how the section allocation in the same file already works. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-5-e267275f0f37@weissschuh.net Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: sysfs: Simplify section attribute allocation	Thomas Weißschuh	1	-8/+10
	The existing allocation logic manually stuffs two allocations into one. This is hard to understand and of limited value, given that all the section names are allocated on their own anyways. Une one allocation per datastructure. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-4-e267275f0f37@weissschuh.net Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: sysfs: Drop 'struct module_sect_attr'	Thomas Weißschuh	1	-15/+11
	This is now an otherwise empty wrapper around a 'struct bin_attribute', not providing any functionality. Remove it. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-3-e267275f0f37@weissschuh.net Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-01-26	module: sysfs: Drop member 'module_sect_attr::address'	Thomas Weißschuh	1	-5/+2
	'struct bin_attribute' already contains the member 'private' to pass custom data to the attribute handlers. Use that instead of the custom 'address' member. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241227-sysfs-const-bin_attr-module-v2-2-e267275f0f37@weissschuh.net Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>