aboutsummaryrefslogtreecommitdiffstatshomepage
AgeCommit message (Collapse)AuthorFilesLines
2026-04-28RDMA/mlx5: Fix error path fall-through in mlx5_ib_dev_res_srq_init()Junrui Luo1-0/+1
mlx5_ib_dev_res_srq_init() allocates two SRQs, s0 and s1. When ib_create_srq() fails for s1, the error branch destroys s0 but falls through and unconditionally assigns the freed s0 and the ERR_PTR s1 to devr->s0 and devr->s1. This leads to several problems: the lock-free fast path checks "if (devr->s1) return 0;" and treats the ERR_PTR as already initialised; users in mlx5_ib_create_qp() dereference the freed SRQ or ERR_PTR via to_msrq(devr->s0)->msrq.srqn; and mlx5_ib_dev_res_cleanup() dereferences the ERR_PTR and double-frees s0 on teardown. Fix by adding the same `goto unlock` in the s1 failure path. Cc: stable@vger.kernel.org Fixes: 5895e70f2e6e ("IB/mlx5: Allocate resources just before first QP/SRQ is created") Link: https://patch.msgid.link/r/SYBPR01MB7881E1E0970268BD69C0BA75AF2B2@SYBPR01MB7881.ausprd01.prod.outlook.com Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-04-28RDMA/rxe: Reject non-8-byte ATOMIC_WRITE payloadsMichael Bommarito1-1/+13
atomic_write_reply() at drivers/infiniband/sw/rxe/rxe_resp.c unconditionally dereferences 8 bytes at payload_addr(pkt): value = *(u64 *)payload_addr(pkt); check_rkey() previously accepted an ATOMIC_WRITE request with pktlen == resid == 0 because the length validation only compared pktlen against resid. A remote initiator that sets the RETH length to 0 therefore reaches atomic_write_reply() with a zero-byte logical payload, and the responder reads sizeof(u64) bytes from past the logical end of the packet into skb->head tailroom, then writes those 8 bytes into the attacker's MR via rxe_mr_do_atomic_write(). That is a remote disclosure of 4 bytes of kernel tailroom per probe (the other 4 bytes are the packet's own trailing ICRC). IBA oA19-28 defines ATOMIC_WRITE as exactly 8 bytes. Anything else is protocol-invalid. Hoist a strict length check into check_rkey() so the responder never reaches the unchecked dereference, and keep the existing WRITE-family length logic for the normal RDMA WRITE path. Reproduced on mainline with an unmodified rxe driver: a sustained zero-length ATOMIC_WRITE probe repeatedly leaks adjacent skb head-buffer bytes into the attacker's MR, including recognisable kernel strings and partial kernel-direct-map pointer words. With this patch applied the responder rejects the PDU and the MR stays all-zero. Cc: stable@vger.kernel.org Fixes: 034e285f8b99 ("RDMA/rxe: Make responder support atomic write on RC service") Link: https://patch.msgid.link/r/20260418162141.3610201-1-michael.bommarito@gmail.com Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-04-28RDMA/rxe: Reject unknown opcodes before ICRC processingMichael Bommarito1-0/+11
Even after applying commit 7244491dab34 ("RDMA/rxe: Validate pad and ICRC before payload_size() in rxe_rcv"), a single unauthenticated UDP packet can still trigger panic. That patch handled payload_size() underflow only for valid opcodes with short packets, not for packets carrying an unknown opcode. The unknown-opcode OOB read described below predates that commit and reaches back to the initial Soft RoCE driver. The check added there reads pkt->paylen < header_size(pkt) + bth_pad(pkt) + RXE_ICRC_SIZE where header_size(pkt) expands to rxe_opcode[pkt->opcode].length. The rxe_opcode[] array has 256 entries but is only populated for defined IB opcodes; any other entry (for example opcode 0xff) is zero-initialized, so length == 0 and the check degenerates to pkt->paylen < 0 + bth_pad(pkt) + RXE_ICRC_SIZE which does not constrain pkt->paylen enough. rxe_icrc_hdr() then computes rxe_opcode[pkt->opcode].length - RXE_BTH_BYTES which underflows when length == 0 and passes a huge value to rxe_crc32(), causing an out-of-bounds read of the skb payload. Reproduced on v7.0-rc7 with that fix applied, QEMU/KVM with CONFIG_RDMA_RXE=y and CONFIG_KASAN=y, after rdma link add rxe0 type rxe netdev eth0 A single 48-byte UDP packet to port 4791 with BTH opcode=0xff and QPN=IB_MULTICAST_QPN triggers: BUG: KASAN: slab-out-of-bounds in crc32_le+0x115/0x170 Read of size 1 at addr ... The buggy address is located 0 bytes to the right of allocated 704-byte region Call Trace: crc32_le+0x115/0x170 rxe_icrc_hdr.isra.0+0x226/0x300 rxe_icrc_check+0x13f/0x3a0 rxe_rcv+0x6e1/0x16e0 rxe_udp_encap_recv+0x20a/0x320 udp_queue_rcv_one_skb+0x7ed/0x12c0 Subsequent packets with the same shape fault on unmapped memory and panic the kernel. The trigger requires only module load and "rdma link add"; no QP, no connection, and no authentication. Fix this by rejecting packets whose opcode has no rxe_opcode[] entry, detected via the zero mask or zero length, before any length arithmetic runs. Cc: stable@vger.kernel.org Fixes: 8700e3e7c485 ("Soft RoCE driver") Link: https://patch.msgid.link/r/20260414111555.3386793-1-michael.bommarito@gmail.com Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-04-28Merge tag 'md-7.1-20260428' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into block-7.1Jens Axboe10-109/+251
Pull MD fixes from Yu Kuai: "Bug Fixes: - Fix a raid5 UAF on IO across the reshape position. - Avoid failing RAID1/RAID10 devices for invalid IO errors. - Fix RAID10 divide-by-zero when far_copies is zero. - Restore bitmap grow through sysfs. Cleanups: - Use mddev_is_dm() instead of open-coding gendisk checks. - Use ATTRIBUTE_GROUPS() for md default sysfs attributes. - Replace open-coded wait loops with wait_event helpers." * tag 'md-7.1-20260428' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md: use ATTRIBUTE_GROUPS() for md default sysfs attributes md: use mddev_is_dm() instead of open-coding gendisk checks md/raid1: replace wait loop with wait_event_idle() in raid1_write_request() md/md-bitmap: add a none backend for bitmap grow md/md-bitmap: split bitmap sysfs groups md: factor bitmap creation away from sysfs handling md: use mddev_lock_nointr() in mddev_suspend_and_lock_nointr() md: replace wait loop with wait_event() in md_handle_request() md/raid10: fix divide-by-zero in setup_geo() with zero far_copies md/raid1,raid10: don't fail devices for invalid IO errors MAINTAINERS: Add Xiao Ni as md/raid reviewer md/raid5: Fix UAF on IO across the reshape position
2026-04-28IB/hfi1: Fix potential use-after-free in PIO and SDMA map teardownLi RongQing2-2/+7
The current teardown logic for dd->pio_map and dd->sdma_map frees the structures while they might still be accessed by RCU readers. Although the pointer is nulled under a spinlock, the memory is reclaimed before waiting for the grace period to end. This patch fixes the sequence by: 1. Extracting the pointer under the lock. 2. Clearing the RCU-protected pointer. 3. Waiting for readers to finish with synchronize_rcu(). 4. Finally freeing the memory. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Link: https://patch.msgid.link/r/20260206050836.5890-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-04-28net: phy: dp83869: fix setting CLK_O_SEL field.Heiko Schocher1-1/+12
Table 7-121 in datasheet says we have to set register 0xc6 to value 0x10 before CLK_O_SEL can be modified. No more infos about this field found in datasheet. With this fix, setting of CLK_O_SEL field in IO_MUX_CFG register worked through dts property "ti,clk-output-sel" on a DP83869HMRGZR. Signed-off-by: Heiko Schocher <hs@nabladev.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 01db923e8377 ("net: phy: dp83869: Add TI dp83869 phy") Link: https://patch.msgid.link/20260425031339.3318-1-hs@nabladev.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-28mm/memfd_luo: report error when restoring a folio fails mid-loopDavid Carlier1-0/+1
memfd_luo_retrieve_folios() initialises err to -EIO, but the per-iteration calls to mem_cgroup_charge(), shmem_add_to_page_cache() and shmem_inode_acct_blocks() reuse and overwrite err. Once any iteration completes successfully, err becomes zero. If a later iteration's kho_restore_folio() returns NULL, the failure path jumps to put_folios without resetting err, so the function returns 0. The caller memfd_luo_retrieve() then takes the success path, sets args->file and reports the restore as successful, leaving userspace with a partially populated memfd and no indication that anything went wrong. Set err to -EIO in the kho_restore_folio() failure branch so the error is propagated to the caller. Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Link: https://patch.msgid.link/20260415052300.362539-1-devnexen@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-28kho: skip KHO for crash kernelEvangelos Petrongonas1-1/+1
kho_fill_kimage() unconditionally populates the kimage with KHO metadata for every kexec image type. When the image is a crash kernel, this can be problematic as the crash kernel can run in a small reserved region and the KHO scratch areas can sit outside it. The crash kernel then faults during kho_memory_init() when it tries phys_to_virt() on the KHO FDT address: Unable to handle kernel paging request at virtual address xxxxxxxx ... fdt_offset_ptr+... fdt_check_node_offset_+... fdt_first_property_offset+... fdt_get_property_namelen_+... fdt_getprop+... kho_memory_init+... mm_core_init+... start_kernel+... kho_locate_mem_hole() already skips KHO logic for KEXEC_TYPE_CRASH images, but kho_fill_kimage() was missing the same guard. As kho_fill_kimage() is the single point that populates image->kho.fdt and image->kho.scratch, fixing it here is sufficient for both arm64 and x86 as the FDT and boot_params path are bailing out when these fields are unset. Fixes: d7255959b69a ("kho: allow kexec load before KHO finalization") Signed-off-by: Evangelos Petrongonas <epetron@amazon.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260410011609.1103-1-epetron@amazon.de Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-28ntfs: fix invalid PTR_ERR() usage in __ntfs_bitmap_set_bits_in_run()Namjae Jeon1-8/+11
The Smatch reported a warning in __ntfs_bitmap_set_bits_in_run(): "warn: passing a valid pointer to 'PTR_ERR'" This occurs because the 'folio' variable might contain a valid pointer when jumping to the 'rollback' label, specifically when 'cnt <= 0' is detected during the subsequent page mapping loop. In such cases, calling PTR_ERR(folio) is incorrect as it does not contain an error code. Fix this by introducing an explicit 'err' variable to track the error status. This ensures that the rollback logic and the return value consistently use a proper error code regardless of the state of the folio pointer. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2026-04-28s390/mm: Fix phys_to_folio() usage in do_secure_storage_access()Heiko Carstens1-1/+1
In case of a Secure-Storage-Access exception the effective aka virtual address which caused the exception is contained within the TEID. do_secure_storage_access() incorrectly uses phys_to_folio() instead of virt_to_folio() to translate the virtual address to the corresponding folio. Fix this by using virt_to_folio() instead of phys_to_folio(). Fixes: 084ea4d611a3 ("s390/mm: add (non)secure page access exceptions handlers") Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-04-28s390/sclp: Remove SCLP_OFB Kconfig optionHeiko Carstens2-14/+0
Remove the SCLP_OFB Kconfig option and enable the guarded code unconditionally. This guards only a few lines of code, so the impact is very low while at the same time this reduces the large number of Kconfig options. Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-04-28MAINTAINERS: Replace one of the maintainers for s390/pciGerd Bayer1-1/+1
Add myself as co-maintainer for s390/pci, replacing Gerald Schaefer who has moved his focus to s390/mm. Thank you Gerald! Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-04-28s390/debug: Reject zero-length input in debug_input_flush_fn()Vasily Gorbik1-0/+5
debug_input_flush_fn() always copies one byte from the userspace buffer with copy_from_user() regardless of the supplied write length. A zero-length write therefore reads one byte beyond the caller's buffer. If the stale byte happens to be '-' or a digit the debug log is silently flushed. With an unmapped buffer the call returns -EFAULT. Reject zero-length writes before copying from userspace. Cc: stable@vger.kernel.org # v5.10+ Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-04-28s390/debug: Reject zero-length input before trimming a newlinePengpeng Hou1-0/+3
debug_get_user_string() duplicates the userspace buffer with memdup_user_nul() and then unconditionally looks at buffer[user_len - 1] to strip a trailing newline. A zero-length write reaches this helper unchanged, so the newline trim reads before the start of the allocated buffer. Reject empty writes before accessing the last input byte. Fixes: 66a464dbc8e0 ("[PATCH] s390: debug feature changes") Cc: stable@vger.kernel.org Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20260417073530.96002-1-pengpeng@iscas.ac.cn Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-04-28md: use ATTRIBUTE_GROUPS() for md default sysfs attributesAbd-Alrhman Masalkhi1-10/+2
Replace the md_default_group and md_attr_groups with ATTRIBUTE_GROUPS(). Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-4-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md: use mddev_is_dm() instead of open-coding gendisk checksAbd-Alrhman Masalkhi1-2/+2
Replace direct checks on mddev->gendisk with mddev_is_dm() in md_handle_request() and md_run(). Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/raid1: replace wait loop with wait_event_idle() in raid1_write_request()Abd-Alrhman Masalkhi1-11/+4
The wait loop is equivalent to wait_event_idle(); use it to improve readability. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/md-bitmap: add a none backend for bitmap growYu Kuai3-16/+137
Add a real none bitmap backend that exposes the common bitmap sysfs group and use it to keep bitmap/location available when an array has no bitmap. Then switch the bitmap location sysfs path to move only between none and the classic bitmap backend, using the no-sysfs bitmap helpers while merging or unmerging the internal bitmap sysfs group. This restores mdadm --grow bitmap addition through bitmap/location. Fixes: fb8cc3b0d9db ("md/md-bitmap: delay registration of bitmap_ops until creating bitmap") Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/md-bitmap: split bitmap sysfs groupsYu Kuai4-13/+40
Split the classic bitmap sysfs files into a common bitmap group with the location attribute and a separate internal bitmap group for the remaining files. At the same time, convert bitmap operations from a single sysfs group to a sysfs group array so backends can share part of their sysfs layout while adding backend-specific attributes separately. Switch the bitmap sysfs helpers to use sysfs_update_groups() for the add and update path, and remove groups in reverse order so shared named groups are unmerged before the last group removes the directory. Also make bitmap operation lookup depend only on the currently selected bitmap id matching the installed backend. This prepares the lookup path for a later registered none backend. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md: factor bitmap creation away from sysfs handlingYu Kuai1-29/+49
Factor bitmap creation and destruction into helpers that do not touch bitmap sysfs registration. This prepares the bitmap sysfs rework so callers such as the sysfs bitmap location path can create or destroy a bitmap backend without coupling that to sysfs group lifetime management. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md: use mddev_lock_nointr() in mddev_suspend_and_lock_nointr()Abd-Alrhman Masalkhi1-1/+1
This keeps mddev locking consistent and ensures that any future changes to locking behavior are done through the wrapper. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/r/20260415140319.376578-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md: replace wait loop with wait_event() in md_handle_request()Abd-Alrhman Masalkhi1-9/+1
The wait loop is equivalent to wait_event() and can be simplified by usaing it for improving readability. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/r/20260415140319.376578-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/raid10: fix divide-by-zero in setup_geo() with zero far_copiesJunrui Luo1-0/+2
setup_geo() extracts near_copies (nc) and far_copies (fc) from the user-provided layout parameter without checking for zero. When fc=0 with the "improved" far set layout selected, 'geo->far_set_size = disks / fc' triggers a divide-by-zero. Validate nc and fc immediately after extraction, returning -1 if either is zero. Fixes: 475901aff158 ("MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB7881A5E2556806CC1D318582AF232@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/raid1,raid10: don't fail devices for invalid IO errorsKeith Busch1-1/+6
BLK_STS_INVAL indicates the IO request itself was invalid, not that the device has failed. When raid1 treats this as a device error, it retries on alternate mirrors which fail the same way, eventually exceeding the read error threshold and removing the device from the array. This happens when stacking configurations bypass bio_split_to_limits() in the IO path: dm-raid calls md_handle_request() directly without going through md_submit_bio(), skipping the alignment validation that would otherwise reject invalid bios early. The invalid bio reaches the lower block layers, which fail the bio with BLK_STS_INVAL, and raid1 wrongly interprets this as a device failure. Add BLK_STS_INVAL to raid1_should_handle_error() so that invalid IO errors are propagated back to the caller rather than triggering device removal. This is consistent with the previous kernel behavior when alignment checks were done earlier in the direct-io path. Fixes: 5ff3f74e145adc7 ("block: simplify direct io validity check") Reported-by: Tomáš Trnka <trnka@scm.com> Closes: https://lore.kernel.org/linux-block/2982107.4sosBPzcNG@electra/ Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Tomáš Trnka <trnka@scm.com> Link: https://lore.kernel.org/r/20260416140345.3872265-1-kbusch@meta.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28MAINTAINERS: Add Xiao Ni as md/raid reviewerXiao Ni1-0/+1
I've been actively involved in the md subsystem, contributing bug fixes, performance improvements, and participating in code reviews. I will help improve patch review coverage and response time. Signed-off-by: Xiao Ni <xiao@kernel.org> Link: https://lore.kernel.org/r/20260414022956.48271-1-xiaoraid25@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28md/raid5: Fix UAF on IO across the reshape positionBenjamin Marzinski3-25/+14
If make_stripe_request() returns STRIPE_WAIT_RESHAPE, raid5_make_request() will free the cloned bio. But raid5_make_request() can call make_stripe_request() multiple times, writing to the various stripes. If that bio got added to the toread or towrite lists of a stripe disk in an earlier call to make_stripe_request(), then it's not safe to just free the bio if a later part of it is found to cross the reshape position. Doing so can lead to a UAF error, when bio_endio() is called on the bio for the earlier stripes. Instead, raid5_make_request() needs to wait until all parts of the bio have called bio_endio(). To do this, bios that cross the reshape position while the reshape can't make progress are flagged as needing to wait for all parts to complete. When raid5_make_request() has a bio that failed make_stripe_request() with STRIPE_WAIT_RESHAPE, it sets bi->bi_private to a completion struct and waits for completion after ending the bio. When the bio_endio() is called for the last time on a clone bio with bi->bi_private set, it wakes up the waiter. This guarantees that raid5_make_request() doesn't return until the cloned bio needing a retry for io across the reshape boundary is safely cleaned up. There is a simple reproducer available at [1]. Compile the kernel with KASAN for more useful reporting when the error is triggered (this is not necessary to see the bug). [1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5 Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20260408043548.1695157-1-bmarzins@redhat.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28lib/fonts: Fix bit position when rotating by 180 degreesThomas Zimmermann1-1/+1
Fix the horizontal bit position when rotating a glyph by 180°. The original code in rotate_ud() rounded the value in width up to a multiple of 8, aka the bit pitch, and calculated the rotated pixel from that value. The new code stores the glyph's pitch in bit_pitch, but fails to update the rotated pixel's output accordingly. Simply replacing the variable does this. The bug can be reproduced by setting a font with an unaligned width, such as sun12x22, like this: setfont sun12x22 echo 2 > /sys/class/graphics/fbcon/rotate Without the fix, the font looks distorted. Fixes: a30e9e6b018f ("lib/fonts: Refactor glyph-rotation helpers") Closes: https://lore.gitlab.freedesktop.org/drm-ai-reviews/review-patch7-20260407092555.58816-8-tzimmermann@suse.de/ Cc: dri-devel@lists.freedesktop.org Cc: linux-fbdev@vger.kernel.org Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-04-28fbdev: defio: Remove duplicate include of linux/module.hChen Ni1-1/+0
Remove duplicate inclusion of linux/module.h in fb_defio.c to clean up redundant code. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Helge Deller <deller@gmx.de>
2026-04-28fbdev: ipu-v3: clean up kernel-doc warningsRandy Dunlap1-5/+11
Correct all kernel-doc warnings: - fix a typedef kernel-doc comment - mark a list_head as private - use Returns: for function return values Warning: include/video/imx-ipu-image-convert.h:31 struct member 'list' not described in 'ipu_image_convert_run' Warning: include/video/imx-ipu-image-convert.h:40 function parameter 'ipu_image_convert_cb_t' not described in 'void' Warning: include/video/imx-ipu-image-convert.h:40 expecting prototype for ipu_image_convert_cb_t(). Prototype was for void() instead Warning: include/video/imx-ipu-image-convert.h:66 No description found for return value of 'ipu_image_convert_verify' Warning: include/video/imx-ipu-image-convert.h:90 No description found for return value of 'ipu_image_convert_prepare' Warning: include/video/imx-ipu-image-convert.h:125 No description found for return value of 'ipu_image_convert_queue' Warning: include/video/imx-ipu-image-convert.h:163 No description found for return value of 'ipu_image_convert' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-04-28net: mctp i2c: check length before marking flow activeWilliam A. Kennington III2-3/+5
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length sanity check. This function marks a new flow as active in the MCTP core. If the sanity check fails, mctp_i2c_xmit() returns early without calling mctp_i2c_lock_nest(). This results in a mismatched locking state: the flow is active, but the I2C bus lock was never acquired for it. When the flow is later released, mctp_i2c_release_flow() will see the active state and queue an unlock marker. The TX thread will then decrement midev->i2c_lock_count from 0, causing it to underflow to -1. This underflow permanently breaks the driver's locking logic, allowing future transmissions to occur without holding the I2C bus lock, leading to bus collisions and potential hardware hangs. Move the mctp_i2c_get_tx_flow_state() call to after the length sanity check to ensure we only transition the flow state if we are actually going to proceed with the transmission and locking. Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Signed-off-by: William A. Kennington III <william@wkennington.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-28efi: pstore: Drop efivar lock when efi_pstore_open() returns with an errorThomas Huth1-1/+3
If kzalloc fails, the function returns -ENOMEM without calling efivar_unlock(). Since open() returned an error, the calling site in pstore_get_backend_records() won't call the close() function, so the lock is never released. Thus drop the lock in case of errors here. Fixes: 859748255b434 ("efi: pstore: Omit efivars caching EFI varstore access layer") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-04-28efivarfs: use QSTR() in efivarfs_alloc_dentryThorsten Blum1-4/+1
Use QSTR() and drop strlen() in efivarfs_alloc_dentry(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-04-28net: stmmac: Prevent NULL deref when RX memory exhaustedSam Edwards1-7/+12
The CPU receives frames from the MAC through conventional DMA: the CPU allocates buffers for the MAC, then the MAC fills them and returns ownership to the CPU. For each hardware RX queue, the CPU and MAC coordinate through a shared ring array of DMA descriptors: one descriptor per DMA buffer. Each descriptor includes the buffer's physical address and a status flag ("OWN") indicating which side owns the buffer: OWN=0 for CPU, OWN=1 for MAC. The CPU is only allowed to set the flag and the MAC is only allowed to clear it, and both must move through the ring in sequence: thus the ring is used for both "submissions" and "completions." In the stmmac driver, stmmac_rx() bookmarks its position in the ring with the `cur_rx` index. The main receive loop in that function checks for rx_descs[cur_rx].own=0, gives the corresponding buffer to the network stack (NULLing the pointer), and increments `cur_rx` modulo the ring size. After the loop exits, stmmac_rx_refill(), which bookmarks its position with `dirty_rx`, allocates fresh buffers and rearms the descriptors (setting OWN=1). If it fails any allocation, it simply stops early (leaving OWN=0) and will retry where it left off when next called. This means descriptors have a three-stage lifecycle (terms my own): - `empty` (OWN=1, buffer valid) - `full` (OWN=0, buffer valid and populated) - `dirty` (OWN=0, buffer NULL) But because stmmac_rx() only checks OWN, it confuses `full`/`dirty`. In the past (see 'Fixes:'), there was a bug where the loop could cycle `cur_rx` all the way back to the first descriptor it dirtied, resulting in a NULL dereference when mistaken for `full`. The aforementioned commit resolved that *specific* failure by capping the loop's iteration limit at `dma_rx_size - 1`, but this is only a partial fix: if the previous stmmac_rx_refill() didn't complete, then there are leftover `dirty` descriptors that the loop might encounter without needing to cycle fully around. The current code therefore panics (see 'Closes:') when stmmac_rx_refill() is memory-starved long enough for `cur_rx` to catch up to `dirty_rx`. Fix this by explicitly checking, before advancing `cur_rx`, if the next entry is dirty; exit the loop if so. This prevents processing of the final, used descriptor until stmmac_rx_refill() succeeds, but fully prevents the `cur_rx == dirty_rx` ambiguity as the previous bugfix intended: so remove the clamp as well. Since stmmac_rx_zc() is a copy-paste-and-tweak of stmmac_rx() and the code structure is identical, any fix to stmmac_rx() will also need a corresponding fix for stmmac_rx_zc(). Therefore, apply the same check there. In stmmac_rx() (not stmmac_rx_zc()), a related bug remains: after the MAC sets OWN=0 on the final descriptor, it will be unable to send any further DMA-complete IRQs until it's given more `empty` descriptors. Currently, the driver simply *hopes* that the next stmmac_rx_refill() succeeds, risking an indefinite stall of the receive process if not. But this is not a regression, so it can be addressed in a future change. Fixes: b6cb4541853c7 ("net: stmmac: avoid rx queue overrun") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221010 Cc: stable@vger.kernel.org Suggested-by: Russell King <linux@armlinux.org.uk> Signed-off-by: Sam Edwards <CFSworks@gmail.com> Link: https://patch.msgid.link/20260422044503.5349-1-CFSworks@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-28pinctrl: qcom: Fix GPIO to PDC wake irq map for qcs615Maulik Shah1-3/+3
PDC interrupts 122-125 were meant for ibi_i3c wakeup but qcs615 do not support i3c. GPIOs 39,51,88 and 89 are also connected to different PDC pin to support non-ibi wakeup. Update the wakeirq map to reflect same. Fixes: b698f36a9d40 ("pinctrl: qcom: add the tlmm driver for QCS615 platform") Signed-off-by: Maulik Shah <maulik.shah@oss.qualcomm.com> Signed-off-by: Navya Malempati <navya.malempati@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Linus Walleij <linusw@kernel.org>
2026-04-28pinctrl: meson: amlogic-a4: fix deadlock issueXianwei Zhao1-3/+3
Accessing the pinconf-pins sysfs node may deadlock. pinconf_pins_show() holds pctldev->mutex, and the platform driver calls pinctrl_find_gpio_range_from_pin(), which tries to acquire the same mutex again, leading to a deadlock. Use pinctrl_find_gpio_range_from_pin_nolock() to fix this issue. Fixes: 6e9be3abb78c ("pinctrl: Add driver support for Amlogic SoCs") Signed-off-by: Xianwei Zhao <xianwei.zhao@amlogic.com> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Linus Walleij <linusw@kernel.org>
2026-04-28pinctrl: qcom: eliza: Fix QDSS trace clock/control pingroup namesAlexander Koskovich1-4/+4
Fix a few typos for these in their respective pingroups, the groups already exist they just weren't referenced. Signed-off-by: Alexander Koskovich <akoskovich@pm.me> Fixes: 6f26989e15fb ("pinctrl: qcom: Add Eliza pinctrl driver") Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Linus Walleij <linusw@kernel.org>
2026-04-28net: ipv6: fix NOREF dst use in seg6 and rpl lwtunnelsAndrea Mayer2-0/+18
seg6_input_core() and rpl_input() call ip6_route_input() which sets a NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking dst_hold() unconditionally. On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can release the underlying pcpu_rt between the lookup and the caching through a concurrent FIB lookup on a shared nexthop. Simplified race sequence: ksoftirqd/X higher-prio task (same CPU X) ----------- -------------------------------- seg6_input_core(,skb)/rpl_input(skb) dst_cache_get() -> miss ip6_route_input(skb) -> ip6_pol_route(,skb,flags) [RT6_LOOKUP_F_DST_NOREF in flags] -> FIB lookup resolves fib6_nh [nhid=N route] -> rt6_make_pcpu_route() [creates pcpu_rt, refcount=1] pcpu_rt->sernum = fib6_sernum [fib6_sernum=W] -> cmpxchg(fib6_nh.rt6i_pcpu, NULL, pcpu_rt) [slot was empty, store succeeds] -> skb_dst_set_noref(skb, dst) [dst is pcpu_rt, refcount still 1] rt_genid_bump_ipv6() -> bumps fib6_sernum [fib6_sernum from W to Z] ip6_route_output() -> ip6_pol_route() -> FIB lookup resolves fib6_nh [nhid=N] -> rt6_get_pcpu_route() pcpu_rt->sernum != fib6_sernum [W <> Z, stale] -> prev = xchg(rt6i_pcpu, NULL) -> dst_release(prev) [prev is pcpu_rt, refcount 1->0, dead] dst = skb_dst(skb) [dst is the dead pcpu_rt] dst_cache_set_ip6(dst) -> dst_hold() on dead dst -> WARN / use-after-free For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release the pcpu_rt. Shared nexthop objects provide such a path, as two routes pointing to the same nhid share the same fib6_nh and its rt6i_pcpu entry. Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after ip6_route_input() to force the NOREF dst into a refcounted one before caching. The output path is not affected as ip6_route_output() already returns a refcounted dst. Fixes: af4a2209b134 ("ipv6: sr: use dst_cache in seg6_input") Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-28drm/udl: Increase GET_URB_TIMEOUTShixiong Ou2-3/+5
[WHY] A situation has occurred where udl_handle_damage() executed successfully and the kernel log appears normal, but the display fails to show any output. This is because the call to udl_get_urb() in udl_crtc_helper_atomic_enable() failed without generating any error message. [HOW] 1. Increase timeout of getting urb. 2. Add error messages when calling udl_get_urb() failed in udl_crtc_helper_atomic_enable(). Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 5320918b9a87 ("drm/udl: initial UDL driver (v4)") Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Cc: <stable@vger.kernel.org> # v3.4+ Link: https://patch.msgid.link/20260424124427.657-1-oushixiong1025@163.com
2026-04-28ASoC: Intel: bytcr_wm5102: Fix MCLK leak on platform_clock_control errorCássio Gabriel1-0/+1
If byt_wm5102_prepare_and_enable_pll1() fails in the SND_SOC_DAPM_EVENT_ON() path, platform_clock_control() returns after clk_prepare_enable(priv->mclk) without disabling the clock again. This leaks an MCLK enable reference on failed power-up attempts. Add the missing clk_disable_unprepare() on the error path, matching the unwind used by the other Intel platform_clock_control() implementations. Fixes: 9a87fc1e0619 ("ASoC: Intel: bytcr_wm5102: Add machine driver for BYT/WM5102") Cc: stable@vger.kernel.org Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com> Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com> Reviewed-by: Hans de Goede <johannes.goede@oss.qualcomm.com> Link: https://patch.msgid.link/20260427-bytcr-wm5102-mclk-leak-v1-1-02b96d08e99c@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-04-28Merge tag 'ath-current-20260427' of git://git.kernel.org/pub/scm/linux/kernel/git/ath/athJohannes Berg5-31/+54
Jeff Johnson says: ================== ath.git update for v7.1-rc2 Fix an ath10k build dependency issue along with a few ath12k bugs. ================== Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-04-28wifi: rsi: fix kthread lifetime race between self-exit and external-stopJeongjun Park1-3/+2
RSI driver use both self-exit(kthread_complete_and_exit) and external-stop (kthread_stop) when killing a kthread. Generally, kthread_stop() is called first, and in this case, no particular issues occur. However, in rare instances where kthread_complete_and_exit() is called first and then kthread_stop() is called, a UAF occurs because the kthread object, which has already exited and been freed, is accessed again. Therefore, to prevent this with minimal modification, you must remove kthread_stop() and change the code to wait until the self-exit operation is completed. Cc: <stable@vger.kernel.org> Reported-by: syzbot+5de83f57cd8531f55596@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69e5d03b.a00a0220.1bd0ca.0064.GAE@google.com/ Fixes: 4c62764d0fc2 ("rsi: improve kernel thread handling to fix kernel panic") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Link: https://patch.msgid.link/20260422173846.37640-1-aha310510@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-04-28sched/fair: Clear rel_deadline when initializing forked entitiesZicheng Qu1-0/+1
A yield-triggered crash can happen when a newly forked sched_entity enters the fair class with se->rel_deadline unexpectedly set. The failing sequence is: 1. A task is forked while se->rel_deadline is still set. 2. __sched_fork() initializes vruntime, vlag and other sched_entity state, but does not clear rel_deadline. 3. On the first enqueue, enqueue_entity() calls place_entity(). 4. Because se->rel_deadline is set, place_entity() treats se->deadline as a relative deadline and converts it to an absolute deadline by adding the current vruntime. 5. However, the forked entity's deadline is not a valid inherited relative deadline for this new scheduling instance, so the conversion produces an abnormally large deadline. 6. If the task later calls sched_yield(), yield_task_fair() advances se->vruntime to se->deadline. 7. The inflated vruntime is then used by the following enqueue path, where the vruntime-derived key can overflow when multiplied by the entity weight. 8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility calculation, and can eventually make all entities appear ineligible. pick_next_entity() may then return NULL unexpectedly, leading to a later NULL dereference. A captured trace shows the effect clearly. Before yield, the entity's vruntime was around: 9834017729983308 After yield_task_fair() executed: se->vruntime = se->deadline the vruntime jumped to: 19668035460670230 and the deadline was later advanced further to: 19668035463470230 This shows that the deadline had already become abnormally large before yield_task_fair() copied it into vruntime. rel_deadline is only meaningful when se->deadline really carries a relative deadline that still needs to be placed against vruntime. A freshly forked sched_entity should not inherit or retain this state. Clear se->rel_deadline in __sched_fork(), together with the other sched_entity runtime state, so that the first enqueue does not interpret the new entity's deadline as a stale relative deadline. Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'") Analyzed-by: Hui Tang <tanghui20@huawei.com> Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
2026-04-28sched/fair: Fix wakeup_preempt_fair() vs delayed dequeueVincent Guittot1-13/+14
Similar to how pick_next_entity() must dequeue delayed entities, so too must wakeup_preempt_fair(). Any delayed task being found means it is eligible and hence past the 0-lag point, ready for removal. Worse, by not removing delayed entities from consideration, it can skew the preemption decision, with the end result that a short slice wakeup will not result in a preemption. tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples | 22559 22595 22683 Average (us) | 157 64( 59%) 59( 8%) Median (P50) (us) | 57 57( 0%) 58(- 2%) 90th Percentile (us) | 64 60( 6%) 60( 0%) 99th Percentile (us) | 2407 67( 97%) 67( 0%) 99.9th Percentile (us) | 3400 2288( 33%) 727( 68%) Maximum (us) | 5037 9252(-84%) 7461( 19%) Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org
2026-04-28sched/fair: Fix the negative lag increase fixPeter Zijlstra1-5/+10
Vincent reported that my rework of his original patch lost a little something. Specifically it got the return value wrong; it should not compare against the old se->vlag, but rather against the current value. Since the thing that matters is if the effective vruntime of an entity is affected and the thing needs repositioning or not. Fixes: 059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue") Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
2026-04-28ALSA: usb-audio: Avoid potential endless loop in convert_chmap_v3()Takashi Iwai1-0/+2
The convert_chmap_v3() has a loop with its increment size of cs_desc->wLength, but we forgot to validate cs_desc->wLength itself, which may lead to potential endless loop by a malformed descriptor. Add a proper size check to abort the loop for plugging the hole. Fixes: ecfd41166b72 ("ALSA: usb-audio: Validate UAC3 cluster segment descriptors") Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20260427152224.15276-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2026-04-28ALSA: usb-audio: Fix potential leak of pd at parsing UAC3 streamsTakashi Iwai3-38/+25
At parsing UAC3 streams, we allocate a PD object at each time, and either assign or free it. But there is a case where the PD object may be leaked; namely, in __snd_usb_parse_audio_interface() loop, when an audioformat shares the same endpoint with others, it's put to a link and returns from snd_usb_add_audio_stream(), but the PD is forgotten afterwards. Overall, the treatment of PD object in the parser code is a bit flaky, and we should be more careful about the object ownership. This patch tries to fix the above case and improve the code a bit. The pd object is now managed with the auto-cleanup in the loop, and the ownership is updated when the pd object gets assigned to the stream, which guarantees the release of the leftover object. Fixes: 7edf3b5e6a45 ("ALSA: usb-audio: AudioStreaming Power Domain parsing") Link: https://patch.msgid.link/20260427151508.12544-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2026-04-28ALSA: caiaq: Don't abort when no input device is availableTakashi Iwai2-2/+2
The previous fix to handle the error from setup_card() caused a regression for the models that have no dedicated input device; snd_usb_caiaq_input_init() just returns -EINVAL, and we treat it as a fatal error although it should be ignored. As a regression fix, change the error code to -ENODEV, and ignore this error in the callee, to continue probing. Fixes: 28abd224db4a ("ALSA: caiaq: Handle probe errors properly") Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=221423 Link: https://patch.msgid.link/20260427145642.6637-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2026-04-28ALSA: caiaq: Fix potentially leftover ep1_in_urb at error pathTakashi Iwai1-1/+1
The previous fix for handling the error from setup_card() missed that an internal URB cdev->ep1_in_urb might have been already submitted beforehand. In the normal case, this URB gets killed at the disconnection, but in the error path, we didn't do it, hence there can be a potential leak. Fix it in the error path for setup_card(), too. Fixes: 28abd224db4a ("ALSA: caiaq: Handle probe errors properly") Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20260427123819.890185-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2026-04-28xfrm: Don't clobber inner headers when already setCosmin Ratiu1-6/+14
On VXLAN over IPsec egress, xfrm{4,6}_transport_output() blindly overwrite inner_transport_header (== the inner TCP header saved in VXLAN iptunnel_handle_offloads() -> skb_reset_inner_headers()) with the current transport_header (== the VXLAN outer UDP header set by udp_tunnel_xmit_skb()). This was a latent bug, harmless until commit [1] added a doff validation check in qdisc_pkt_len_segs_init() for encapsulated GSO packets. With the wrong inner_transport_header set by xfrm, qdisc_pkt_len_segs_init() interprets inner_transport_header as a TCP header, reads doff=0 from the upper byte of the VNI and drops the packet with DROP_REASON_SKB_BAD_GSO. Besides the use in GSO to determine the header size of segmented packets, inner_transport_header might be used by drivers to set up inner checksum offloading by pointing the HW to the inner transport header. A quick browse through available drivers shows that mlx5 uses skb->csum_start specifically for this scenario, while others either don't support VXLAN over IPsec crypto offload (ixgbe) or the HW is capable of parsing the packets itself (nfp, Chelsio). But in all cases, it is more correct to let the inner_transport_header point to the innermost header instead of overwriting it in xfrm. So fix this by guarding all four inner header save sites in xfrm_output.c (xfrm{4,6}_transport_output, xfrm{4,6}_tunnel_encap_add) with a check for skb->inner_protocol. When inner_protocol is set, a tunnel layer (VXLAN, Geneve, GRE, etc.) has already saved the correct inner header offsets and they must not be overwritten. When inner_protocol is zero, no prior tunnel encapsulation exists and xfrm must save the inner headers itself. The tunnel mode checks are only added for completion, since they aren't strictly required, as xfrm_output() forces software GSO in tunnel mode before encap. This makes the previously added test pass: # ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py TAP version 13 1..4 ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4 ok 2 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v6 ok 3 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v4 ok 4 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v6 # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 [1] commit 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()") Fixes: f1bd7d659ef0 ("xfrm: Add encapsulation header offsets while SKB is not encrypted") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-04-28tools/selftests: Add a VXLAN+IPsec traffic testCosmin Ratiu3-0/+210
There are VXLAN tests and IPsec tests, but there is no test that combines the two protocols and exercises the tunnel-over-ipsec code paths. Fix that by adding a traffic test with VXLAN and IPsec using crypto offload. This is runnable on HW which supports ESP offload (so no nsim unfortunately). Traffic is done with iperf3 and the test validates that there are no packet drops and iperf3 can get to at least 100 Mbps (a very conservative value on today's crypto offload HW, as it can typically reach multi-Gbps rates). Ran right now, the test fails due to a recently exposed bug in xfrm, which will be fixed in the next patch: # ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py TAP version 13 1..4 # Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py, # line 161, in test_vxlan_ipsec_crypto_offload: # Check| ksft_eq(drops_after - drops_before, 0, # Check failed 189 != 0 TX drops during VXLAN+IPsec # Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py, # line 163, in test_vxlan_ipsec_crypto_offload: # Check| ksft_ge(bw_gbps, 0.1, # Check failed 0.0015058278404812596 < 0.1 Minimum 100Mbps over # VXLAN+IPsec not ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4 ... Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>