path: root/tools/perf/scripts/python/export-to-sqlite.py
2024-12-19  iommu/tegra241-cmdqv: Read SMMU IDR1.CMDQS instead of hardcoding  (Nicolin Chen, 1 file changed, -3/+5)
The hardware limitation "max=19" actually comes from SMMU Command Queue. So, it'd be more natural for tegra241-cmdqv driver to read it out rather than hardcoding it itself. This is not an issue yet for a kernel on a baremetal system, but a guest kernel setting the queue base/size in form of IPA/gPA might result in a noncontiguous queue in the physical address space, if underlying physical pages backing up the guest RAM aren't contiguous entirely: e.g. 2MB-page backed guest RAM cannot guarantee a contiguous queue if it is 8MB (capped to VCMDQ_LOG2SIZE_MAX=19). This might lead to command errors when HW does linear-read from a noncontiguous queue memory. Adding this extra IDR1.CMDQS cap (in the guest kernel) allows VMM to set SMMU's IDR1.CMDQS=17 for the case mentioned above, so a guest-level queue will be capped to maximum 2MB, ensuring a contiguous queue memory. Fixes: a3799717b881 ("iommu/tegra241-cmdqv: Fix alignment failure at max_n_shift") Reported-by: Ian Kalinowski <ikalinowski@nvidia.com> Cc: stable@vger.kernel.org Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/20241219051421.1850267-1-nicolinc@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
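As an illustration of the read-instead-of-hardcode approach, here is a minimal sketch in the style of the mainline SMMUv3 driver. The register offset, field macro, and the returned cap are assumptions for illustration, not code taken from this patch:

  static u32 tegra241_cmdqv_max_log2size(struct arm_smmu_device *smmu)
  {
          /* SMMU_IDR1.CMDQS advertises log2 of the maximum command queue size */
          u32 idr1 = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
          u32 cmdqs = FIELD_GET(IDR1_CMDQS, idr1);

          /* never build a VCMDQ larger than the (possibly virtual) SMMU allows */
          return min_t(u32, cmdqs, VCMDQ_LOG2SIZE_MAX);
  }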
2024-12-19  iommu/io-pgtable-arm: Fix cfg reading in arm_lpae_concat_mandatory()  (Mostafa Saleh, 1 file changed, -5/+6)
The newly introduced arm_lpae_concat_mandatory() function reads the ias/oas fields from the 'io_pgtable_cfg' copy embedded inside the 'arm_lpae_io_pgtable' structure. However, this copy is not set until later in alloc_io_pgtable_ops() after the alloc() function has been called. Use the address sizes passed in the 'io_pgtable_cfg' structure when deciding whether or not to concatenate the PGD. Fixes: 4dcac8407fe1 ("iommu/io-pgtable-arm: Fix stage-2 concatenation with 16K") Signed-off-by: Mostafa Saleh <smostafa@google.com> Link: https://lore.kernel.org/r/20241215200412.561400-1-smostafa@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/io-pgtable-arm: Add coverage for different OAS in selftest  (Mostafa Saleh, 1 file changed, -12/+15)
Run selftests with different OAS values instead of hardcoding it to 48 bits. We always keep OAS >= IAS to make the config valid for stage-2. This can be further improved, if we split IAS/OAS configuration for stage-1 and stage-2 (to use input sizes compatible with VA_BITS as SMMUv3 does, or IAS > OAS which is valid for stage-1). However, that adds more complexity, and the current change improves coverage and makes it possible to test all concatenation cases. Signed-off-by: Mostafa Saleh <smostafa@google.com> Link: https://lore.kernel.org/r/20241202140604.422235-3-smostafa@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/io-pgtable-arm: Fix stage-2 concatenation with 16K  (Mostafa Saleh, 1 file changed, -12/+33)
At the moment, io-pgtable-arm uses concatenation only if it is possible at level 0, which misses a case where concatenation is mandatory at level 1 according to R_SRKBC in Arm spec DDI0487 K.a. Also, that means concatenation can be used when not mandated, contradicting the comment on the code. However, these cases can only happen if the SMMUv3 driver is changed to use ias != oas for stage-2. This patch re-writes the code to use concatenation only if mandatory, fixing the missing case for level-1 and granule 16K with PA = 40 bits. Signed-off-by: Mostafa Saleh <smostafa@google.com> Link: https://lore.kernel.org/r/20241202140604.422235-2-smostafa@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Remove domain_alloc_paging()  (Jason Gunthorpe, 1 file changed, -31/+0)
arm_smmu_domain_alloc_paging_flags() with a flags = 0 now does the same thing as arm_smmu_domain_alloc_paging(), remove arm_smmu_domain_alloc_paging(). Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Make domain_alloc_paging_flags() directly determine the S1/S2  (Jason Gunthorpe, 1 file changed, -12/+30)
The selection of S1/S2 is a bit indirect today, make domain_alloc_paging_flags() directly decode the flags and select the correct S1/S2 type. Directly reject flag combinations the HW doesn't support when processing the flags. Fix missing rejection of some flag combinations that are not supported today (ie NEST_PARENT | DIRTY_TRACKING) by using a switch statement to list out exactly the combinations that are currently supported. Move the determination of the stage out of arm_smmu_domain_finalise() and into both callers. As today the default stage is S1 if supported in HW. This makes arm_smmu_domain_alloc_paging_flags() self contained and no longer calling arm_smmu_domain_alloc_paging(). Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Remove arm_smmu_domain_finalise() during attach  (Jason Gunthorpe, 2 files changed, -29/+9)
Domains are now always finalized during allocation because the core code no longer permits a NULL dev argument to domain_alloc_paging/_flags(). Remove the late finalize during attach that supported domains that were not fully initialized. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v1-0bb8d5313a27+27b-smmuv3_paging_flags_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Document SVA interaction with new pagetable features  (Robin Murphy, 1 file changed, -1/+14)
Process pagetables may now be using new permission-indirection-based features which an SMMU may not understand when given such a table for SVA. Although SMMUv3.4 does add its own S1PIE feature, realistically we're still going to have to cope with feature mismatches between CPUs and SMMUs, so let's start simple and essentially just document the expectations for what falls out as-is. Although it seems unlikely for SVA applications to also depend on memory-hardening features, or vice-versa, the relative lifecycles make it tricky to enforce mutual exclusivity. Thankfully our PIE index allocation makes it relatively benign for an SMMU to keep interpreting them as direct permissions, the only real implication is that an SVA application cannot harden itself against its own devices with these features. Thus, inform the user about that just in case they have other expectations. Also we don't (yet) support LPA2, so deny SVA entirely if we're going to misunderstand the pagetable format altogether. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/68a37b00a720f0827cac0e4f40e4d3a688924054.1733406275.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu: Manage driver probe deferral better  (Robin Murphy, 2 files changed, -3/+1)
Since iommu_fwspec_init() absorbed the basic driver probe deferral check to wait for an IOMMU to register, we may as well handle the probe deferral timeout there as well. The current inconsistency of callers results in client devices deferring forever on an arm64 ACPI system where an SMMU has failed its own driver probe. Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/41fa59f156ef8d196d08fa75c4901e6d4b12e6c4.1733406914.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Clean up more on probe failure  (Robin Murphy, 1 file changed, -5/+12)
kmemleak noticed that the iopf queue allocated deep down within arm_smmu_init_structures() can be leaked by a subsequent error return from arm_smmu_device_probe(). Furthermore, after arm_smmu_device_reset() we will also leave the SMMU enabled with an empty Stream Table, silently blocking all DMA. This proves rather annoying for debugging said probe failure, so let's handle it a bit better by putting the SMMU back into (more or less) the same state as if it hadn't probed at all. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/5137901958471cf67f2fad5c2229f8a8f1ae901a.1733406914.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu: Retire probe deferral workaround  (Robin Murphy, 1 file changed, -11/+0)
This reverts commit 229e6ee43d2a160a1592b83aad620d6027084aad. Now that the fundamental ordering issue between arm_smmu_get_by_fwnode() and iommu_device_register() is resolved, the race condition for client probe no longer exists either, so retire the specific workaround. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/4167c5dfa052d4c8bb780f0a30af63dcfc4ce6c1.1733406914.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu: Make instance lookup robust  (Robin Murphy, 2 files changed, -18/+15)
Relying on the driver list was a cute idea for minimising the scope of our SMMU device lookups, however it turns out to have a subtle flaw. The SMMU device only gets added to that list after arm_smmu_device_probe() returns success, so there's actually no way the iommu_device_register() call from there could ever work as intended, even if it wasn't already hampered by the fwspec setup not happening early enough. Switch both arm_smmu_get_by_fwnode() implementations to use a platform bus lookup instead, which *will* reliably work. Also make sure that we don't register SMMUv2 instances until we've fully initialised them, to avoid similar consequences of the lookup now finding a device with no drvdata. Moving the error returns is also a perfect excuse to streamline them with dev_err_probe() in the process. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/6d7ce1dc31873abdb75c895fb8bd2097cce098b4.1733406914.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmuv3: Update comments about ATS and bypass  (Jason Gunthorpe, 1 file changed, -5/+12)
The SMMUv3 spec has a note that BYPASS and ATS don't work together under the STE EATS field definition. However there is another section "13.6.4 Full ATS skipping stage 1" that explains under certain conditions BYPASS and ATS do work together if the STE is using S1DSS to select BYPASS and the CD table has the possibility for a substream. When these comments were written the understanding was that all forms of BYPASS just didn't work and this was to be a future problem to solve. It turns out that ATS and IDENTITY will always work just fine: - If STE.Config = BYPASS then the PCI ATS is disabled - If a PASID domain is attached then S1DSS = BYPASS and ATS will be enabled. This meets the requirements of 13.6.4 to automatically generate 1:1 ATS replies on the RID. Update the comments to reflect this. Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v1-f27174f44f39+27a33-smmuv3_ats_note_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Log better event records  (Pranjal Shrivastava, 2 files changed, -13/+115)
Currently, the driver dumps the raw hex for a received event record. Improve this by leveraging `struct arm_smmu_event` for event fields and log human-readable event records with meaningful information. Signed-off-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/20241203184906.2264528-3-praan@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-v3: Introduce struct arm_smmu_event  (Pranjal Shrivastava, 2 files changed, -19/+54)
Introduce `struct arm_smmu_event` to represent event records. Parse out relevant fields from raw event records for ease and use the new `struct arm_smmu_event` instead. Signed-off-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/20241203184906.2264528-2-praan@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-09  iommu/arm-smmu-qcom: add sdm670 adreno iommu compatible  (Richard Acayan, 1 file changed, -0/+1)
Add the compatible for the separate IOMMU on SDM670 for the Adreno GPU. This IOMMU has the compatible strings: "qcom,sdm670-smmu-v2", "qcom,adreno-smmu", "qcom,smmu-v2" While the SMMU 500 doesn't need an entry for this specific SoC, the SMMU v2 compatible should have its own entry, as the fallback entry in arm-smmu.c handles "qcom,smmu-v2" without per-process page table support unless there is an entry here. This entry can't be the "qcom,adreno-smmu" compatible because dedicated GPU IOMMUs can also be SMMU 500 with different handling. Signed-off-by: Richard Acayan <mailingradian@gmail.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20241114004713.42404-6-mailingradian@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2024-12-08  Linux 6.13-rc2  (Linus Torvalds, 1 file changed, -1/+1)
2024-12-08  kbuild: deb-pkg: fix build error with O=  (Masahiro Yamada, 1 file changed, -2/+2)
Since commit 13b25489b6f8 ("kbuild: change working directory to external module directory with M="), the Debian package build fails if a relative path is specified with the O= option.

  $ make O=build bindeb-pkg
  [ snip ]
  dpkg-deb: building package 'linux-image-6.13.0-rc1' in '../linux-image-6.13.0-rc1_6.13.0-rc1-6_amd64.deb'.
  Rebuilding host programs with x86_64-linux-gnu-gcc...
  make[6]: Entering directory '/home/masahiro/linux/build'
  /home/masahiro/linux/Makefile:190: *** specified kernel directory "build" does not exist. Stop.

This occurs because the sub_make_done flag is cleared, even though the working directory is already in the output directory. Passing KBUILD_OUTPUT=. resolves the issue. Fixes: 13b25489b6f8 ("kbuild: change working directory to external module directory with M=") Reported-by: Charlie Jenkins <charlie@rivosinc.com> Closes: https://lore.kernel.org/all/Z1DnP-GJcfseyrM3@ghost/ Tested-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-12-08  modpost: Add .irqentry.text to OTHER_SECTIONS  (Thomas Gleixner, 1 file changed, -1/+1)
The compiler can fully inline the actual handler function of an interrupt entry into the .irqentry.text entry point. If such a function contains an access which has an exception table entry, modpost complains about a section mismatch: WARNING: vmlinux.o(__ex_table+0x447c): Section mismatch in reference ... The relocation at __ex_table+0x447c references section ".irqentry.text" which is not in the list of authorized sections. Add .irqentry.text to OTHER_SECTIONS to cure the issue. Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # needed for linux-5.4-y Link: https://lore.kernel.org/all/20241128111844.GE10431@google.com/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-12-06  x86/CPU/AMD: WARN when setting EFER.AUTOIBRS if and only if the WRMSR fails  (Sean Christopherson, 1 file changed, -1/+1)
When ensuring EFER.AUTOIBRS is set, WARN only on a negative return code from msr_set_bit(), as '1' is used to indicate the WRMSR was successful ('0' indicates the MSR bit was already set). Fixes: 8cc68c9c9e92 ("x86/CPU/AMD: Make sure EFER[AIBRSE] is set") Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/Z1MkNofJjt7Oq0G6@google.com Closes: https://lore.kernel.org/all/20241205220604.GA2054199@thelio-3990X
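For reference, a minimal sketch of the return-value convention being relied on here; the snippet is illustrative rather than the literal hunk:

  int ret = msr_set_bit(MSR_EFER, _EFER_AUTOIBRS);

  /*
   * msr_set_bit() returns a negative errno if the RDMSR/WRMSR faulted,
   * 0 if the bit was already set (nothing written), and 1 if the bit
   * was newly set, so only the first case deserves a WARN.
   */
  WARN_ON_ONCE(ret < 0);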
2024-12-06  selftests/bpf: Add more test cases for LPM trie  (Hou Tao, 1 file changed, -0/+395)
Add more test cases for LPM trie in test_maps:

1) test_lpm_trie_update_flags
   It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks whether the return value of the update operation is expected.

2) test_lpm_trie_update_full_maps
   It tests the update operations on a full LPM trie map. Adding a new node will fail and overwriting the value of an existing node will succeed.

3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints
   These two test cases test whether the iteration through get_next_key is sorted and expected. They delete the minimal key after each iteration and check whether the next iteration returns the second minimal key. The only difference between the two is that the former saves strings in the LPM trie and the latter saves integers. Without the fix of get_next_key, these two cases fail as shown below:

     test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS
     test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1

Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-10-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  selftests/bpf: Move test_lpm_map.c to map_tests  (Hou Tao, 3 files changed, -9/+4)
Move test_lpm_map.c to map_tests/ to include LPM trie test cases in regular test_maps run. Most code remains unchanged, including the use of assert(). Only reduce n_lookups from 64K to 512, which decreases test_lpm_map runtime from 37s to 0.7s. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-9-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Use raw_spinlock_t for LPM trie  (Hou Tao, 1 file changed, -6/+6)
After switching from kmalloc() to the bpf memory allocator, there will be no blocking operation during the update of LPM trie. Therefore, change trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in atomic context, even on RT kernels. The max value of prefixlen is 2048. Therefore, update or deletion operations will find the target after at most 2048 comparisons. Constructing a test case which updates an element after 2048 comparisons under a 8 CPU VM, and the average time and the maximal time for such update operation is about 210us and 900us. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
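A minimal sketch of the lock-type change described above; variable names follow kernel/bpf/lpm_trie.c but the snippet is an illustration, not the exact diff:

  /* trie->lock is now a raw_spinlock_t, so this stays legal in atomic
   * context, including on PREEMPT_RT where a sleeping spinlock_t would not */
  unsigned long irq_flags;

  raw_spin_lock_irqsave(&trie->lock, irq_flags);
  /* bounded work: at most max_prefixlen (2048) node comparisons */
  raw_spin_unlock_irqrestore(&trie->lock, irq_flags);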
2024-12-06  bpf: Switch to bpf mem allocator for LPM trie  (Hou Tao, 1 file changed, -23/+48)
Multiple syzbot warnings have been reported. These warnings are mainly about the lock order between trie->lock and kmalloc()'s internal lock. See report [1] as an example:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
  ------------------------------------------------------
  syz.3.2069/15008 is trying to acquire lock:
  ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...

  but task is already holding lock:
  ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&trie->lock){-.-.}-{2:2}:
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x3a/0x60
         trie_delete_elem+0xb0/0x820
         ___bpf_prog_run+0x3e51/0xabd0
         __bpf_prog_run32+0xc1/0x100
         bpf_dispatcher_nop_func
         ......
         bpf_trace_run2+0x231/0x590
         __bpf_trace_contention_end+0xca/0x110
         trace_contention_end.constprop.0+0xea/0x170
         __pv_queued_spin_lock_slowpath+0x28e/0xcc0
         pv_queued_spin_lock_slowpath
         queued_spin_lock_slowpath
         queued_spin_lock
         do_raw_spin_lock+0x210/0x2c0
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x42/0x60
         __put_partials+0xc3/0x170
         qlink_free
         qlist_free_all+0x4e/0x140
         kasan_quarantine_reduce+0x192/0x1e0
         __kasan_slab_alloc+0x69/0x90
         kasan_slab_alloc
         slab_post_alloc_hook
         slab_alloc_node
         kmem_cache_alloc_node_noprof+0x153/0x310
         __alloc_skb+0x2b1/0x380
         ......

  -> #0 (&n->list_lock){-.-.}-{2:2}:
         check_prev_add
         check_prevs_add
         validate_chain
         __lock_acquire+0x2478/0x3b30
         lock_acquire
         lock_acquire+0x1b1/0x560
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x3a/0x60
         get_partial_node.part.0+0x20/0x350
         get_partial_node
         get_partial
         ___slab_alloc+0x65b/0x1870
         __slab_alloc.constprop.0+0x56/0xb0
         __slab_alloc_node
         slab_alloc_node
         __do_kmalloc_node
         __kmalloc_node_noprof+0x35c/0x440
         kmalloc_node_noprof
         bpf_map_kmalloc_node+0x98/0x4a0
         lpm_trie_node_alloc
         trie_update_elem+0x1ef/0xe00
         bpf_map_update_value+0x2c1/0x6c0
         map_update_elem+0x623/0x910
         __sys_bpf+0x90c/0x49a0
         ...

  other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&trie->lock);
                                lock(&n->list_lock);
                                lock(&trie->lock);
   lock(&n->list_lock);

  *** DEADLOCK ***

  [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7

A bpf program attached to trace_contention_end() triggers after acquiring &n->list_lock. The program invokes trie_delete_elem(), which then acquires trie->lock. However, it is possible that another process is invoking trie_update_elem(). trie_update_elem() will acquire trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke get_partial_node() and try to acquire &n->list_lock (not necessarily the same lock object). Therefore, lockdep warns about the circular locking dependency.

Invoking kmalloc() before acquiring trie->lock could fix the warning. However, since BPF programs can be invoked from any context (e.g., through kprobe/tracepoint/fentry), there may still be lock ordering problems for internal locks in kmalloc() or trie->lock itself. To eliminate these potential lock ordering problems with kmalloc()'s internal locks, replace kmalloc()/kfree()/kfree_rcu() with equivalent BPF memory allocator APIs that can be invoked in any context. The lock ordering problems with trie->lock (e.g., reentrance) will be handled separately.

Three aspects of this change require explanation:

1. Intermediate and leaf nodes are allocated from the same allocator. Since the value size of LPM trie is usually small, using a single allocator reduces the memory overhead of the BPF memory allocator.

2. Leaf nodes are allocated before disabling IRQs. This handles cases where leaf_size is large (e.g., > 4KB - 8) and updates require intermediate node allocation. If leaf nodes were allocated in the IRQ-disabled region, the free objects in the BPF memory allocator would not be refilled in time and the intermediate node allocation may fail.

3. Paired migrate_{disable|enable}() calls for node alloc and free. The BPF memory allocator uses per-CPU structs internally, so these paired calls are necessary to guarantee correctness.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Fix exact match conditions in trie_get_next_key()  (Hou Tao, 1 file changed, -2/+2)
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify an exact match. However, this is incorrect because when the target key doesn't fully match the found node (e.g., node->prefixlen != matchlen), the two nodes may still have the same prefixlen. It returns the expected result when the passed key exists in the trie, but when a recently-deleted key or a nonexistent key is passed to trie_get_next_key(), it may skip keys and return an incorrect result. Fix it by using node->prefixlen == matchlen to identify exact matches. When the condition is true after the search, it also implies that node->prefixlen equals key->prefixlen; otherwise, the search would return NULL instead. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
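A sketch of the corrected exact-match test; longest_prefix_match() is the existing helper in kernel/bpf/lpm_trie.c, and the surrounding loop is abbreviated, so read this as an illustration:

  /* walk from the root, remembering how much of the key each node matched */
  matchlen = longest_prefix_match(trie, node, key);

  /*
   * An exact match means this node consumed the whole search key, i.e.
   * node->prefixlen == matchlen.  Comparing against key->prefixlen also
   * "succeeds" for an unrelated node that merely has the same prefix
   * length, which is what caused keys to be skipped.
   */
  if (node->prefixlen == matchlen) {
          found = node;
          break;
  }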
2024-12-06  bpf: Handle in-place update for full LPM trie correctly  (Hou Tao, 1 file changed, -23/+21)
When a LPM trie is full, in-place updates of existing elements incorrectly return -ENOSPC. Fix this by deferring the check of trie->n_entries. For new insertions, n_entries must not exceed max_entries. However, in-place updates are allowed even when the trie is full. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
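Roughly, the deferred capacity check looks like the following sketch; the exact_match flag and the label are illustrative rather than the literal patch:

  /* An in-place overwrite does not consume a new entry, so only a
   * genuinely new node may fail with -ENOSPC on a full trie. */
  if (!exact_match && trie->n_entries == trie->map.max_entries) {
          ret = -ENOSPC;
          goto out;
  }

  if (!exact_match)
          trie->n_entries++;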
2024-12-06  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie  (Hou Tao, 1 file changed, -3/+20)
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST flags. These flags can be specified by users and are relevant since LPM trie supports exact matches during update. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
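The flag semantics being added boil down to the following sketch; variable names are illustrative:

  /* BPF_NOEXIST: pure insert - the exact prefix must not be there yet.
   * BPF_EXIST:   pure update - the exact prefix must already be there.
   * BPF_ANY:     either case is fine. */
  if (flags == BPF_NOEXIST && exact_node)
          return -EEXIST;
  if (flags == BPF_EXIST && !exact_node)
          return -ENOENT;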
2024-12-06  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem  (Hou Tao, 1 file changed, -3/+1)
There is no need to call kfree(im_node) when updating element fails, because im_node must be NULL. Remove the unnecessary kfree() for im_node. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Remove unnecessary check when updating LPM trie  (Hou Tao, 1 file changed, -2/+1)
When "node->prefixlen == matchlen" is true, it means that the node is fully matched. If "node->prefixlen == key->prefixlen" is false, it means the prefix length of the key is greater than the prefix length of the node; otherwise, matchlen would not be equal to node->prefixlen. This also implies that the prefix length of the node must be less than max_prefixlen. Therefore, "node->prefixlen == trie->max_prefixlen" will always be false when the check "node->prefixlen == key->prefixlen" returns false. Remove this unnecessary comparison. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  blk-mq: move cpuhp callback registering out of q->sysfs_lock  (Ming Lei, 1 file changed, -11/+92)
Registering and unregistering cpuhp callback requires global cpu hotplug lock, which is used everywhere. Meantime q->sysfs_lock is used in block layer almost everywhere. It is easy to trigger lockdep warning[1] by connecting the two locks. Fix the warning by moving blk-mq's cpuhp callback registering out of q->sysfs_lock. Add one dedicated global lock for covering registering & unregistering hctx's cpuhp, and it is safe to do so because hctx is guaranteed to be live if our request_queue is live. [1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/ Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Reported-by: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
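A minimal sketch of the dedicated-lock idea; the mutex name and helper are illustrative, while the cpuhp states and hlist_node fields exist in blk-mq:

  static DEFINE_MUTEX(blk_mq_cpuhp_lock);        /* name is illustrative */

  static void blk_mq_hctx_register_cpuhp(struct blk_mq_hw_ctx *hctx)
  {
          /* taken instead of q->sysfs_lock, so the global cpu hotplug
           * lock is never nested inside sysfs_lock */
          mutex_lock(&blk_mq_cpuhp_lock);
          cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
                                           &hctx->cpuhp_online);
          cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD,
                                           &hctx->cpuhp_dead);
          mutex_unlock(&blk_mq_cpuhp_lock);
  }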
2024-12-06  blk-mq: register cpuhp callback after hctx is added to xarray table  (Ming Lei, 1 file changed, -8/+7)
We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the callback should be registered after this 'hctx' is added to xarray table. Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Cc: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-06  smb: client: fix potential race in cifs_put_tcon()  (Paulo Alcantara, 1 file changed, -3/+1)
dfs_cache_refresh() delayed worker could race with cifs_put_tcon(), so make sure to call list_replace_init() on @tcon->dfs_ses_list after kworker is cancelled or finished. Fixes: 4f42a8b54b5c ("smb: client: fix DFS interlink failover") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-06  smb3.1.1: fix posix mounts to older servers  (Steve French, 3 files changed, -5/+11)
Some servers which implement the SMB3.1.1 POSIX extensions did not set the file type in the mode in the infolevel 100 response. With the recent changes for checking the file type via the mode field, this can cause the root directory to be reported incorrectly and mounts (e.g. to ksmbd) to fail. Fixes: 6a832bc8bbb2 ("fs/smb/client: Implement new SMB3 POSIX type") Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Cc: Ralph Boehme <slow@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-06  x86/cacheinfo: Delete global num_cache_leaves  (Ricardo Neri, 1 file changed, -22/+21)
Linux remembers cpu_cacheinfo::num_leaves per CPU, but x86 initializes all CPUs from the same global "num_cache_leaves". This is erroneous on systems such as Meteor Lake, where each CPU has a distinct num_leaves value. Delete the global "num_cache_leaves" and initialize num_leaves on each CPU. init_cache_level() no longer needs to set num_leaves. Also, it never had to set num_levels, as that is unnecessary on x86. Keep checking for zero cache leaves; such a condition indicates a bug. [ bp: Cleanup. ] Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org # 6.3+ Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
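Conceptually the change amounts to a sketch like the one below, storing the leaf count in the per-CPU cacheinfo instead of a shared global; the helper on the right-hand side is an assumption for illustration:

  /* per CPU instead of one global, so hybrid parts such as Meteor Lake
   * can report a different leaf count on different cores */
  struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);

  ci->num_leaves = find_num_cache_leaves(c);    /* helper name assumed */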
2024-12-06  cacheinfo: Allocate memory during CPU hotplug if not done from the primary CPU  (Ricardo Neri, 1 file changed, -6/+8)
Commit 5944ce092b97 ("arch_topology: Build cacheinfo from primary CPU") adds functionality that architectures can use to optionally allocate and build cacheinfo early during boot. Commit 6539cffa9495 ("cacheinfo: Add arch specific early level initializer") lets secondary CPUs correct (and reallocate memory for) cacheinfo data if needed. If the early build functionality is not used and cacheinfo does not need correction, memory for cacheinfo is never allocated. x86 does not use the early build functionality. Consequently, during the cacheinfo CPU hotplug callback, last_level_cache_is_valid() attempts to dereference a NULL pointer:

  BUG: kernel NULL pointer dereference, address: 0000000000000100
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 0 PID: 19 Comm: cpuhp/0 Not tainted 6.4.0-rc2 #1
  RIP: 0010:last_level_cache_is_valid+0x95/0xe0a

Allocate memory for cacheinfo during the cacheinfo CPU hotplug callback if not done earlier. Moreover, before determining the validity of the last-level cache info, ensure that it has been allocated. Simply checking for non-zero cache_leaves() is not sufficient, as some architectures (e.g., Intel processors) have non-zero cache_leaves() before allocation. Dereferencing NULL cacheinfo can occur in update_per_cpu_data_slice_size(). This function iterates over all online CPUs. However, a CPU may have come online recently, but its cacheinfo may not have been allocated yet. While here, remove an unnecessary indentation in allocate_cache_info(). [ bp: Massage. ] Fixes: 6539cffa9495 ("cacheinfo: Add arch specific early level initializer") Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Cc: stable@vger.kernel.org # 6.3+ Link: https://lore.kernel.org/r/20241128002247.26726-2-ricardo.neri-calderon@linux.intel.com
2024-12-06  x86/kexec: Restore GDT on return from ::preserve_context kexec  (David Woodhouse, 1 file changed, -0/+7)
The restore_processor_state() function explicitly states that "the asm code that gets us here will have restored a usable GDT". That wasn't true in the case of returning from a ::preserve_context kexec. Make it so. Without this, the kernel was depending on the called function to reload a GDT which is appropriate for the kernel before returning.

Test program:

  #include <unistd.h>
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <linux/kexec.h>
  #include <linux/reboot.h>
  #include <sys/reboot.h>
  #include <sys/syscall.h>

  int main (void)
  {
          struct kexec_segment segment = {};
          unsigned char purgatory[] = {
                  0x66, 0xba, 0xf8, 0x03,    // mov $0x3f8, %dx
                  0xb0, 0x42,                // mov $0x42, %al
                  0xee,                      // outb %al, (%dx)
                  0xc3,                      // ret
          };
          int ret;

          segment.buf = &purgatory;
          segment.bufsz = sizeof(purgatory);
          segment.mem = (void *)0x400000;
          segment.memsz = 0x1000;

          ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT);
          if (ret) {
                  perror("kexec_load");
                  exit(1);
          }

          ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC);
          if (ret) {
                  perror("kexec reboot");
                  exit(1);
          }
          printf("Success\n");
          return 0;
  }

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org
2024-12-05  iio: magnetometer: yas530: use signed integer type for clamp limits  (Jakob Hauser, 1 file changed, -6/+7)
In the function yas537_measure() there is a clamp_val() with limits of -BIT(13) and BIT(13) - 1. The input clamp value h[] is of type s32. The BIT() is of type unsigned long integer due to its define in include/vdso/bits.h. The lower limit -BIT(13) is recognized as -8192 but expressed as an unsigned long integer. The size of an unsigned long integer differs between 32-bit and 64-bit architectures. Converting this to type s32 may lead to undesired behavior. Additionally, in the calculation lines h[0], h[1] and h[2] the unsigned long integer divisor BIT(13) causes an unsigned division, shifting the left-hand side of the equation back and forth, possibly ending up in large positive values instead of negative values on 32-bit architectures. To solve those two issues, declare a signed integer with a value of BIT(13). There is another omission in the clamp line: clamp_val() returns a value and it's going nowhere here. Self-assign it to h[i] to make use of the clamp macro. Finally, replace clamp_val() macro by clamp() because after changing the limits from type unsigned long integer to signed integer it's fine that way. Link: https://lkml.kernel.org/r/11609b2243c295d65ab4d47e78c239d61ad6be75.1732914810.git.jahau@rocketmail.com Fixes: 65f79b501030 ("iio: magnetometer: yas530: Add YAS537 variant") Signed-off-by: Jakob Hauser <jahau@rocketmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202411230458.dhZwh3TT-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202411282222.oF0B4110-lkp@intel.com/ Reviewed-by: David Laight <david.laight@aculab.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
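Putting the three fixes together gives roughly the following sketch; variable names are illustrative, and the real code operates on the h[] intermediates of yas537_measure():

  s32 range = BIT(13);    /* 8192 as a signed value on both 32-bit and 64-bit */
  int i;

  /* signed limits and a used return value; the h[0..2] calculations
   * then divide by the signed 'range' so the division stays signed */
  for (i = 0; i < 3; i++)
          h[i] = clamp(h[i], -range, range - 1);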
2024-12-05  sched/numa: fix memory leak due to the overwritten vma->numab_state  (Adrian Huang, 1 file changed, -3/+9)
[Problem Description]

When running the hackbench program of LTP, the following memory leak is reported by kmemleak.

  # /opt/ltp/testcases/bin/hackbench 20 thread 1000
  Running with 20*40 (== 800) tasks.

  # dmesg | grep kmemleak
  ...
  kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff888cd8ca2c40 (size 64):
    comm "hackbench", pid 17142, jiffies 4299780315
    hex dump (first 32 bytes):
      ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00  .tI.....L.I.....
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc bff18fd4):
      [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
      [<ffffffff8113f715>] task_numa_work+0x725/0xa00
      [<ffffffff8110f878>] task_work_run+0x58/0x90
      [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
      [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
      [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ...

This issue can be consistently reproduced on three different servers: a 448-core server, a 256-core server, and a 192-core server.

[Root Cause]

Since multiple threads are created by the hackbench program (along with the command argument 'thread'), a shared vma might be accessed by two or more cores simultaneously. When two or more cores observe that vma->numab_state is NULL at the same time, vma->numab_state will be overwritten. Although the current code ensures that only one thread scans the VMAs in a single 'numa_scan_period', another thread might enter in the next 'numa_scan_period' before the numab_state allocation has completed [1]. Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000` cannot reproduce the issue. It is verified with 200+ test runs.

[Solution]

Use the cmpxchg atomic operation to ensure that only one thread executes the vma->numab_state assignment.

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reported-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
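The cmpxchg() approach amounts to the following sketch, close to but not literally the upstream hunk in task_numa_work():

  struct vma_numab_state *ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);

  if (!ptr)
          return;

  /* Only one thread may install its allocation; a racing thread sees a
   * non-NULL value from cmpxchg() and frees its own copy, so nothing
   * is leaked and nothing is overwritten. */
  if (cmpxchg(&vma->numab_state, NULL, ptr)) {
          kfree(ptr);
          return;
  }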
2024-12-05  mm/damon: fix order of arguments in damos_before_apply tracepoint  (Akinobu Mita, 1 file changed, -1/+1)
Since the order of the scheme_idx and target_idx arguments in TP_ARGS is reversed, they are stored in the trace record in reverse. Link: https://lkml.kernel.org/r/20241115182023.43118-1-sj@kernel.org Link: https://patch.msgid.link/20241112154828.40307-1-akinobu.mita@gmail.com Fixes: c603c630b509 ("mm/damon/core: add a tracepoint for damos apply target regions") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  lib: stackinit: hide never-taken branch from compiler  (Kees Cook, 1 file changed, -0/+1)
The never-taken branch leads to an invalid bounds condition, which is by design. To avoid the unwanted warning from the compiler, hide the variable from the optimizer.

  ../lib/stackinit_kunit.c: In function 'do_nothing_u16_zero':
  ../lib/stackinit_kunit.c:51:49: error: array subscript 1 is outside array bounds of 'u16[0]' {aka 'short unsigned int[]'} [-Werror=array-bounds=]
     51 | #define DO_NOTHING_RETURN_SCALAR(ptr)  *(ptr)
        |                                        ^~~~~~
  ../lib/stackinit_kunit.c:219:24: note: in expansion of macro 'DO_NOTHING_RETURN_SCALAR'
    219 |         return DO_NOTHING_RETURN_ ## which(ptr + 1); \
        |                ^~~~~~~~~~~~~~~~~~

Link: https://lkml.kernel.org/r/20241117113813.work.735-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()  (David Hildenbrand, 1 file changed, -2/+2)
The folio can get freed + buddy-merged + reallocated in the meantime, resulting in us calling folio_test_locked() possibly on a tail page. This makes const_folio_flags() trigger its VM_BUG_ON_PGFLAGS() when stumbling over the tail page. Could this result in other issues? Doesn't look like it. False positives and false negatives don't really matter, because this folio would get skipped either way when detecting that it has been reallocated in the meantime. Fix it by performing the folio_test_locked() check after grabbing a reference. If this ever becomes a real problem, we could add a special helper that racily checks if the bit is set even on tail pages ... but let's hope that's not required so we can just handle it cleaner: work on the folio after we hold a reference. Do we really need the folio_test_locked() check if we are going to trylock briefly after? Well, we can at least avoid a xas_reload(). It's a bit unclear which exact change introduced that issue. Likely, ever since we made PG_locked obey the PF_NO_TAIL policy it could have been triggered in some way. Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+9f9a7f73fb079b2387a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/674184c9.050a0220.1cc393.0001.GAE@google.com/ Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
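The reordering described above looks roughly like this sketch, abbreviated from the next_uptodate_folio() loop, so treat it as an illustration:

  /* pin the folio first ... */
  if (!folio_try_get(folio))
          continue;

  /* ... and only then look at its flags word; without the reference the
   * folio could be freed and reallocated under us, and the flags test
   * could land on a tail page */
  if (folio_test_locked(folio)) {
          folio_put(folio);
          continue;
  }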
2024-12-05  scatterlist: fix incorrect func name in kernel-doc  (Randy Dunlap, 1 file changed, -1/+1)
Fix a kernel-doc warning by making the kernel-doc function description match the function name: include/linux/scatterlist.h:323: warning: expecting prototype for sg_unmark_bus_address(). Prototype was for sg_dma_unmark_bus_address() instead Link: https://lkml.kernel.org/r/20241130022406.537973-1-rdunlap@infradead.org Fixes: 42399301203e ("lib/scatterlist: add flag for indicating P2PDMA segments in an SGL") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  mm: correct typo in MMAP_STATE() macro  (Lorenzo Stoakes, 1 file changed, -1/+1)
We mistakenly refer to len rather than len_ here. The only existing caller passes len to the len_ parameter so this has no impact on the code, but it is obviously incorrect to do this, so fix it. Link: https://lkml.kernel.org/r/20241118175414.390827-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  mm: respect mmap hint address when aligning for THP  (Kalesh Singh, 1 file changed, -0/+1)
Commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") updated __get_unmapped_area() to align the start address for the VMA to a PMD boundary if CONFIG_TRANSPARENT_HUGEPAGE=y. It does this by effectively looking up a region of size request_size + PMD_SIZE, and aligning up the start to a PMD boundary. Commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") opted out of this for 32bit due to regressions in mmap base randomization. Commit d4148aeab412 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes") restricted this to only mmap sizes that are multiples of the PMD_SIZE due to reported regressions in some performance benchmarks -- which seemed mostly due to the reduced spatial locality of related mappings due to the forced PMD-alignment.

Another unintended side effect has emerged: when a user specifies an mmap hint address, the THP alignment logic modifies the behavior, potentially ignoring the hint even if a sufficiently large gap exists at the requested hint location.

Example scenario: consider the following simplified virtual address (VA) space:

  ...
  0x200000-0x400000 --- VMA A
  0x400000-0x600000 --- Hole
  0x600000-0x800000 --- VMA B
  ...

A call to mmap() with hint=0x400000 and len=0x200000 behaves differently:

  - Before THP alignment: the requested region (size 0x200000) fits into the gap at 0x400000, so the hint is respected.
  - After alignment: the logic searches for a region of size 0x400000 (len + PMD_SIZE) starting at 0x400000. This search fails due to the mapping at 0x600000 (VMA B), and the hint is ignored, falling back to arch_get_unmapped_area[_topdown]().

In general, the hint is effectively ignored if there is any existing mapping in the range:

  [mmap_hint + mmap_size, mmap_hint + mmap_size + PMD_SIZE)

This changes the semantics of the mmap hint from "Respect the hint if a sufficiently large gap exists at the requested location" to "Respect the hint only if an additional PMD-sized gap exists beyond the requested size". This has performance implications for allocators that allocate their heap using mmap but try to keep it "as contiguous as possible" by using the end of the existing heap as the address hint. With the new behavior it's more likely to get a much less contiguous heap, adding extra fragmentation and performance overhead.

To restore the expected behavior, don't use thp_get_unmapped_area_vmflags() when the user provided a hint address, for anonymous mappings.

Note: as Yang Shi pointed out, the issue still remains for filesystems which are using thp_get_unmapped_area() for their get_unmapped_area() op. It is unclear which workloads would regress if we ignore THP alignment when the hint address is provided for such file-backed mappings, so this fix will be handled separately.
Link: https://lkml.kernel.org/r/20241118214650.3667577-1-kaleshsingh@google.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hans Boehm <hboehm@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
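A small userspace illustration of the regressed case; the addresses assume the simplified VA layout from the commit message, with a 2 MiB hole at the hint:

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          void *hint = (void *)0x400000;  /* assumed to sit in a 2 MiB hole */
          void *p = mmap(hint, 0x200000, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* With the fix, an anonymous mapping with an explicit hint skips
           * the THP padding, so p should equal hint whenever the hole is
           * large enough; previously p could silently land elsewhere. */
          printf("hint=%p got=%p\n", hint, p);
          return 0;
  }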
2024-12-05  mm: memcg: declare do_memsw_account inline  (John Sperbeck, 1 file changed, -1/+1)
In commit 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c"), the static do_memsw_account() function was moved from a .c file to a .h file. Unfortunately, the traditional inline keyword wasn't added. If a file (e.g., a unit test) includes the .h file, but doesn't refer to do_memsw_account(), it will get a warning like:

  mm/memcontrol-v1.h:41:13: warning: unused function 'do_memsw_account' [-Wunused-function]
     41 | static bool do_memsw_account(void)
        |             ^~~~~~~~~~~~~~~~

Link: https://lkml.kernel.org/r/20241128203959.726527-1-jsperbeck@google.com Fixes: 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c") Signed-off-by: John Sperbeck <jsperbeck@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  mm/codetag: swap tags when migrate pages  (David Wang, 3 files changed, -17/+25)
Current solution to adjust codetag references during page migration is done in 3 steps: 1. sets the codetag reference of the old page as empty (not pointing to any codetag); 2. subtracts counters of the new page to compensate for its own allocation; 3. sets codetag reference of the new page to point to the codetag of the old page. This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because set_codetag_empty() becomes NOOP. Instead, let's simply swap codetag references so that the new page is referencing the old codetag and the old page is referencing the new codetag. This way accounting stays valid and the logic makes more sense. Link: https://lkml.kernel.org/r/20241129025213.34836-1-00107082@163.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/ Acked-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  ocfs2: update seq_file index in ocfs2_dlm_seq_next  (Wengang Wang, 1 file changed, -0/+1)
The following INFO level message was seen:

  seq_file: buggy .next function ocfs2_dlm_seq_next [ocfs2] did not update position index

Fix: update *pos (and thus m->index) to make seq_read_iter happy, even though the index itself makes no sense to ocfs2_dlm_seq_next. Link: https://lkml.kernel.org/r/20241119174500.9198-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
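The shape of the fix is simply an increment of the position, sketched below; the body of the iterator is elided with a placeholder and only the added line matters:

  static void *ocfs2_dlm_seq_next(struct seq_file *m, void *v, loff_t *pos)
  {
          void *next = v;  /* placeholder for the existing lockres walk */

          (*pos)++;        /* keep seq_read_iter()'s index bookkeeping consistent */
          return next;
  }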
2024-12-05  stackdepot: fix stack_depot_save_flags() in NMI context  (Marco Elver, 2 files changed, -4/+12)
Per documentation, stack_depot_save_flags() was meant to be usable from NMI context if STACK_DEPOT_FLAG_CAN_ALLOC is unset. However, it still would try to take the pool_lock in an attempt to save a stack trace in the current pool (if space is available). This could result in deadlock if an NMI is handled while pool_lock is already held. To avoid deadlock, only try to take the lock in NMI context and give up if unsuccessful. The documentation is fixed to clearly convey this. Link: https://lkml.kernel.org/r/Z0CcyfbPqmxJ9uJH@elver.google.com Link: https://lkml.kernel.org/r/20241122154051.3914732-1-elver@google.com Fixes: 4434a56ec209 ("stackdepot: make fast paths lock-less again") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
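The NMI-safe locking described above can be sketched as follows; pool_lock is the lock in lib/stackdepot.c, and the surrounding code is abbreviated:

  unsigned long flags;

  if (in_nmi()) {
          /* never spin in NMI context: give up and let the caller
           * return an empty handle instead of deadlocking */
          if (!raw_spin_trylock_irqsave(&pool_lock, flags))
                  return 0;
  } else {
          raw_spin_lock_irqsave(&pool_lock, flags);
  }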
2024-12-05  mm: open-code page_folio() in dump_page()  (Matthew Wilcox (Oracle), 1 file changed, -2/+5)
page_folio() calls page_fixed_fake_head() which will misidentify this page as being a fake head and load off the end of 'precise'. We may have a pointer to a fake head, but that's OK because it contains the right information for dump_page(). gcc-15 is smart enough to catch this with -Warray-bounds:

  In function 'page_fixed_fake_head',
      inlined from '_compound_head' at ../include/linux/page-flags.h:251:24,
      inlined from '__dump_page' at ../mm/debug.c:123:11:
  ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside
  array bounds of 'struct page[1]' [-Warray-bounds=]

Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05  mm: open-code PageTail in folio_flags() and const_folio_flags()  (Matthew Wilcox (Oracle), 1 file changed, -2/+2)
It is unsafe to call PageTail() in dump_page() as page_is_fake_head() will almost certainly return true when called on a head page that is copied to the stack. That will cause the VM_BUG_ON_PGFLAGS() in const_folio_flags() to trigger when it shouldn't. Fortunately, we don't need to call PageTail() here; it's fine to have a pointer to a virtual alias of the page's flag word rather than the real page's flag word. Link: https://lkml.kernel.org/r/20241125201721.2963278-1-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>