The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
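A minimal sketch of the resulting pattern (identifiers illustrative, not taken from the s390 patch; the read_new() callback taking a const attribute is assumed per the sysfs constification series):

    static ssize_t foo_read(struct file *file, struct kobject *kobj,
                            const struct bin_attribute *attr,
                            char *buf, loff_t off, size_t count);

    /* The instance itself can now live in read-only memory. */
    static const struct bin_attribute foo_attr = {
        .attr = { .name = "foo", .mode = 0444 },
        .size = PAGE_SIZE,
        .read_new = foo_read,
    };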
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Tested-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Link: https://lore.kernel.org/r/20241211-sysfs-const-bin_attr-s390-v1-1-be01f66bfcf7@weissschuh.net
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Instead of spreading the memory allocation, and the pointer arithmetic
that places slib and sl on the page allocated for them, across multiple
functions and comments, move both into the same context directly next to
each other, so that the knowledge of how this is done is immediately
visible.
The actual layout in memory doesn't change with this, just the structure
of the code to achieve it.
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Steffen Maier <maier@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
ccw_device_get_ciw() already uses array indices to iterate over the vector
of CIWs, but then switches to pointer arithmetic when returning the one it
found. Change this to make it more consistent.
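The change, schematically (the field path is hedged; only the return expression is affected):

    -		return cdev->private->dma_area->senseid.ciw + ciw_cnt;
    +		return &cdev->private->dma_area->senseid.ciw[ciw_cnt];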
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
This feature is not only utilized by OSA, but by QDIO in general. Clear
up possible confusion.
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Steffen Maier <maier@linux.ibm.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Move the implementation of the s390 diagnose code to a diag-specific folder.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Retrieve electrical power readings for resources in a computing
environment via diag 0x324. diag 0x324 stores the power readings in the
power information block (pib).
Provide the power readings from the pib via the diag324 ioctl interface.
The diag324 ioctl provides a new pib to the user only if the threshold
time has passed since the last call; otherwise, cached data is returned.
When there are no active readers, the pib buffer is cleaned up.
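A hedged sketch of the caching rule (simplified; names hypothetical, not the actual ioctl implementation):

    #include <linux/jiffies.h>

    /* Hand out a freshly fetched pib only once the threshold time has
     * elapsed since the last diag 0x324 call; otherwise the caller is
     * served from the cached buffer.
     */
    static bool pib_needs_refresh(unsigned long last_update, unsigned long threshold)
    {
        return time_after(jiffies, last_update + threshold);
    }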
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Create a misc device /dev/diag to fetch diagnose specific information
from the kernel and provide it to userspace.
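A minimal sketch of such a registration (the device name is from the message; the fops and init function are illustrative):

    #include <linux/fs.h>
    #include <linux/miscdevice.h>
    #include <linux/module.h>

    static const struct file_operations diag_fops = {
        .owner = THIS_MODULE,
        /* ioctl handlers would go here */
    };

    static struct miscdevice diag_misc = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "diag",          /* appears as /dev/diag */
        .fops  = &diag_fops,
    };

    static int __init diag_init(void)
    {
        return misc_register(&diag_misc);
    }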
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
exrl is present in all machines currently supported by the Linux
kernel, therefore prefer it over ex. This saves one instruction
and doesn't need an additional register to hold the address of the
target instruction.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
By default page protection definitions like PAGE_RX have the _PAGE_NOEXEC
bit set. For older machines without the instruction execution protection
facility this bit is not allowed to be used in page table entries, and
therefore must be removed.
This is done in a couple of page table walkers, but also in some, though
not all, page table modification functions like ptep_modify_prot_commit().
Avoid all of this and change the page, segment, and region3 protection
definitions so that the noexec bit is masked out automatically if the
instruction execution-protection facility is not available. This is
similar to what various other architectures, which had to solve the same
problem, do as well.
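A hedged sketch of the idea (the mask variable and its initialization are illustrative, not necessarily the patch's names):

    /* Determined once at boot: all bits set, or all bits except
     * _PAGE_NOEXEC if the instruction execution-protection facility
     * is unavailable.
     */
    static unsigned long page_prot_mask = ~0UL;

    #define PAGE_RX \
        __pgprot((_PAGE_PRESENT | _PAGE_READ | _PAGE_NOEXEC) & page_prot_mask)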
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC,
and REGION3_KERNEL_EXEC.
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Remove an outdated comment, which is also located at a random place. Its
generic statement that read permissions imply execute permissions has
been wrong since the instruction execution-protection facility became
available.
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Using the newly introduced debug_dump() mechanism, add the formatted
content of pci_debug_msg_id to the PCI report. The formatting is based
on the existing sprintf format, but removes the caller pointer and area
index and adds a column header. This allows the platform to collect this
log data together with hardware errors. Set the reverse flag such that
the newest log entries are added to the PCI report even if not all debug
log entries fit.
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Co-developed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
In this mode debug_dump() writes the debug log starting at the newest
entry, followed by earlier entries. To this end add a debug_prev_entry()
helper analogous to debug_next_entry(), a helper to get the latest entry
(which is the one before the active entry), and a helper to iterate
either forward or backward.
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Co-developed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
The debug_dump() function makes it possible to get the content of a
debug log and view pair in a string buffer. One future application of
this is providing debug logs to the platform, to be collected together
with hardware error logs during recovery.
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Co-developed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Split the allocation and freeing of file_private_info_t out of open()
and close(), respectively. This will be used in a follow-on change to
access debug views without going through the s390dbf filesystem.
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Contrary to convention, debug_next_entry() returns a falsy 0 value if
there are more entries and a truthy 1 value when there are no more
entries. As there is only one caller, just reverse this logic to be less
surprising, and document the behavior in a kdoc comment. Also replace
the goto with an early return. In the future this allows using it in
a do {} while (debug_next_entry(...)) loop, as sketched below.
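A sketch of the loop enabled by the new convention (format_entry() is a hypothetical per-entry step):

    /* Iterate while debug_next_entry() reports more entries
     * (truthy return), per the new convention.
     */
    do {
        format_entry(p_info);
    } while (debug_next_entry(p_info));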
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Add a mechanism with which the status of PCI error recovery runs
is reported to the platform. Together with the status, supply additional
information that may aid in problem determination.
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Slightly clean up arch/s390/include/asm/hugetlb.h:
- Remove huge_pte_none() / huge_pte_none_mostly() which are identical
to the generic variants
- Coding style adjustments
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Although Kconfig specifies:

    config KERNEL_IMAGE_BASE
        hex "Kernel image base address"
        range 0x100000 0x1FFFFFE0000000 if !KASAN
        range 0x100000 0x1BFFFFE0000000 if KASAN
        default 0x3FFE0000000 if !KASAN
        default 0x7FFFE0000000 if KASAN

running "make defconfig" or "make debug_defconfig" followed by
"make kasan.config" results in a suboptimal
CONFIG_KERNEL_IMAGE_BASE=0x3FFE0000000. Add
CONFIG_KERNEL_IMAGE_BASE=0x7FFFE0000000 to kasan.config to address that.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Add missing include of <linux/smp.h> in abs_lowcore.h to provide
declarations for get_cpu() and put_cpu() used in the code.
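The addition, as described:

    #include <linux/smp.h>    /* get_cpu(), put_cpu() */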
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
For consistency, remove the `__bootdata` and `__bootdata_preserved`
section annotations from variable declarations in header files. Section
annotations should be applied to definitions, not declarations. This
change moves the annotations to the variable definitions in the
corresponding source files.
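The rule, schematically (identifier illustrative):

    /* header: declaration, no section annotation */
    extern unsigned long foo;

    /* source file: the definition carries the annotation */
    unsigned long foo __bootdata_preserved;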
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Use __atomic_add_const_and_test() within __preempt_count_dec_and_test().
With this, it is possible to decrement preempt_count and test whether
need_resched is set with a single instruction, provided the compiler
supports flag output operand constraints.
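A hedged sketch of the combined operation (simplified; the argument order and the lowcore accessor are assumptions):

    static __always_inline bool __preempt_count_dec_and_test(void)
    {
        /* Decrement and test for zero in one instruction. As on x86,
         * the need_resched state is folded into preempt_count, so a
         * "zero" result means the count dropped to zero while a
         * reschedule is needed.
         */
        return __atomic_add_const_and_test(-1, &get_lowcore()->preempt_count);
    }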
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Provide arch_atomic_*_and_test() implementations which make use of flag
output constraints, and allow the compiler to generate slightly better
code.
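The general technique, sketched with GCC's x86 flag-output syntax as an analogy (s390 uses its own condition-code variant of the same mechanism):

    static inline bool add_and_test_zero(int i, int *v)
    {
        bool zero;

        /* "=@ccz" hands the zero flag set by the addl directly to the
         * compiler; no separate compare instruction is needed.
         */
        asm volatile("lock addl %2,%0"
                     : "+m" (*v), "=@ccz" (zero)
                     : "ir" (i)
                     : "memory");
        return zero;
    }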
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
GCC uses the number of lines of an inline assembly to calculate its length
(number of instructions). This has an impact on GCC's inlining decisions.
Therefore remove superfluous new lines from a couple of inline
assemblies, so that their real size is reflected.
Also use an "asm inline" statement for the fpu_lfpc_safe() inline assembly
to enforce that GCC assumes the minimum size for this inline assembly,
since it contains various statements which make it appear much larger than
the resulting code is.
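A hedged illustration of both points (s390 assembly; not a specific patched site):

    static inline int copy_reg(int src)
    {
        int dst;

        /* One instruction: a trailing "\n" or blank line would make
         * GCC count it as two. "asm inline" additionally tells GCC to
         * assume the minimum size regardless of line count.
         */
        asm inline volatile("lr %0,%1" : "=d" (dst) : "d" (src));
        return dst;
    }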
Suggested-by: Juergen Christ <jchrist@linux.ibm.com>
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Just remove a line break which reduces readability.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Remove the preempt count implementation for pre-MARCH_HAS_Z196_FEATURES
builds. If the kernel is compiled with PREEMPT=n, which is the default for
all distributions, this has close to zero impact in the generated code.
Therefore remove the alternative implementation to keep things simple.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
The s390 preempt_count implementation is more or less a copy of the x86
implementation using different instructions. For clarification how this
works also add all comments from x86 with some minor modifications.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
With commit c8a91c285d8c ("s390/atomic: move remaining inline assemblies to
atomic_ops.h") all remaining atomic inline assemblies have been moved to
atomic_ops.h.
However the result is inconsistent: the functions in atomic_ops.h are
supposed to be used with integral types like int and long pointers, while
the functions in atomic.h work with atomic types.
This layering got violated with the named commit. Therefore adjust this
now, and also use consistent variable names in atomic_ops.h.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Implement arch_atomic_inc() / arch_atomic_dec() functions which result
in a single instruction if compiled for z196 or newer architectures.
Reduces the kernel image size by ~6K (defconfig):
bloat-o-meter:
add/remove: 0/0 grow/shrink: 12/1005 up/down: 106/-6404 (-6298)
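A hedged sketch (the helper name follows atomic_ops.h; the exact form is assumed):

    static __always_inline void arch_atomic_inc(atomic_t *v)
    {
        /* On z196 or newer this can compile to a single "asi"
         * (add immediate to storage) instruction.
         */
        __atomic_add_const(1, &v->counter);
    }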
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Some small cleanups to stack_alloc() and stack_free():
- Rename ret to stack to reflect what the variable is used for
- Whitespace removal
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
There is no point in supporting !VMAP_STACK kernel builds. VMAP_STACK has
proven to work for many years. Also, since KASAN_VMALLOC is supported,
kernels built with !VMAP_STACK are completely untested.
Therefore select VMAP_STACK unconditionally and remove all config options
and code required for !VMAP_STACK builds.
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Reduce the number of config options to be considered and select
KASAN_VMALLOC if KASAN is enabled.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Since commit 13b25489b6f8 ("kbuild: change working directory to external
module directory with M="), the Debian package build fails if a relative
path is specified with the O= option.
$ make O=build bindeb-pkg
[ snip ]
dpkg-deb: building package 'linux-image-6.13.0-rc1' in '../linux-image-6.13.0-rc1_6.13.0-rc1-6_amd64.deb'.
Rebuilding host programs with x86_64-linux-gnu-gcc...
make[6]: Entering directory '/home/masahiro/linux/build'
/home/masahiro/linux/Makefile:190: *** specified kernel directory "build" does not exist. Stop.
This occurs because the sub_make_done flag is cleared, even though the
working directory is already in the output directory.
Passing KBUILD_OUTPUT=. resolves the issue.
Fixes: 13b25489b6f8 ("kbuild: change working directory to external module directory with M=")
Reported-by: Charlie Jenkins <charlie@rivosinc.com>
Closes: https://lore.kernel.org/all/Z1DnP-GJcfseyrM3@ghost/
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
The compiler can fully inline the actual handler function of an interrupt
entry into the .irqentry.text entry point. If such a function contains an
access which has an exception table entry, modpost complains about a
section mismatch:
WARNING: vmlinux.o(__ex_table+0x447c): Section mismatch in reference ...
The relocation at __ex_table+0x447c references section ".irqentry.text"
which is not in the list of authorized sections.
Add .irqentry.text to OTHER_SECTIONS to cure the issue.
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # needed for linux-5.4-y
Link: https://lore.kernel.org/all/20241128111844.GE10431@google.com/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
When ensuring EFER.AUTOIBRS is set, WARN only on a negative return code
from msr_set_bit(), as '1' is used to indicate the WRMSR was successful
('0' indicates the MSR bit was already set).
Fixes: 8cc68c9c9e92 ("x86/CPU/AMD: Make sure EFER[AIBRSE] is set")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/Z1MkNofJjt7Oq0G6@google.com
Closes: https://lore.kernel.org/all/20241205220604.GA2054199@thelio-3990X
|
|
Add more test cases for LPM trie in test_maps:
1) test_lpm_trie_update_flags
It constructs various use cases for BPF_EXIST and BPF_NOEXIST and checks
whether the return value of the update operation is as expected.
2) test_lpm_trie_update_full_maps
It tests the update operations on a full LPM trie map. Adding a new node
will fail and overwriting the value of an existing node will succeed.
3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints
These two test cases check whether the iteration through get_next_key is
sorted and as expected. They delete the minimal key after each iteration
and check whether the next iteration returns the second minimal key. The
only difference between the two is that the former saves strings in the
LPM trie and the latter saves integers. Without the fix of get_next_key,
these two cases fail as shown below:
test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS
test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1
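A hedged sketch of the iteration check (libbpf calls; simplified from the described tests):

    #include <bpf/bpf.h>

    /* Fetch the minimal key (get_next_key with a NULL key returns the
     * first key), delete it, and repeat: every round must now yield
     * what was previously the second-minimal key.
     */
    static void drain_in_order(int map_fd, void *key)
    {
        while (!bpf_map_get_next_key(map_fd, NULL, key)) {
            /* compare against the expected next key here */
            bpf_map_delete_elem(map_fd, key);
        }
    }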
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-10-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move test_lpm_map.c to map_tests/ to include LPM trie test cases in
regular test_maps run. Most code remains unchanged, including the use of
assert(). Only reduce n_lookups from 64K to 512, which decreases
test_lpm_map runtime from 37s to 0.7s.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-9-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After switching from kmalloc() to the bpf memory allocator, there will be
no blocking operation during the update of an LPM trie. Therefore, change
trie->lock from spinlock_t to raw_spinlock_t to make the LPM trie usable
in atomic context, even on RT kernels.
The max value of prefixlen is 2048; therefore, update or deletion
operations will find the target after at most 2048 comparisons. In a test
case which updates an element after 2048 comparisons on an 8-CPU VM, the
average and the maximal time for such an update operation are about 210us
and 900us.
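The type change, schematically (other fields omitted):

    struct lpm_trie {
        struct bpf_map map;
        /* ... */
        raw_spinlock_t lock;    /* was: spinlock_t lock; */
    };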
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Multiple syzbot warnings have been reported. These warnings are mainly
about the lock order between trie->lock and kmalloc()'s internal lock.
See report [1] as an example:
======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
------------------------------------------------------
syz.3.2069/15008 is trying to acquire lock:
ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...
but task is already holding lock:
ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&trie->lock){-.-.}-{2:2}:
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x3a/0x60
trie_delete_elem+0xb0/0x820
___bpf_prog_run+0x3e51/0xabd0
__bpf_prog_run32+0xc1/0x100
bpf_dispatcher_nop_func
......
bpf_trace_run2+0x231/0x590
__bpf_trace_contention_end+0xca/0x110
trace_contention_end.constprop.0+0xea/0x170
__pv_queued_spin_lock_slowpath+0x28e/0xcc0
pv_queued_spin_lock_slowpath
queued_spin_lock_slowpath
queued_spin_lock
do_raw_spin_lock+0x210/0x2c0
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x42/0x60
__put_partials+0xc3/0x170
qlink_free
qlist_free_all+0x4e/0x140
kasan_quarantine_reduce+0x192/0x1e0
__kasan_slab_alloc+0x69/0x90
kasan_slab_alloc
slab_post_alloc_hook
slab_alloc_node
kmem_cache_alloc_node_noprof+0x153/0x310
__alloc_skb+0x2b1/0x380
......
-> #0 (&n->list_lock){-.-.}-{2:2}:
check_prev_add
check_prevs_add
validate_chain
__lock_acquire+0x2478/0x3b30
lock_acquire
lock_acquire+0x1b1/0x560
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x3a/0x60
get_partial_node.part.0+0x20/0x350
get_partial_node
get_partial
___slab_alloc+0x65b/0x1870
__slab_alloc.constprop.0+0x56/0xb0
__slab_alloc_node
slab_alloc_node
__do_kmalloc_node
__kmalloc_node_noprof+0x35c/0x440
kmalloc_node_noprof
bpf_map_kmalloc_node+0x98/0x4a0
lpm_trie_node_alloc
trie_update_elem+0x1ef/0xe00
bpf_map_update_value+0x2c1/0x6c0
map_update_elem+0x623/0x910
__sys_bpf+0x90c/0x49a0
...
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&trie->lock);
                               lock(&n->list_lock);
                               lock(&trie->lock);
  lock(&n->list_lock);
*** DEADLOCK ***
[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7
A bpf program attached to trace_contention_end() triggers after
acquiring &n->list_lock. The program invokes trie_delete_elem(), which
then acquires trie->lock. However, it is possible that another
process is invoking trie_update_elem(). trie_update_elem() will acquire
trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
get_partial_node() and try to acquire &n->list_lock (not necessarily the
same lock object). Therefore, lockdep warns about the circular locking
dependency.
Invoking kmalloc() before acquiring trie->lock could fix the warning.
However, since BPF programs can be invoked from any context (e.g.,
through kprobe/tracepoint/fentry), there may still be lock ordering
problems for internal locks in kmalloc() or for trie->lock itself.
To eliminate these potential lock ordering problems with kmalloc()'s
internal locks, replace kmalloc()/kfree()/kfree_rcu() with equivalent
BPF memory allocator APIs that can be invoked in any context. The lock
ordering problems with trie->lock (e.g., reentrance) will be handled
separately.
Three aspects of this change require explanation:
1. Intermediate and leaf nodes are allocated from the same allocator.
Since the value size of an LPM trie is usually small, using a single
allocator reduces the memory overhead of the BPF memory allocator.
2. Leaf nodes are allocated before disabling IRQs. This handles cases
where leaf_size is large (e.g., > 4KB - 8) and updates require
intermediate node allocation. If leaf nodes were allocated in the
IRQ-disabled region, the free objects in the BPF memory allocator would
not be refilled in time and the intermediate node allocation may fail.
3. Paired migrate_{disable|enable}() calls for node alloc and free. The
BPF memory allocator uses per-CPU structs internally; these paired calls
are necessary to guarantee correctness, as sketched below.
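A hedged sketch of point 3 (the ma field and the helper are assumptions based on the description):

    #include <linux/bpf_mem_alloc.h>

    static struct lpm_trie_node *lpm_node_alloc(struct lpm_trie *trie)
    {
        struct lpm_trie_node *node;

        /* The BPF memory allocator relies on per-CPU structures, so
         * pin the task to the current CPU around the allocation.
         */
        migrate_disable();
        node = bpf_mem_cache_alloc(&trie->ma);
        migrate_enable();
        return node;
    }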
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match. However, this is incorrect: when the target key doesn't
fully match the found node (e.g., node->prefixlen != matchlen), these
two nodes may still have the same prefixlen. The function returns the
expected result when the passed key exists in the trie, but when a
recently-deleted or nonexistent key is passed to trie_get_next_key(), it
may skip keys and return an incorrect result.
Fix it by using node->prefixlen == matchlen to identify exact matches.
When this condition is true after the search, it also implies that
node->prefixlen equals key->prefixlen; otherwise, the search would
return NULL instead.
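The fix, schematically (the surrounding source shape is hedged):

    -    node->prefixlen == key->prefixlen   /* exact-match test, before */
    +    node->prefixlen == matchlen         /* exact-match test, after */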
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When an LPM trie is full, in-place updates of existing elements
incorrectly return -ENOSPC.
Fix this by deferring the check of trie->n_entries. For new insertions,
n_entries must not exceed max_entries. However, in-place updates are
allowed even when the trie is full.
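A hedged sketch of the deferred check (simplified; not the exact kernel code):

    #include <linux/errno.h>

    static int check_capacity(size_t n_entries, size_t max_entries,
                              bool in_place_update)
    {
        /* Only a new insertion consumes an entry; in-place updates of
         * an existing node are allowed even when the trie is full.
         */
        if (!in_place_update && n_entries == max_entries)
            return -ENOSPC;
        return 0;
    }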
Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags. These flags can be specified by users and are relevant since LPM
trie supports exact matches during update.
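A hedged sketch of the flag semantics being enforced (standard bpf map update semantics; the exact error paths are assumptions):

    static int check_update_flags(u64 flags, bool exact_match)
    {
        if (flags == BPF_NOEXIST && exact_match)
            return -EEXIST;   /* element must not already exist */
        if (flags == BPF_EXIST && !exact_match)
            return -ENOENT;   /* element must already exist */
        return 0;
    }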
Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
There is no need to call kfree(im_node) when updating element fails,
because im_node must be NULL. Remove the unnecessary kfree() for
im_node.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When "node->prefixlen == matchlen" is true, it means that the node is
fully matched. If "node->prefixlen == key->prefixlen" is false, it means
the prefix length of key is greater than the prefix length of node,
otherwise, matchlen will not be equal with node->prefixlen. However, it
also implies that the prefix length of node must be less than
max_prefixlen.
Therefore, "node->prefixlen == trie->max_prefixlen" will always be false
when the check of "node->prefixlen == key->prefixlen" returns false.
Remove this unnecessary comparison.
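The removal, schematically (the shape of the walk-loop check is abridged and hedged):

         if (node->prefixlen != matchlen ||
    -        node->prefixlen == key->prefixlen ||
    -        node->prefixlen == trie->max_prefixlen)
    +        node->prefixlen == key->prefixlen)
             break;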
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Registering and unregistering a cpuhp callback requires the global CPU
hotplug lock, which is used everywhere. Meanwhile, q->sysfs_lock is used
almost everywhere in the block layer.
It is easy to trigger a lockdep warning [1] by connecting the two locks.
Fix the warning by moving blk-mq's cpuhp callback registration out of
q->sysfs_lock. Add one dedicated global lock covering registration and
unregistration of an hctx's cpuhp callbacks; this is safe because the
hctx is guaranteed to be live if its request_queue is live.
[1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/
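A hedged sketch of the dedicated lock (simplified; the helper name is illustrative):

    static DEFINE_MUTEX(blk_mq_cpuhp_lock);  /* dedicated global lock */

    static void blk_mq_add_hctx_cpuhp(struct blk_mq_hw_ctx *hctx)
    {
        /* Safe without q->sysfs_lock: the hctx is live as long as its
         * request_queue is live.
         */
        mutex_lock(&blk_mq_cpuhp_lock);
        cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
                                         &hctx->cpuhp_online);
        mutex_unlock(&blk_mq_cpuhp_lock);
    }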
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Reported-by: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to retrieve 'hctx' from the xarray table in the cpuhp callback,
so the callback should be registered after this 'hctx' is added to the
xarray table.
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
dfs_cache_refresh() delayed worker could race with cifs_put_tcon(), so
make sure to call list_replace_init() on @tcon->dfs_ses_list after
kworker is cancelled or finished.
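The ordering, schematically (the delayed-work field name is an assumption, not taken from the patch):

    /* Make sure the dfs_cache_refresh() worker cannot race with us
     * before touching @tcon->dfs_ses_list.
     */
    cancel_delayed_work_sync(&tcon->dfs_cache_work);  /* hypothetical field */
    list_replace_init(&tcon->dfs_ses_list, &ses_list);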
Fixes: 4f42a8b54b5c ("smb: client: fix DFS interlink failover")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Some servers which implement the SMB3.1.1 POSIX extensions did not
set the file type in the mode in the infolevel 100 response.
With the recent changes for checking the file type via the mode field,
this can cause the root directory to be reported incorrectly and
mounts (e.g. to ksmbd) to fail.
Fixes: 6a832bc8bbb2 ("fs/smb/client: Implement new SMB3 POSIX type")
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Cc: Ralph Boehme <slow@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Linux remembers cpu_cacheinfo::num_leaves per CPU, but x86 initializes all
CPUs from the same global "num_cache_leaves".
This is erroneous on systems such as Meteor Lake, where each CPU has a
distinct num_leaves value. Delete the global "num_cache_leaves" and
initialize num_leaves on each CPU.
init_cache_level() no longer needs to set num_leaves. Also, it never had
to set num_levels, as it is unnecessary on x86. Keep checking for zero
cache leaves, as such a condition indicates a bug.
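A hedged sketch of the per-CPU initialization (simplified; not the exact patch):

    #include <linux/cacheinfo.h>

    static void set_cpu_num_leaves(unsigned int cpu, unsigned int leaves)
    {
        struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);

        ci->num_leaves = leaves;  /* was derived from a global before */
    }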
[ bp: Cleanup. ]
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org # 6.3+
Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
|