Age | Commit message (Collapse) | Author | Files | Lines |
|
This reverts commit d3155742db89df3b3c96da383c400e6ff4d23c25.
The header_length field is byte unit, thus it can not express the number of
elements in header field. It seems that the argument for counted_by
attribute can have no arithmetic expression, therefore this commit just
reverts the issued commit.
Suggested-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20240725161648.130404-1-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
|
|
The minmax infrastructure is overkill for simple constants, and can
cause huge expansions because those simple constants are then used by
other things.
For example, 'pageblock_order' is a core VM constant, but because it was
implemented using 'min_t()' and all the type-checking that involves, it
actually expanded to something like 2.5kB of preprocessor noise.
And when that simple constant was then used inside other expansions:
#define pageblock_nr_pages (1UL << pageblock_order)
#define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages)
and we then use that inside a 'max()' macro:
case ISOLATE_SUCCESS:
update_cached = false;
last_migrated_pfn = max(cc->zone->zone_start_pfn,
pageblock_start_pfn(cc->migrate_pfn - 1));
the end result was that one statement expanding to 253kB in size.
There are probably other cases of this, but this one case certainly
stood out.
I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple
constant with specific type" use. These macros skip the type checking,
and as such need to be very sparingly used only for obvious cases that
have active issues like this.
Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have some very fancy min/max macros that have tons of sanity checking
to warn about mixed signedness etc.
This is all things that a sane compiler should warn about, but there are
no sane compiler interfaces for this, and '-Wsign-compare' is broken [1]
and not useful.
So then we compensate (some would say over-compensate) by doing the
checks manually with some truly horrid macro games.
And no, we can't just use __builtin_types_compatible_p(), because the
whole question of "does it make sense to compare these two values" is a
lot more complicated than that.
For example, it makes a ton of sense to compare unsigned values with
simple constants like "5", even if that is indeed a signed type. So we
have these very strange macros to try to make sensible type checking
decisions on the arguments to 'min()' and 'max()'.
But that can cause enormous code expansion if the min()/max() macros are
used with complicated expressions, and particularly if you nest these
things so that you get the first big expansion then expanded again.
The xen setup.c file ended up ballooning to over 50MB of preprocessed
noise that takes 15s to compile (obviously depending on the build host),
largely due to one single line.
So let's split that one single line to just be simpler. I think it ends
up being more legible to humans too at the same time. Now that single
file compiles in under a second.
Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/
Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1]
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Syzbot reported that a buffer state inconsistency was detected in
nilfs_btnode_create_block(), triggering a kernel bug.
It is not appropriate to treat this inconsistency as a bug; it can occur
if the argument block address (the buffer index of the newly created
block) is a virtual block number and has been reallocated due to
corruption of the bitmap used to manage its allocation state.
So, modify nilfs_btnode_create_block() and its callers to treat it as a
possible filesystem error, rather than triggering a kernel bug.
Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com
Fixes: a60be987d45d ("nilfs2: B-tree node cache")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Post my improvement of the test in e4a4ba415419 ("selftests/mm:
va_high_addr_switch: dynamically initialize testcases to enable LPA2
testing"):
The test begins to fail on 4k and 16k pages, on non-LPA2 systems. To
reduce noise in the CI systems, let us skip the test when higher address
space is not implemented.
Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com
Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It's expected that no page should be left in pcp_list after calling
zone_pcp_disable() in offline_pages(). Previously, it's observed that
offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
Cause:
There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
involving the pcp->count variable. See below scenario:
CPU0 CPU1
---------------- ---------------
spin_lock(&pcp->lock);
__rmqueue_pcplist() {
zone_pcp_disable() {
/* list is empty */
if (list_empty(list)) {
/* add pages to pcp_list */
alloced = rmqueue_bulk()
mutex_lock(&pcp_batch_high_lock)
...
__drain_all_pages() {
drain_pages_zone() {
/* read pcp->count, it's 0 here */
count = READ_ONCE(pcp->count)
/* 0 means nothing to drain */
/* update pcp->count */
pcp->count += alloced << order;
...
...
spin_unlock(&pcp->lock);
In this case, after calling zone_pcp_disable() though, there are still some
pages in pcp_list. And these pages in pcp_list are neither movable nor
isolated, offline_pages() gets stuck as a result.
Solution:
Expand the scope of the pcp->lock to also protect pcp->count in
drain_pages_zone(), to ensure no pages are left in the pcp list after
zone_pcp_disable()
[1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com
Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Oliver Sand reported a performance regression caused by commit
98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct
mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node
structure under the CONFIG_MEMCG_V1 config option. Apparently it causes a
false cache sharing between lruvec and lru_zone_size members of the
structure. Fix it by adding an explicit padding after the lruvec member.
Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems
like the introduced memory overhead is not significant enough to warrant
another divergence in the mem_cgroup_per_node layout, so the padding is
added unconditionally.
Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev
Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com
Tested-by: Oliver Sang <oliver.sang@intel.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Outline and export free_reserved_page() because modules use it and it in
turn uses page_ext_{get|put} which should not be exported. The same
result could be obtained by outlining {get|put}_page_tag_ref() but that
would have higher performance impact as these functions are used in more
performance critical paths.
Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com
Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: <stable@vger.kernel.org> [6.10]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com
Fixes: bc22c17e12c1 ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: Alain Knaff <alain@knaff.lu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
xarray can't support arbitrary page cache size. the largest and supported
page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71
("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However,
it's possible to have 512MB page cache in the huge memory's collapsing
path on ARM64 system whose base page size is 64KB. 512MB page cache is
breaking the limitation and a warning is raised when the xarray entry is
split as shown in the following example.
[root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
KernelPageSize: 64 kB
[root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
:
int main(int argc, char **argv)
{
const char *filename = TEST_XFS_FILENAME;
int fd = 0;
void *buf = (void *)-1, *p;
int pgsize = getpagesize();
int ret = 0;
if (pgsize != 0x10000) {
fprintf(stdout, "System with 64KB base page size is required!\n");
return -EPERM;
}
system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb");
system("echo 1 > /proc/sys/vm/drop_caches");
/* Open the xfs file */
fd = open(filename, O_RDONLY);
assert(fd > 0);
/* Create VMA */
buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
assert(buf != (void *)-1);
fprintf(stdout, "mapped buffer at 0x%p\n", buf);
/* Populate VMA */
ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE);
assert(ret == 0);
ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ);
assert(ret == 0);
/* Collapse VMA */
ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
assert(ret == 0);
ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE);
if (ret) {
fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno);
goto out;
}
/* Split xarray entry. Write permission is needed */
munmap(buf, TEST_MEM_SIZE);
buf = (void *)-1;
close(fd);
fd = open(filename, O_RDWR);
assert(fd > 0);
fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
TEST_MEM_SIZE - pgsize, pgsize);
out:
if (buf != (void *)-1)
munmap(buf, TEST_MEM_SIZE);
if (fd > 0)
close(fd);
return ret;
}
[root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test
[root@dhcp-10-26-1-207 ~]# /tmp/test
------------[ cut here ]------------
WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \
ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \
xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \
sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio
CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : xas_split_alloc+0xf8/0x128
lr : split_huge_page_to_list_to_order+0x1c4/0x780
sp : ffff8000ac32f660
x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0
x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d
x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000
x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000
x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c
x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8
x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40
x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
Call trace:
xas_split_alloc+0xf8/0x128
split_huge_page_to_list_to_order+0x1c4/0x780
truncate_inode_partial_folio+0xdc/0x160
truncate_inode_pages_range+0x1b4/0x4a8
truncate_pagecache_range+0x84/0xa0
xfs_flush_unmap_range+0x70/0x90 [xfs]
xfs_file_fallocate+0xfc/0x4d8 [xfs]
vfs_fallocate+0x124/0x2f0
ksys_fallocate+0x4c/0xa0
__arm64_sys_fallocate+0x24/0x38
invoke_syscall.constprop.0+0x7c/0xd8
do_el0_svc+0xb4/0xd0
el0_svc+0x44/0x1d8
el0t_64_sync_handler+0x134/0x150
el0t_64_sync+0x17c/0x180
Fix it by correcting the supported page cache orders, different sets for
DAX and other files. With it corrected, 512MB page cache becomes
disallowed on all non-DAX files on ARM64 system where the base page size
is 64KB. After this patch is applied, the test program fails with error
-EINVAL returned from __thp_vma_allowable_orders() and the madvise()
system call to collapse the page caches.
Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: <stable@vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't
force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is
because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT.
!CONFIG_64BIT should cover all 32 bit machines.
[1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com
Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reported-by: Yves-Alexis Perez <corsac@debian.org>
Tested-by: Yves-Alexis Perez <corsac@debian.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Salvatore Bonaccorso <carnil@debian.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org> [6.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
replaced do_set_pte() with set_pte_range() and that introduced a
regression in the following faulting path of non-anonymous vmas which
caused the PTE for the faulting address to be marked as old instead of
young.
handle_pte_fault()
do_pte_missing()
do_fault()
do_read_fault() || do_cow_fault() || do_shared_fault()
finish_fault()
set_pte_range()
The polarity of prefault calculation is incorrect. This leads to prefault
being incorrectly set for the faulting address. The following check will
incorrectly mark the PTE old rather than young. On some architectures
this will cause a double fault to mark it young when the access is
retried.
if (prefault && arch_wants_old_prefaulted_pte())
entry = pte_mkold(entry);
On a subsequent fault on the same address, the faulting path will see a
non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE
will then be correctly marked young in handle_pte_fault() itself.
Due to this bug, performance degradation in the fault handling path will
be observed due to unnecessary double faulting.
Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
Signed-off-by: Ram Tummala <rtummala@nvidia.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
My new address is james.clark@linaro.org
Link: https://lkml.kernel.org/r/20240709102512.31212-3-james.clark@linaro.org
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Hao Zhang <quic_hazha@quicinc.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Matt Ranostay <matt@ranostay.sg>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
My new address is james.clark@linaro.org
Link: https://lkml.kernel.org/r/20240709102512.31212-2-james.clark@linaro.org
Signed-off-by: James Clark <james.clark@linaro.org>
Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Hao Zhang <quic_hazha@quicinc.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Mao Jinlong <quic_jinlmao@quicinc.com>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Matt Ranostay <matt@ranostay.sg>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The 'single-channel' property is an uint32, not an array, so 'items' is
an incorrect constraint. This didn't matter until dtschema recently
changed how properties are decoded. This results in this warning:
Documentation/devicetree/bindings/iio/adc/adi,ad7192.example.dtb: adc@0: \
channel@1:single-channel: 1 is not of type 'array'
Fixes: caf7b7632b8d ("dt-bindings: iio: adc: ad7192: Add AD7194 support")
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240723230904.1299744-1-robh@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
On arm64 we build compressed images, but "make install" by default will
install the old non-compressed one. To actually get the compressed
image install, you need to use "make zinstall", which is not the usual
way to install a kernel.
Which may not sound like much of an issue, but when you deal with
multiple architectures (and years of your fingers knowing the regular
"make install" incantation), this inconsistency is pretty annoying.
But as Will Deacon says:
"Sadly, bootloaders being as top quality as you might expect, I don't
think we're in a position to rely on decompressor support across the
board. Our Image.gz is literally just that -- we don't have a built-in
decompressor (nor do I think we want to rush into that again after the
fun we had on arm32) and the recent EFI zboot support solves that
problem for platforms using EFI.
Changing the default 'install' target terrifies me. There are bound to
be folks with embedded boards who've scripted this and we could really
ruin their day if we quietly give them a compressed kernel that their
bootloader doesn't know how to handle :/"
So make this conditional on a new "COMPRESSED_INSTALL" option.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since the regulator can't be used without the USB controller also
tighten the dependency to match, as well as the default.
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20240726-regulator-restrict-rzg2l-v1-1-640e508896e2@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The RISC-V architecture makes a real time counter CSR (via RDTIME
instruction) available for applications in U-mode but there is no
architected mechanism for an application to discover the frequency
the counter is running at. Some applications (e.g., DPDK) use the
time counter for basic performance analysis as well as fine grained
time-keeping.
Add support to the hwprobe system call to export the time CSR
frequency to code running in U-mode.
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Reviewed-by: Evan Green <evan@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Punit Agrawal <punit.agrawal@bytedance.com>
Link: https://lore.kernel.org/r/20240702033731.71955-2-cuiyunhui@bytedance.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
This harmonizes all virtual addressing modes which can now all map
(PGDIR_SIZE * PTRS_PER_PGD) / 4 of physical memory.
The RISCV implementation of KASAN requires that the boundary between
shallow mappings are aligned on an 8G boundary. In this case we need
VMALLOC_START to be 8G aligned. So although we only need to move the
start of the linear mapping down by 4GiB to allow 128GiB to be mapped,
we actually move it down by 8GiB (creating a 4GiB hole between the
linear mapping and KASAN shadow space) to maintain the alignment
requirement.
Signed-off-by: Stuart Menefy <stuart.menefy@codasip.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240630110550.1731929-1-stuart.menefy@codasip.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Add support for the stackleak feature. Whenever the kernel returns to user
space the kernel stack is filled with a poison value.
At the same time, disables the plugin in EFI stub code because EFI stub
is out of scope for the protection.
Tested on qemu and milkv duo:
/ # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
[ 38.675575] lkdtm: Performing direct entry STACKLEAK_ERASING
[ 38.678448] lkdtm: stackleak stack usage:
[ 38.678448] high offset: 288 bytes
[ 38.678448] current: 496 bytes
[ 38.678448] lowest: 1328 bytes
[ 38.678448] tracked: 1328 bytes
[ 38.678448] untracked: 448 bytes
[ 38.678448] poisoned: 14312 bytes
[ 38.678448] low offset: 8 bytes
[ 38.689887] lkdtm: OK: the rest of the thread stack is properly erased
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240623235316.2010-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
"WARN_ON(unlikely(x))" is excessive. WARN_ON() already uses unlikely()
internally.
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Reviewed-by: Bjorn Andersson <quic_bjorande@quicinc.com>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Link: https://lore.kernel.org/r/20240620033434.3778156-1-quic_zhonhan@quicinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Many CPUs implement return address branch prediction as a stack. The
RISCV architecture refers to this as a return address stack (RAS). If
this gets corrupted then the CPU will mispredict at least one but
potentally many function returns.
There are two issues with the current RISCV exception code:
- We are using the alternate link stack (x5/t0) for the indirect branch
which makes the hardware think this is a function return. This will
corrupt the RAS.
- We modify the return address of handle_exception to point to
ret_from_exception. This will also corrupt the RAS.
Testing the null system call latency before and after the patch:
Visionfive2 (StarFive JH7110 / U74)
baseline: 189.87 ns
patched: 176.76 ns
Lichee pi 4a (T-Head TH1520 / C910)
baseline: 666.58 ns
patched: 636.90 ns
Just over 7% on the U74 and just over 4% on the C910.
Signed-off-by: Anton Blanchard <antonb@tenstorrent.com>
Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com>
Tested-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Link: https://lore.kernel.org/r/20240607061335.2197383-1-cyrilbur@tenstorrent.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Add header file for generic I3C flags to avoid hard code in dts file.
Signed-off-by: Carlos Song <carlos.song@nxp.com>
Reviewed-by: Frank Li <frank.li@nxp.com>
Acked-by: Jason Liu <jason.hui.liu@nxp.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240715225351.3237284-1-Frank.Li@nxp.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
This code has a typo so it returns positive EIO instead of negative -EIO. Fix
it!
Fixes: a7809cb368b9 ("i3c: master: svc: Improve DAA STOP handle code logic")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/e017edfc-da64-496b-8516-958bec27cd9a@stanley.mountain
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Improve the visibility of i3c_bus_type to facilitate searching for
i3c devices attached to the i3c bus. Enable other drivers to use
bus_register_notifier to monitor i3c bus device events.
Signed-off-by: Bhoomik Gupta <bhoomik.gupta@nxp.com>
Link: https://lore.kernel.org/r/20240708053835.3003986-1-bhoomik.gupta@nxp.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Add support for runtime and system power management.
Handle clocks, resets, pads as part of suspend and resume.
Restore controller registers that could be lost due to suspend.
Finally add get and put calls appropriately in functions which
access controller : bus_init, do_daa, send_ccc_cmd, priv_xfers,
i2c_xfers, ibi and hot-join.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240708062103.3296587-4-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Separate logic for setting intr/thld registers in a func.
Also modify enable function to take care of setting all fields in DEVICE_CTRL.
These functions can be reused later for power management.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240708062103.3296587-3-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Add variables to store timing registers and other values.
These variables would be later used to restore registers
during resume without recomputation.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240708062103.3296587-2-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The REQUEST_PROC_DAA command behaves differently from other commands.
Sometimes the hardware can auto emit STOP, but in other conditions, it
cannot.
Improves the code logic to better handle these situations.
Hardware can auto emit STOP only when the following conditions are met:
- The previous I3C device correctly returns a PID and ACKs an I3C address.
- A NACK is received when emitting 7E to try to get the next I3C device's
PID.
In all other cases, a manual STOP emission is needed.
The code is changed to emit STOP when break the while loop and 'return 0'
only when the hardware can auto emit STOP.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20240702223107.403057-1-Frank.Li@nxp.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Besides the core clock, IP also has an apb interface clock.
Add an optional hook for the same.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240628154618.327151-1-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Move to "enabled" variant of clk_get API. It takes care
of enable and disable calls during the probe and remove.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240628154603.326075-1-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Add dt binding for optional apb clock. Core clock is mandatory.
Signed-off-by: Aniket <aniketmaurya@google.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240628154526.324068-1-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Replace a comma between expression statements by a semicolon.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240702024758.1411569-1-nichen@iscas.ac.cn
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The dma.c: hci_dma_init() sets the CHUNK_SIZE field in the IBI_SETUP
register incorrectly if the calculated ibi_chunk_sz is not exactly
2^(n+2) bytes, where n is 0..6.
Fix this by rounding the chunk size up to nearest 2^(n+2) bytes.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Link: https://lore.kernel.org/r/20240628131559.502822-4-jarkko.nikula@linux.intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Definitely condition dma_get_cache_alignment * defined value > 256
during driver initialization is not reason to BUG_ON(). Turn that to
graceful error out with -EINVAL.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Link: https://lore.kernel.org/r/20240628131559.502822-3-jarkko.nikula@linux.intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
IBI Status and Data Ring base address registers are not set so HW
obviously cannot update those rings after In-Band Interrupt.
Set them to already allocated and mapped ring addresses.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Link: https://lore.kernel.org/r/20240628131559.502822-2-jarkko.nikula@linux.intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Rather than having own lo32()/hi32() helpers for dealing with 32-bit and
64-bit build targets switch to generic lower_32_bits()/upper_32_bits()
helpers.
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Link: https://lore.kernel.org/r/20240628131559.502822-1-jarkko.nikula@linux.intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Since DW I3C IP master role always supports IBI, we don't need
to keep two variants of master ops and select one using this
property. Hence remove the code.
Signed-off-by: Aniket <aniketmaurya@google.com>
Link: https://lore.kernel.org/r/20240627034119.3938050-1-aniketmaurya@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
IBI_SIR_REQ_REJECT register is not present if the IP has
IC_HAS_IBI_DATA = 1 set. So don't rely on doing read-
modify-write op on this register.
Instead maintain a variable to store the sir reject mask
and use it to set IBI_SIR_REQ_REJECT.
Signed-off-by: Aniket <aniketmaurya@google.com>
Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
QUEUE_THLD_CTRL_IBI_STAT_MASK is repeated twice.
Replace with QUEUE_THLD_CTRL_IBI_DATA_MASK.
Signed-off-by: Aniket <aniketmaurya@google.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
I was wrong about the TABLE_SIZE field description in the
commit 0676bfebf576 ("i3c: mipi-i3c-hci: Fix DAT/DCT entry sizes").
For the MIPI I3C HCI versions 1.0 and earlier the TABLE_SIZE field in
the registers DAT_SECTION_OFFSET and DCT_SECTION_OFFSET is indeed defined
in DWORDs and not number of entries like it is defined in later versions.
Where above fix allowed driver initialization to continue the wrongly
interpreted TABLE_SIZE field leads variables DAT_entries being twice and
DCT_entries four times as big as they really are.
That in turn leads clearing the DAT table over the boundary in the
dat_v1.c: hci_dat_v1_init().
So interprete the TABLE_SIZE field in DWORDs for HCI versions < 1.1 and
fix number of DAT/DCT entries accordingly.
Fixes: 0676bfebf576 ("i3c: mipi-i3c-hci: Fix DAT/DCT entry sizes")
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
According to I3C Spec 1.1.1, 11-Jun-2021, section: 5.1.2.2.3:
If the Controller chooses to start an I3C Message with an I3C Dynamic
Address, then special provisions shall be made because that same I3C Target
may be initiating an IBI or a Controller Role Request. So, one of three
things may happen: (skip 1, 2)
3. The Addresses match and the RnW bits also match, and so neither
Controller nor Target will ACK since both are expecting the other side to
provide ACK. As a result, each side might think it had "won" arbitration,
but neither side would continue, as each would subsequently see that the
other did not provide ACK.
...
For either value of RnW: Due to the NACK, the Controller shall defer the
Private Write or Private Read, and should typically transmit the Target
^^^^^^^^^^^^^^^^^^^
Address again after a Repeated START (i.e., the next one or any one prior
^^^^^^^^^^^^^
to a STOP in the Frame). Since the Address Header following a Repeated
START is not arbitrated, the Controller will always win (see Section
5.1.2.2.4).
Resend target address again if address is not 7E and controller get NACK.
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Replace a comma between expression statements by a semicolon.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240724020721.2389738-1-nichen@iscas.ac.cn
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
If the requested page is part of the previous multi-page folio, there
is no need to call read_mapping_folio() again.
Also, get rid of the remaining one of page->index [1] in our codebase.
[1] https://lore.kernel.org/r/Zp8fgUSIBGQ1TN0D@casper.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240723073024.875290-1-hsiangkao@linux.alibaba.com
|
|
FS_IOC_GETFSSYSFSPATH ioctl exposes /sys/fs path of a given filesystem,
potentially standarizing sysfs reporting. This patch add support for
FS_IOC_GETFSSYSFSPATH for erofs, "erofs/<dev>" will be outputted for bdev
cases, "erofs/[domain_id,]<fs_id>" will be outputted for fscache cases.
Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com>
Link: https://lore.kernel.org/r/20240720082335.441563-1-huangxiaojia2@huawei.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
In z_erofs_get_gbuf(), the current task may be migrated to another
CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`.
Therefore, z_erofs_put_gbuf() will trigger the following issue
which was found by stress test:
<2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58!
..
<4>[772156.435007]
<4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G E 6.10.0-rc7+ #2
<4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017
<4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
<4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs]
<4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs]
..
<6>[772156.445958] stress (3127): drop_caches: 1
<4>[772156.446120] Call trace:
<4>[772156.446121] z_erofs_put_gbuf+0x64/0x70 [erofs]
<4>[772156.446761] z_erofs_lz4_decompress+0x600/0x6a0 [erofs]
<4>[772156.446897] z_erofs_decompress_queue+0x740/0xa10 [erofs]
<4>[772156.447036] z_erofs_runqueue+0x428/0x8c0 [erofs]
<4>[772156.447160] z_erofs_readahead+0x224/0x390 [erofs]
..
Fixes: f36f3010f676 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable")
Cc: <stable@vger.kernel.org> # 6.10+
Reviewed-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240722035110.3456740-1-hsiangkao@linux.alibaba.com
|
|
Add support for STATX_DIOALIGN to EROFS, so that direct I/O
alignment restrictions are exposed to userspace in a generic
way.
[Before]
```
./statx_test /mnt/erofs/testfile
statx(/mnt/erofs/testfile) = 0
dio mem align:0
dio offset align:0
```
[After]
```
./statx_test /mnt/erofs/testfile
statx(/mnt/erofs/testfile) = 0
dio mem align:512
dio offset align:512
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240718083243.2485437-1-hsiangkao@linux.alibaba.com
|
|
The cited commit missed to check against the validity of the frame length
in the tun_xdp_one() path, which could cause a corrupted skb to be sent
downstack. Even before the skb is transmitted, the
tun_xdp_one-->eth_type_trans() may access the Ethernet header although it
can be less than ETH_HLEN. Once transmitted, this could either cause
out-of-bound access beyond the actual length, or confuse the underlayer
with incorrect or inconsistent header length in the skb metadata.
In the alternative path, tun_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted for
IFF_TAP.
This is to drop any frame shorter than the Ethernet header size just like
how tun_get_user() does.
CVE: CVE-2024-41091
Inspired-by: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Cc: stable@vger.kernel.org
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240724170452.16837-3-dongli.zhang@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The cited commit missed to check against the validity of the frame length
in the tap_get_user_xdp() path, which could cause a corrupted skb to be
sent downstack. Even before the skb is transmitted, the
tap_get_user_xdp()-->skb_set_network_header() may assume the size is more
than ETH_HLEN. Once transmitted, this could either cause out-of-bound
access beyond the actual length, or confuse the underlayer with incorrect
or inconsistent header length in the skb metadata.
In the alternative path, tap_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted.
This is to drop any frame shorter than the Ethernet header size just like
how tap_get_user() does.
CVE: CVE-2024-41090
Link: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
Fixes: 0efac27791ee ("tap: accept an array of XDP buffs through sendmsg()")
Cc: stable@vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240724170452.16837-2-dongli.zhang@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't dereference *sp after calling dev_kfree_skb(*sp).
Fixes: af69fb3a8ffa ("Add mISDN HFC multiport driver")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8be65f5a-c2dd-4ba0-8a10-bfe5980b8cfb@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|