|
schbench shows a latency increase for the 95th percentile and above since:
commit 0b0695f2b34a ("sched/fair: Rework load_balance()")
Align the behavior of the load balancer with the wake-up path, which tries
to select an idle CPU belonging to the LLC for a waking task.
calculate_imbalance() will use nr_running instead of the spare
capacity when CPUs share resources (i.e. cache) at the domain level. This
will ensure a better spread of tasks on idle CPUs.
Running schbench on a hikey (8cores arm64) shows the problem:
tip/sched/core :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
50.0th: 33
75.0th: 45
90.0th: 51
95.0th: 4152
*99.0th: 14288
99.5th: 14288
99.9th: 14288
min=0, max=14276
tip/sched/core + patch :
schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
Latency percentiles (usec)
50.0th: 34
75.0th: 47
90.0th: 52
95.0th: 78
*99.0th: 94
99.5th: 94
99.9th: 94
min=0, max=94
Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org
|
|
Commit:
765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.
For the latter, echoing changes to /sys/kernel/debug/sched_features has the
nasty effect of changing what sched_features reports without actually
changing the scheduler behaviour (since different translation units get
different copies of sysctl_sched_features).
Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.
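The failure mode is the classic "static definition in a shared header" trap;
a toy illustration (file and variable names are made up):

  /* feat.h -- the buggy shape: a 'static' definition in a header gives
   * every including .c file its own private copy of the mask. */
  static unsigned int sysctl_features = 1;

  /* knob.c -- flips the feature, but only in knob.c's copy... */
  void feature_disable(void) { sysctl_features = 0; }

  /* core.c -- ...so the scheduler's copy still reads 1, unchanged. */
  int feature_enabled(void) { return sysctl_features & 1; }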
Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
|
|
In the following commit:
04f5c362ec6d ("sched/fair: Replace zero-length array with flexible-array")
a zero-length array cpumask[0] has been replaced with cpumask[].
But there is still a cpumask[0] in 'struct sched_group_capacity'
which was missed.
The point of using [] instead of [0] is that with [] the compiler will
generate a build warning if it isn't the last member of a struct.
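For illustration, the difference in miniature (names are made up, not the
kernel struct):

  #include <stdlib.h>

  struct sgc_like {
      unsigned long ref;
      unsigned long cpumask[];   /* '[]' must be the last member; the
                                  * compiler rejects any other placement,
                                  * while the old '[0]' idiom compiled
                                  * silently anywhere in the struct. */
  };

  int main(void)
  {
      /* Trailing storage is allocated past the fixed part of the struct. */
      struct sgc_like *p = malloc(sizeof(*p) + 4 * sizeof(unsigned long));

      if (p)
          p->cpumask[0] = 1UL;
      free(p);
      return 0;
  }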
[ mingo: Rewrote the changelog. ]
Signed-off-by: zhuguangqing <zhuguangqing@xiaomi.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com
|
|
There's a possible race in perf_mmap_close() when checking ring buffer's
mmap_count refcount value. The problem is that the mmap_count check is
not atomic because we call atomic_dec() and atomic_read() separately.
  perf_mmap_close:
    ...
    atomic_dec(&rb->mmap_count);
    ...
    if (atomic_read(&rb->mmap_count))
        goto out_put;

    <ring buffer detach>
    free_uid

  out_put:
    ring_buffer_put(rb); /* could be last */
The race can happen when we have two (or more) events sharing the same ring
buffer: both go through atomic_dec() and then both see 0 as the refcount
value in the later atomic_read(). Then both go on and execute code which
is meant to be run just once.
The code that detaches the ring buffer is probably fine to be executed more
than once, but the problem is the call to free_uid(), which later shows up
in related crashes and refcount warnings, like:
refcount_t: addition on 0; use-after-free.
...
RIP: 0010:refcount_warn_saturate+0x6d/0xf
...
Call Trace:
prepare_creds+0x190/0x1e0
copy_creds+0x35/0x172
copy_process+0x471/0x1a80
_do_fork+0x83/0x3a0
__do_sys_wait4+0x83/0x90
__do_sys_clone+0x85/0xa0
do_syscall_64+0x5b/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Use an atomic decrement-and-test instead of the separate calls.
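In miniature, with C11 atomics (a toy model of the pattern, not the kernel
code):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  static atomic_int mmap_count;

  /* Decrement and test in ONE atomic step: atomic_fetch_sub() returns the
   * previous value, so exactly one caller -- the one taking the count from
   * 1 to 0 -- sees true and runs the detach/free_uid() path. */
  static bool drop_last_reference(void)
  {
      return atomic_fetch_sub(&mmap_count, 1) == 1;
  }

  int main(void)
  {
      atomic_store(&mmap_count, 2);   /* two events share one ring buffer */
      printf("%d %d\n", drop_last_reference(), drop_last_reference());
      return 0;                       /* prints "0 1": only one "last" */
  }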
Tested-by: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Wade Mealing <wmealing@redhat.com>
Fixes: 9bb5d40cd93c ("perf: Fix mmap() accounting hole")
Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
|
|
When memory is hotplug added or removed, min_free_kbytes should be
recalculated based on what is expected by khugepaged. Currently after
hotplug, min_free_kbytes is reset to a lower default, and the higher
default set when THP is enabled is lost.
This change restores min_free_kbytes as expected for THP consumers.
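A sketch of the intended ordering (illustrative only; the call site in mm/
and the exact shape of the khugepaged hook are assumptions):

  /* After hotplug rebuilds the default zone watermarks, let khugepaged
   * re-apply its recommended (higher) min_free_kbytes so the THP-aware
   * value is not silently replaced by the lower default. */
  init_per_zone_wmark_min();            /* resets min_free_kbytes */
  khugepaged_min_free_kbytes_update();  /* restores the THP recommendation */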
[vijayb@linux.microsoft.com: v5]
Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com
Fixes: f000565adb77 ("thp: set recommended min free kbytes")
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Allen Pais <apais@microsoft.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The swap address_space doesn't have a host, so the kernel crashes once a
swap write meets an error. Fix it.
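A sketch of the guard (illustrative; the exact placement inside
mapping_set_error() is an assumption):

  /* The swap address_space has no backing inode, so check mapping->host
   * before dereferencing it to record the error on the superblock. */
  if (mapping->host)
      errseq_set(&mapping->host->i_sb->s_wb_err, error);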
Fixes: 735e4ae5ba28 ("vfs: track per-sb writeback errors and report them to syncfs")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201010000650.750063-1-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
syzbot reported the below general protection fault:
general protection fault, probably for non-canonical address
0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
Call Trace:
free_pgtables+0x1b3/0x2f0 mm/memory.c:415
exit_mmap+0x2c0/0x530 mm/mmap.c:3184
__mmput+0x122/0x470 kernel/fork.c:1076
mmput+0x53/0x60 kernel/fork.c:1097
exit_mm kernel/exit.c:483 [inline]
do_exit+0xa8b/0x29f0 kernel/exit.c:793
do_group_exit+0x125/0x310 kernel/exit.c:903
get_signal+0x428/0x1f00 kernel/signal.c:2757
arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
entry_SYSCALL_64_after_hwframe+0x44/0xa9
It's because the ->mmap() callback can change vma->vm_file and fput() the
original file. But commit d70cec898324 ("mm: mmap: merge vma after
call_mmap() if possible") failed to catch this case and always fput()s
the original file, hence performing an extra fput().
[ Thanks Hillf for pointing this extra fput() out. ]
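The shape of the fix, sketched from the description above (simplified; not
the verbatim mm/mmap.c diff):

  error = call_mmap(file, vma);  /* the driver's ->mmap() may replace
                                  * vma->vm_file and fput() 'file' itself */
  ...
  if (merge) {
      /* Buggy code did fput(file): when ->mmap() had already dropped the
       * original file, that is one fput() too many (use-after-free).
       * Drop the file actually attached to the vma instead. */
      fput(vma->vm_file);
      vm_area_free(vma);
      vma = merge;
  }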
Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Hongxiang Lou <louhongxiang@huawei.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use my kernel.org address instead of my bootlin.com one.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20201005164533.16811-1-atenart@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As more email from git history gets aimed at the Openwall
kernel-hardening@ list, there has been a desire to separate "new topics"
from "on-going" work.
To handle this, the superset of hardening email topics are now to be
directed to linux-hardening@vger.kernel.org.
Update the MAINTAINERS file and the .mailmap to accomplish this, so that
linux-hardening@ can be treated like any other regular upstream kernel
development list.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: "Tobin C. Harding" <me@tobin.cc>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/linux-hardening/202010051443.279CC265D@keescook/
Link: https://lkml.kernel.org/r/20201006000012.2768958-1-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On the setxattr() syscall path, due to an apparent typo, the size of a
dynamically allocated memory chunk for storing a struct
smb2_file_full_ea_info object is computed incorrectly: the first addend is
the size of a pointer instead of the size of the wanted object.
Coincidentally it makes no difference on 64-bit platforms, however on 32-bit
targets the following memcpy() writes 4 bytes of data outside of the
dynamically allocated memory.
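The bug class in miniature (a standalone illustration; the struct and names
are made up):

  #include <stdlib.h>
  #include <string.h>

  struct ea_info { int flags; char name_len; char data[]; };

  struct ea_info *ea_alloc(const char *payload, size_t n)
  {
      /* The bug was the moral equivalent of malloc(sizeof(ea) + n):
       * sizeof a POINTER (4 bytes on 32-bit, 8 on 64-bit), so on 32-bit
       * the buffer is 4 bytes short and the memcpy() below overruns it.
       * sizeof(*ea) asks for the size of the pointed-to object. */
      struct ea_info *ea = malloc(sizeof(*ea) + n);

      if (ea)
          memcpy(ea->data, payload, n);
      return ea;
  }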
=============================================================================
BUG kmalloc-16 (Not tainted): Redzone overwritten
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: 0x79e69a6f-0x9e5cdecf @offset=368. First byte 0x73 instead of 0xcc
INFO: Slab 0xd36d2454 objects=85 used=51 fp=0xf7d0fc7a flags=0x35000201
INFO: Object 0x6f171df3 @offset=352 fp=0x00000000
Redzone 5d4ff02d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................
Object 6f171df3: 00 00 00 00 00 05 06 00 73 6e 72 75 62 00 66 69 ........snrub.fi
Redzone 79e69a6f: 73 68 32 0a sh2.
Padding 56254d82: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
CPU: 0 PID: 8196 Comm: attr Tainted: G B 5.9.0-rc8+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
dump_stack+0x54/0x6e
print_trailer+0x12c/0x134
check_bytes_and_report.cold+0x3e/0x69
check_object+0x18c/0x250
free_debug_processing+0xfe/0x230
__slab_free+0x1c0/0x300
kfree+0x1d3/0x220
smb2_set_ea+0x27d/0x540
cifs_xattr_set+0x57f/0x620
__vfs_setxattr+0x4e/0x60
__vfs_setxattr_noperm+0x4e/0x100
__vfs_setxattr_locked+0xae/0xd0
vfs_setxattr+0x4e/0xe0
setxattr+0x12c/0x1a0
path_setxattr+0xa4/0xc0
__ia32_sys_lsetxattr+0x1d/0x20
__do_fast_syscall_32+0x40/0x70
do_fast_syscall_32+0x29/0x60
do_SYSENTER_32+0x15/0x20
entry_SYSENTER_32+0x9f/0xf2
Fixes: 5517554e4313 ("cifs: Add support for writing attributes on SMB2+")
Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.
Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA not set: he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.
  CPU0 (khugepaged)                         CPU1 (filemap_fault)

  collapse_file(old start)
    new_page = khugepaged_alloc_page(hpage)
    __SetPageLocked(new_page)
    new_page->index = start // hpage->index=old offset
    new_page->mapping = mapping
    xas_store(&xas, new_page)

                                            filemap_fault
                                              page = find_get_page(mapping, offset)
                                              // if offset falls inside hpage then
                                              // compound_head(page) == hpage
                                              lock_page_maybe_drop_mmap()
                                                __lock_page(page)

    // collapse fails
    xas_store(&xas, old page)
    new_page->mapping = NULL
    unlock_page(new_page)

  collapse_file(new start)
    new_page = khugepaged_alloc_page(hpage)
    __SetPageLocked(new_page)
    new_page->index = start // hpage->index=new offset
    new_page->mapping = mapping // mapping becomes valid again

                                              // since compound_head(page) == hpage
                                              // page_to_pgoff(page) got changed
                                              VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race that Suren illustrates above. But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.
Instead, have non-NUMA khugepaged_prealloc_page() release the old page
if anyone else has a reference to it (1% of cases when I tested).
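A sketch of the non-NUMA prealloc change (simplified; the exact kernel code
may differ):

  /* Non-NUMA khugepaged keeps one preallocated huge page for reuse. If
   * anyone else still holds a reference -- e.g. a racing filemap_fault()
   * that did find_get_page() on it -- don't re-expose the same page to
   * the page cache: drop it and allocate a fresh one next time. */
  if (*hpage && page_count(*hpage) > 1) {
      put_page(*hpage);
      *hpage = NULL;
  }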
Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v4.9+
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|