Age | Commit message (Collapse) | Author | Files | Lines |
|
During fstrim, if one of multiple write_checkpoint failed, break off and
return error number to caller.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If we preallocate blocks with f2fs_reserve_blocks in f2fs_map_blocks, we
should call f2fs_balance_fs for checking and reclaiming space, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When building each sit entry in cache, firstly, we will load it from
sit page, and then check all entries in sit journal, if there is one
updated entry in journal, cover cached entry with the journaled one.
Actually, most of check operation is unneeded since we only need
to update cached entries with journaled entries in batch, so
changing the flow as below for more efficient:
1. load all sit entries into cache from sit pages;
2. update sit entries with journal.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch changes to check valid block number of one GCed section
directly instead of checking the number in all segments of section
one by one in order to clean up codes of foreground GC.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We don't guarantee integrity of user data after checkpoint, since we only
guarantee meta data integrity for data consistency of filesystem.
Due to above reason, we only need to set fs as dirty when meta data is
updated, so that we can skip writing checkpoint in some case of non-meta
data is updated.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch enables to do fstrim without checkpoint, if there is no fs
change.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch add discard block count to sys entry of f2fs status
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This is to reduce the batch size of fstrim to avoid long latency.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Trace info related to bio cache operation is out of format, clean up it.
Before:
<...>-28308 [002] .... 4781.052703: f2fs_submit_write_bio: dev = (251,1), WRITEWRITE_SYNC ^H, DATA, sector = 271424, size = 126976
<...>-28308 [002] .... 4781.052820: f2fs_submit_page_mbio: dev = (251,1), ino = 103, page_index = 0x1f, oldaddr = 0xffffffff, newaddr = 0x84a7 rw = WRITEWRITE_SYNCi ^H, type = DATA
kworker/u8:2-29988 [001] .... 5549.293877: f2fs_submit_page_mbio: dev = (251,1), ino = 91, page_index = 0xd, oldaddr = 0xffffffff, newaddr = 0x782f rw = WRITE0x0i ^H type = DATA
After:
kworker/u8:2-8678 [000] .... 7945.124459: f2fs_submit_write_bio: dev = (251,1), rw = WRITE_SYNC, DATA, sector = 74080, size = 53248
kworker/u8:2-8678 [000] .... 7945.124551: f2fs_submit_page_mbio: dev = (251,1), ino = 11, page_index = 0xec, oldaddr = 0xffffffff, newaddr = 0x243a, rw = WRITE, type = DATA
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We don't need to keep discard_map, if disk does not support discard command.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
we came across an error as below:
[build_nat_area_bitmap:1710] nid[0x 1718] addr[0x 1c18ddc] ino[0x 1718]
[build_nat_area_bitmap:1710] nid[0x 1719] addr[0x 1c193d5] ino[0x 1719]
[build_nat_area_bitmap:1710] nid[0x 171a] addr[0x 1c1736e] ino[0x 171a]
[build_nat_area_bitmap:1710] nid[0x 171b] addr[0x 58b3ee8f] ino[0x815f92ed]
[build_nat_area_bitmap:1710] nid[0x 171c] addr[0x fcdc94b] ino[0x49366377]
[build_nat_area_bitmap:1710] nid[0x 171d] addr[0x 7cd2facf] ino[0xb3c55300]
[build_nat_area_bitmap:1710] nid[0x 171e] addr[0x bd4e25d0] ino[0x77c34c09]
... ...
[build_nat_area_bitmap:1710] nid[0x 1718] addr[0x 1c18ddc] ino[0x 1718]
[build_nat_area_bitmap:1710] nid[0x 1719] addr[0x 1c193d5] ino[0x 1719]
[build_nat_area_bitmap:1710] nid[0x 171a] addr[0x 1c1736e] ino[0x 171a]
[build_nat_area_bitmap:1710] nid[0x 171b] addr[0x 58b3ee8f] ino[0x815f92ed]
[build_nat_area_bitmap:1710] nid[0x 171c] addr[0x fcdc94b] ino[0x49366377]
[build_nat_area_bitmap:1710] nid[0x 171d] addr[0x 7cd2facf] ino[0xb3c55300]
[build_nat_area_bitmap:1710] nid[0x 171e] addr[0x bd4e25d0] ino[0x77c34c09]
One nat block may be stepped by a data block, so this patch forbid to
write if the blkaddr is illegal
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The address of the iovec &vq->iov[out] is not guaranteed to contain the scsi
command's response iovec throughout the lifetime of the command. Rather, it
is more likely to contain an iovec from an immediately following command
after looping back around to vhost_get_vq_desc(). Pass along the iovec
entirely instead.
Fixes: 79c14141a487 ("vhost/scsi: Convert completion path to use copy_to_iter")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
When running with a local patch which moves the '_stext' symbol to the
very beginning of the kernel text area, I got the following panic with
CONFIG_HARDENED_USERCOPY:
usercopy: kernel memory exposure attempt detected from ffff88103dfff000 (<linear kernel text>) (4096 bytes)
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:79!
invalid opcode: 0000 [#1] SMP
...
CPU: 0 PID: 4800 Comm: cp Not tainted 4.8.0-rc3.after+ #1
Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
task: ffff880817444140 task.stack: ffff880816274000
RIP: 0010:[<ffffffff8121c796>] __check_object_size+0x76/0x413
RSP: 0018:ffff880816277c40 EFLAGS: 00010246
RAX: 000000000000006b RBX: ffff88103dfff000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88081f80dfa8 RDI: ffff88081f80dfa8
RBP: ffff880816277c90 R08: 000000000000054c R09: 0000000000000000
R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000001000
R13: ffff88103e000000 R14: ffff88103dffffff R15: 0000000000000001
FS: 00007fb9d1750800(0000) GS:ffff88081f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000021d2000 CR3: 000000081a08f000 CR4: 00000000001406f0
Stack:
ffff880816277cc8 0000000000010000 000000043de07000 0000000000000000
0000000000001000 ffff880816277e60 0000000000001000 ffff880816277e28
000000000000c000 0000000000001000 ffff880816277ce8 ffffffff8136c3a6
Call Trace:
[<ffffffff8136c3a6>] copy_page_to_iter_iovec+0xa6/0x1c0
[<ffffffff8136e766>] copy_page_to_iter+0x16/0x90
[<ffffffff811970e3>] generic_file_read_iter+0x3e3/0x7c0
[<ffffffffa06a738d>] ? xfs_file_buffered_aio_write+0xad/0x260 [xfs]
[<ffffffff816e6262>] ? down_read+0x12/0x40
[<ffffffffa06a61b1>] xfs_file_buffered_aio_read+0x51/0xc0 [xfs]
[<ffffffffa06a6692>] xfs_file_read_iter+0x62/0xb0 [xfs]
[<ffffffff812224cf>] __vfs_read+0xdf/0x130
[<ffffffff81222c9e>] vfs_read+0x8e/0x140
[<ffffffff81224195>] SyS_read+0x55/0xc0
[<ffffffff81003a47>] do_syscall_64+0x67/0x160
[<ffffffff816e8421>] entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:[<00007fb9d0c33c00>] 0x7fb9d0c33c00
RSP: 002b:00007ffc9c262f28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: fffffffffff8ffff RCX: 00007fb9d0c33c00
RDX: 0000000000010000 RSI: 00000000021c3000 RDI: 0000000000000004
RBP: 00000000021c3000 R08: 0000000000000000 R09: 00007ffc9c264d6c
R10: 00007ffc9c262c50 R11: 0000000000000246 R12: 0000000000010000
R13: 00007ffc9c2630b0 R14: 0000000000000004 R15: 0000000000010000
Code: 81 48 0f 44 d0 48 c7 c6 90 4d a3 81 48 c7 c0 bb b3 a2 81 48 0f 44 f0 4d 89 e1 48 89 d9 48 c7 c7 68 16 a3 81 31 c0 e8 f4 57 f7 ff <0f> 0b 48 8d 90 00 40 00 00 48 39 d3 0f 83 22 01 00 00 48 39 c3
RIP [<ffffffff8121c796>] __check_object_size+0x76/0x413
RSP <ffff880816277c40>
The checked object's range [ffff88103dfff000, ffff88103e000000) is
valid, so there shouldn't have been a BUG. The hardened usercopy code
got confused because the range's ending address is the same as the
kernel's text starting address at 0xffff88103e000000. The overlap check
is slightly off.
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
check_bogus_address() checked for pointer overflow using this expression,
where 'ptr' has type 'const void *':
ptr + n < ptr
Since pointer wraparound is undefined behavior, gcc at -O2 by default
treats it like the following, which would not behave as intended:
(long)n < 0
Fortunately, this doesn't currently happen for kernel code because kernel
code is compiled with -fno-strict-overflow. But the expression should be
fixed anyway to use well-defined integer arithmetic, since it could be
treated differently by different compilers in the future or could be
reported by tools checking for undefined behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
|
|
This is an entirely new driver instead of yet another set of patches
to sb_edac.c because:
1) Mapping from PCI devices to socket/memory controller is significantly
different. Skylake scatters devices on a socket across a number of
PCI buses.
2) There is an extra level of interleaving via the "mcroute" register
that would be a little messy to squeeze into the old driver.
3) Validation is getting too expensive. Changes to sb_edac need to
be checked against Sandy Bridge, Ivy Bridge, Haswell, Broadwell and
Knights Landing.
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When building gccgo in userspace, errno.h gets parsed and the go include file
sysinfo.go is generated.
Since EREFUSED is defined to the same value as ECONNREFUSED, and ECONNREFUSED
is defined later on in errno.h, this leads to go complaining that EREFUSED
isn't defined yet.
Fix this trivial problem by moving the define of EREFUSED down after
ECONNREFUSED in errno.h (and clean up the indenting while touching this line).
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org
|
|
Commit 54b66800907 (parisc: Add native high-resolution sched_clock()
implementation) added support to use the CPU-internal cr16 counters as reliable
clocksource with the help of HAVE_UNSTABLE_SCHED_CLOCK.
Sadly the commit missed to remove the hack which prevented cr16 to become the
default clocksource even on SMP systems.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # 4.7+
|
|
Some module using div_u64() was failing to link because the libgcc 64-bit
divide assist routine was not being exported for modules
Reported-by: avinashp@quantenna.com
Cc: stable@vger.kernel.org
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
The kernel test robot reported a usercopy failure in the new hardened
sanity checks, due to a page-crossing copy of the FPU state into the
task structure.
This happened because the kernel test robot was testing with SLOB, which
doesn't actually do the required book-keeping for slab allocations, and
as a result the hardening code didn't realize that the task struct
allocation was one single allocation - and the sanity checks fail.
Since SLOB doesn't even claim to support hardening (and you really
shouldn't use it), the straightforward solution is to just make the
usercopy hardening code depend on the allocator supporting it.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
| CC mm/memory.o
| In file included from ../mm/memory.c:53:0:
| ../include/linux/pfn_t.h: In function ‘pfn_t_pte’:
| ../include/linux/pfn_t.h:78:2: error: conversion to non-scalar type requested
| return pfn_pte(pfn_t_to_pfn(pfn), pgprot);
With STRICT_MM_TYPECHECKS pte_t is a struct and the offending code
forces a cast which ends up shifting a struct and hence the gcc warning.
Note that in recent past some of the arches (aarch64, s390) made
STRICT_MM_TYPECHECKS default, but we don't for ARC as this leads to slightly
worse generated code, given ARC ABI definition of returning structs
(which pte_t would become)
Quoting from ARC ABI...
"Results of type struct are returned in a caller-supplied temporary
variable whose address is passed in r0.
For such functions, the arguments are shifted so that they are
passed in r1 and up."
So
- struct to be returned would be allocated on stack requiring extra
code at call sites
- callee updates stack memory to facilitate the return (vs. simple
MOV into return reg r0)
Hence STRICT_MM_TYPECHECKS is not enabled by default for ARC
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
| MODPOST 7 modules
| ERROR: "kmap" [fs/ext2/ext2.ko] undefined!
| ../scripts/Makefile.modpost:91: recipe for target '__modpost' failed
Cc: <stable@vger.kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
The syscall ABI includes the gcc functional calling ABI since a syscall
implies userland caller and kernel callee.
The current gcc ABI (v3) for ARCv2 ISA required 64-bit data be passed in
even-odd register pairs, (potentially punching reg holes when passing such
values as args). This was partly driven by the fact that the double-word
LDD/STD instructions in ARCv2 expect the register alignment and thus gcc
forcing this avoids extra MOV at the cost of a few unused register (which we
have plenty anyways).
This however was rejected as part of upstreaming gcc port to HS. So the new
ABI v4 doesn't enforce the even-odd reg restriction.
Do note that for ARCompact ISA builds v3 and v4 are practically the same in
terms of gcc code generation.
In terms of change management, we infer the new ABI if gcc 6.x onwards
is used for building the kernel.
This also needs a stable backport to enable older kernels to work with
new tools/user-space
Cc: <stable@vger.kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
User mode callee regs are explicitly collected before signal delivery or
breakpoint trap. r25 is special for kernel as it serves as task pointer,
so user mode value is clobbered very early. It is saved in pt_regs where
generally only scratch (aka caller saved) regs are saved.
The code to access the corresponding pt_regs location had a subtle bug as
it was using load/store with scaling of offset, whereas the offset was already
byte wise correct. So fix this by replacing LD.AS with a standard LD
Cc: <stable@vger.kernel.org>
Signed-off-by: Liav Rehana <liavr@mellanox.com>
Reviewed-by: Alexey Brodkin <abrodkin@synopsys.com>
[vgupta: rewrote title and commit log]
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
|
|
The drivers that depend on OF but not OF_GPIO are wreaking havoc
with the autobuilders for archs that have all requirements for
OF but not for OF_GPIO, particularly the UM (Usermode) arch does
not have iomem (NO_IOMEM) which result in configuring GPIOLIB but
without OF_GPIO which is wrong if the driver is using the .of_node
of the gpiochip, which only appears with OF_GPIO.
After a brief look at the drivers just depending on OF it seems
most if not all of them actually require stuff from gpiolib-of so
the dependency is wrong in the first place.
This simply patches the Kconfig so that all GPIO drivers using OF
depend on OF_GPIO rather than just OF.
Cc: Rabin Vincent <rabin@rab.in>
Cc: Pramod Gurav <pramod.gurav@smartplayin.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Laxman Dewangan <ldewangan@nvidia.com>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Phil Reid <preid@electromag.com.au>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
The UserMode (UM) Linux build was failing in gpiolib-of as it requires
ioremap()/iounmap() to exist, which is absent from UM. The non-existence
of IO memory is negatively defined as CONFIG_NO_IOMEM which means we
need to depend on HAS_IOMEM.
Cc: stable@vger.kernel.org
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
Thread A Thread B
- inode_lock fileA
- inode_lock fileB
- inode_lock fileA
- inode_lock fileB
We may encounter above potential deadlock during moving file range in
concurrent scenario. This patch fixes the issue by using inode_trylock
instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Only if two input files are regular files, we allow copying data in
range of them, otherwise, deny it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This reverts commit a2ee0a300344a6da76186129b078113354fe13d2.
When testing with generic/032 of xfstest suit, failure message will be
reported as below:
generic/032 8s ... [failed, exit status 1] - output mismatch (see results/generic/032.out.bad)
--- tests/generic/032.out 2015-01-11 16:52:27.643681072 +0800
+++ results/generic/032.out.bad 2016-08-06 13:44:43.861330500 +0800
@@ -1,5 +1,5 @@
QA output created by 032
-100 iterations
-0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
-*
-0100000
+1: [768..775]: unwritten
+Unwritten extents found!
...
(Run 'diff -u tests/generic/032.out results/generic/032.out.bad' to see the entire diff)
Ran: generic/032
Failures: generic/032
Failed 1 of 1 tests
In write_end(), we should update i_size of inode before unlock page,
otherwise, we will lose newly updated data in following race condition.
Thread A Thread B
- write_end
- unlock page
- writepages
- lock_page
- writepage
if page is out-of-range of file size,
we will skip writting the page.
- update i_size
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
LKP reported -36.3% regression of fsmark.files_per_sec due to this patch.
I've confirmed that fxmark [1] has also slight regression for DWAL.
[1] https://github.com/sslab-gatech/fxmark
This reverts commit ec795418c41850056feb956534edf059dc1155d4.
|
|
After Peter's commit:
331b6d8c7afc ("locking/barriers: Validate lockless_dereference() is used on a pointer type")
... we get a lot of sparse warnings (one for every rcu_dereference, and more)
since the expression here is assigning to the wrong address space.
Instead of validating that 'p' is a pointer this way, instead make
it fail compilation when it's not by using sizeof(*(p)). This will
not cause any sparse warnings (tested, likely since the address
space is irrelevant for sizeof), and will fail compilation when
'p' isn't a pointer type.
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 331b6d8c7afc ("locking/barriers: Validate lockless_dereference() is used on a pointer type")
Link: http://lkml.kernel.org/r/1470909022-687-2-git-send-email-johannes@sipsolutions.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit:
fa7d81bb3c269 ("drm/fb-helper: Reduce READ_ONCE(master) to lockless_dereference")
As Peter explained:
[...] lockless_dereference() is _stronger_ than READ_ONCE(), not weaker.
[...]
Also, clue is in the name: 'dereference', you don't actually dereference
the pointer here, only load it.
My next patch breaks the compile without this revert, because it assumes
you want to deference and thus also need the struct type visible (which
it isn't here), so revert it.
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470909022-687-1-git-send-email-johannes@sipsolutions.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When building with 48-bit VAs and 16K page configuration, it's possible
to get the following warning when building the arm64 page table dumping
code:
arch/arm64/mm/dump.c: In function ‘walk_pud’:
arch/arm64/mm/dump.c:274:102: warning: right shift count >= width of type [-Wshift-count-overflow]
This is because pud_offset(pgd, 0) performs a shift to the right by 36
while the value 0 has the type 'int' by default, therefore 32-bit.
This patch modifies all the p*_offset() uses in arch/arm64/mm/dump.c to
use 0UL for the address argument.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Commit:
57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")
... fixed a bug but also triggered a regression:
On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
on host one by one until there is only one left, then observe CPU utilization
via 'top' in the guest, it shows:
100% st for cpu0(housekeeping)
75% st for other CPUs (nohz full mode)
However, w/o this commit it shows the correct 75% for all four CPUs.
When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.
However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() it is
subtracted from is, so the max cputime limit is required to avoid underflow.
This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.
Suggested-by: Rik van Riel <riel@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Mike reports:
Roughly 10% of the time, ltp testcase getrusage04 fails:
getrusage04 0 TINFO : Expected timers granularity is 4000 us
getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
getrusage04 0 TINFO : utime: 0us; stime: 179us
getrusage04 0 TINFO : utime: 3751us; stime: 0us
getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:
And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.
Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org # 4.3+
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The call to smp_call_function_single in perf_event_read() may fail if
an invalid or not online CPU index is passed. Warn user if such bug is
present and return error.
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
At this time the perf_addr_filter_needs_mmap() function will _not_
return true on a user space 'stop' filter. But stop filters need
exactly the same kind of mapping that range and start filters get.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Function perf_event_mmap() is called by the MM subsystem each time
part of a binary is loaded in memory. There can be several mapping
for a binary, many times unrelated to the code section.
Each time a section of a binary is mapped address filters are
updated, event when the map doesn't pertain to the code section.
The end result is that filters are configured based on the last map
event that was received rather than the last mapping of the code
segment.
For example if we have an executable 'main' that calls library
'libcstest.so.1.0', and that we want to collect traces on code
that is in that library. The perf cmd line for this scenario
would be:
perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main
Resulting in binaries being mapped this way:
root@linaro-nano:~# cat /proc/1950/maps
00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
7fa2464000-7fa2474000 rw-p 00000000 00:00 0
7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
root@linaro-nano:~#
Before 'main()' can execute 'libcstest.so.1.0' has to be loaded in
memory. Once that has been done perf_event_mmap() has been called
4 times, with the last map starting at address 0x7fa25ce000 and
the address filter configured to start filtering when the
IP has passed over address 0x0x7fa25ce72c (0x7fa25ce000 + 0x72c).
But that is wrong since the code segment for library 'libcstest.so.1.0'
as been mapped at 0x7fa25bd000, resulting in traces not being
collected.
This patch corrects the situation by requesting that address
filters be updated only if the mapped event is for a code
segment.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Binary file names have to be supplied for both range and start/stop
filters but the current code only processes the filename if an
address range filter is specified. This code adds processing of
the filename for start/stop filters.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Vincent reported triggering the WARN_ON_ONCE() in event_function_local().
While thinking through cases I noticed that by using event_function()
directly, we miss the inactive case usually handled by
event_function_call().
Therefore construct a blend of event_function_call() and
event_function() that handles the cases relevant to
event_function_local().
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # 4.5+
Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Frank reported kernel panic when he disabled several cores in BIOS
via following option:
Core Disable Bitmap(Hex) [0]
with number 0xFFE, which leaves 16 CPUs in system (out of 48).
The kernel panic below goes along with following messages:
smpboot: Max logical packages: 2^M
smpboot: APIC(0) Converting physical 0 to logical package 0^M
smpboot: APIC(20) Converting physical 1 to logical package 1^M
smpboot: APIC(40) Package 2 exceeds logical package map^M
smpboot: CPU 8 APICId 40 disabled^M
smpboot: APIC(60) Package 3 exceeds logical package map^M
smpboot: CPU 12 APICId 60 disabled^M
...
general protection fault: 0000 [#1] SMP^M
Modules linked in:^M
CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
RIP: 0010:[<ffffffff81014d54>] [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
...
[<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
[<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
[<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
[<ffffffff81002190>] do_one_initcall+0x50/0x190^M
[<ffffffff810ab193>] ? parse_args+0x293/0x480^M
[<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
[<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
[<ffffffff816dc19e>] kernel_init+0xe/0x110^M
[<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
[<ffffffff816dc190>] ? rest_init+0x80/0x80^M
The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).
The __max_logical_packages is computed as:
DIV_ROUND_UP(total_cpus, ncpus);
- ncpus being number of cores
With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).
Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.
The fix is to remove logical_package_map bitmap completely
and keep and update the logical_packages number instead.
After we enumerate all the present CPUs, we check if the
enumerated logical packages count is within its computed
maximum from BIOS data.
If it's not the case, we set this maximum to the new enumerated
value and freeze any new addition of logical packages.
The freeze is because lot of init code like uncore/rapl/cqm
depends on having maximum logical package value set to allocate
their data, so we can't change it later on.
Prarit Bhargava tested the patch and confirms that it solves
the problem:
From dmidecode:
Core Count: 24
Core Enabled: 24
Thread Count: 48
Orig kernel boot log:
[ 0.464981] smpboot: Max logical packages: 19
[ 0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
[ 0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
[ 0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3
1. nr_cpus=8, should stop enumerating in package 0:
[ 0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.539596] smpboot: Max logical packages: 19
2. max_cpus=8, should still enumerate all packages:
[ 0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
[ 0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
[ 0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
[ 0.550524] smpboot: Max logical packages: 19
3. nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
package 2:
[ 0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
[ 0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
[ 0.539368] smpboot: Max logical packages: 19
4. maxcpus=49, should still enumerate all packages:
[ 0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
[ 0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
[ 0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
[ 0.549624] smpboot: Max logical packages: 19
5. kdump (nr_cpus=1) works as well.
Reported-by: Frank Ramsay <framsay@redhat.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160815101700.GA30090@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Similar to:
efaad554b4ff ("x86/microcode/intel: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y")
... fix microcode loading from the initrd on AMD by adding the
randomization offset to the microcode patch container within the initrd.
Reported-and-tested-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-tip-commits@vger.kernel.org
Link: http://lkml.kernel.org/r/20160817113314.GA19221@nazgul.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
__replace_page() wronlgy calls mem_cgroup_cancel_charge() in "success" path,
it should only do this if page_check_address() fails.
This means that every enable/disable leads to unbalanced mem_cgroup_uncharge()
from put_page(old_page), it is trivial to underflow the page_counter->count
and trigger OOM.
Reported-and-tested-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: stable@vger.kernel.org # 3.17+
Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The act_police uses its own code to walk the
action hashtable, which leads to that we could
not flush standalone tc police actions, so just
switch to tcf_generic_walker() like other actions.
(Joint work from Roman and Cong.)
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jamal reported a crash when we create a police action
with a specific index, this is because the init logic
is not correct, we should always create one for this
case. Just unify the logic with other tc actions.
Fixes: a03e6fe56971 ("act_police: fix a crash during removal")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As pointed out by Jamal, an action could be shared by
multiple filters, so we can't use list to chain them
any more after we get rid of the original tc_action.
Instead, we could just save pointers to these actions
in tcf_exts, since they are refcount'ed, so convert
the list to an array of pointers.
The "ugly" part is the action API still accepts list
as a parameter, I just introduce a helper function to
convert the array of pointers to a list, instead of
relying on the C99 feature to iterate the array.
Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
struct tcf_exts belongs to filters, should not be visible
to plain tc actions.
Cc: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is harmless because all users pass 'a' to this macro.
Fixes: 00175aec941e ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
Cc: Amir Vadai <amir@vadai.me>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This list_del() for tc action is not needed actually,
because we only use this list to chain bulk operations,
therefore should not be carried for latter operations.
Fixes: ec0595cc4495 ("net_sched: get rid of struct tcf_common")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After refactoring tc_action into tcf_common, we no
longer need to cleanup temporary "actions" in list,
they are permanently stored in the hashtable.
Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|