aboutsummaryrefslogtreecommitdiffstats
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2016-11-25perf record: Fix segfault when running with suid and kptr_restrict is 1Wang Nan1-1/+1
Before this patch perf panics if kptr_restrict is set to 1 and perf is owned by root with suid set: $ whoami wangnan $ ls -l ./perf -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf $ cat /proc/sys/kernel/kptr_restrict 1 $ cat /proc/sys/kernel/perf_event_paranoid -1 $ ./perf record -a Segmentation fault (core dumped) $ The reason is that perf assumes it is allowed to read kptr from /proc/kallsyms when euid is root, but in fact the kernel doesn't allow reading kptr when euid and uid do not match with each other: $ cp /bin/cat . $ sudo chown root:root ./cat $ sudo chmod u+s ./cat $ cat /proc/kallsyms | grep do_fork 0000000000000000 T _do_fork <--- kptr is hidden even euid is root $ sudo cat /proc/kallsyms | grep do_fork ffffffff81080230 T _do_fork See lib/vsprintf.c for kernel side code. This patch fixes this problem by checking both uid and euid. Signed-off-by: Wang Nan <wangnan0@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: He Kuang <hekuang@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/20161115040617.69788-3-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf tools: Fix kernel version error in ubuntuWang Nan1-2/+53
On ubuntu the internal kernel version code is different from what can be retrived from uname: $ uname -r 4.4.0-47-generic $ cat /lib/modules/`uname -r`/build/include/generated/uapi/linux/version.h #define LINUX_VERSION_CODE 263192 #define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c)) $ cat /lib/modules/`uname -r`/build/include/generated/utsrelease.h #define UTS_RELEASE "4.4.0-47-generic" #define UTS_UBUNTU_RELEASE_ABI 47 $ cat /proc/version_signature Ubuntu 4.4.0-47.68-generic 4.4.24 The macro LINUX_VERSION_CODE is set to 4.4.24 (263192 == 0x40418), but `uname -r` reports 4.4.0. This mismatch causes LINUX_VERSION_CODE macro passed to BPF script become an incorrect value, results in magic failure in BPF loading: $ sudo ./buildperf/perf record -e ./tools/perf/tests/bpf-script-example.c ls event syntax error: './tools/perf/tests/bpf-script-example.c' \___ Failed to load program for unknown reason According to Ubuntu document (https://wiki.ubuntu.com/Kernel/FAQ), the correct kernel version can be retrived through /proc/version_signature, which is ubuntu specific. This patch checks the existance of /proc/version_signature, and returns version number through parsing this file instead of uname. Version string is untouched (value returns from uname) because `uname -r` is required to be consistence with path of kbuild directory in /lib/module. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: He Kuang <hekuang@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/20161115040617.69788-2-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf sched timehist: Enlarge max stack depth by 2Namhyung Kim1-1/+1
When it records callchains, they will always have 2 scheduler functions (__schedule + schedule or __schedule + preempt_schedule) and get ignored. So it should collect 2 more functions to show the expected number of callchains to user. Committer Notes: Example of final result, using the same perf.data file as in the previous cset comment, but this time redirecting the output of 'perf sched timehist' to a file instead of copy'n'pasting from xterm: [root@jouet experimental]# perf sched timehist > /tmp/bla [root@jouet experimental]# cat /tmp/bla time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) -------- ---- -------------------- ------ ------ ----- 6.494998 [01] <idle> 0.000 0.000 0.000 6.495027 [02] perf[519] 0.000 0.000 0.000 schedule_hrtimeout_range_clock <- schedule_hrtimeout_range <- poll_schedule_timeout <- do_sys_poll <- sys_poll 6.495096 [03] <idle> 0.000 0.000 0.000 6.495100 [03] rcuos/0[9] 0.000 0.005 0.003 rcu_nocb_kthread <- kthread <- ret_from_fork 6.495113 [01] perf[520] 0.000 0.008 0.114 preempt_schedule_common <- _cond_resched <- wait_for_completion <- stop_one_cpu <- sched_exec <- do_execveat_common.isra.35 6.495121 [00] <idle> 0.000 0.000 0.000 6.495129 [01] migration/1[17] 0.000 0.003 0.016 smpboot_thread_fn <- kthread <- ret_from_fork 6.496085 [02] <idle> 0.000 0.000 1.057 6.496096 [02] kworker/u16:1[31169] 0.000 0.004 0.011 worker_thread <- kthread <- ret_from_fork 6.496096 [03] <idle> 0.003 0.000 0.996 6.496169 [02] <idle> 0.011 0.000 0.072 6.496171 [00] ls[520] 0.008 0.000 1.049 do_exit <- do_group_exit <- [unknown] <- entry_SYSCALL_64_fastpath 6.496172 [03] gnome-terminal-[4391] 0.000 0.003 0.076 schedule_hrtimeout_range_clock <- schedule_hrtimeout_range <- poll_schedule_timeout <- do_sys_poll <- sys_poll Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161124011114.7102-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf sched timehist: Mark schedule function in callchainsNamhyung Kim2-0/+22
The sched_switch event always captured from the scheduler function. So it'd be great omit them from the callchain. This patch marks the functions to be omitted by later patch. Committer notes: Testing it: Before: [root@jouet experimental]# perf sched record -g ls Dockerfile perf.data x-mips64 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.355 MB perf.data (29 samples) ] [root@jouet experimental]# perf sched timehist time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) ----------- ----- ----------------- ------ ------ ------ 6.494998 [001] <idle> 0.000 0.000 0.000 6.495027 [002] perf[519] 0.000 0.000 0.000 __schedule <- schedule <- schedule_hrtimeout_range_clock <- schedule_hrtimeou 6.495096 [003] <idle> 0.000 0.000 0.000 6.495100 [003] rcuos/0[9] 0.000 0.005 0.003 __schedule <- schedule <- rcu_nocb_kthread <- kthread <- ret_from_fork 6.495113 [001] perf[520] 0.000 0.008 0.114 __schedule <- preempt_schedule_common <- _cond_resched <- wait_for_completion 6.495121 [000] <idle> 0.000 0.000 0.000 6.495129 [001] migration/1[17] 0.000 0.003 0.016 __schedule <- schedule <- smpboot_thread_fn <- kthread <- ret_from_fork 6.496085 [002] <idle> 0.000 0.000 1.057 6.496096 [002] kworker/u16:1[31169] 0.000 0.004 0.011 __schedule <- schedule <- worker_thread <- kthread <- ret_from_fork 6.496096 [003] <idle> 0.003 0.000 0.996 6.496169 [002] <idle> 0.011 0.000 0.072 6.496171 [000] ls[520] 0.008 0.000 1.049 __schedule <- schedule <- do_exit <- do_group_exit <- [unknown] 6.496172 [003] gnome-terminal-[4391] 0.000 0.003 0.076 __schedule <- schedule <- schedule_hrtimeout_range_clock <- schedule_hrtimeo After: [root@jouet experimental]# perf sched timehist time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) ----------- ----- ----------------- ----- ----- ------ 6.494998 [001] <idle> 0.000 0.000 0.000 6.495027 [002] perf[519] 0.000 0.000 0.000 schedule_hrtimeout_range_clock <- schedule_hrtimeout_range <- poll_schedule_t 6.495096 [003] <idle> 0.000 0.000 0.000 6.495100 [003] rcuos/0[9] 0.000 0.005 0.003 rcu_nocb_kthread <- kthread <- ret_from_fork 6.495113 [001] perf[520] 0.000 0.008 0.114 preempt_schedule_common <- _cond_resched <- wait_for_completion <- stop_one_c 6.495121 [000] <idle> 0.000 0.000 0.000 6.495129 [001] migration/1[17] 0.000 0.003 0.016 smpboot_thread_fn <- kthread <- ret_from_fork 6.496085 [002] <idle> 0.000 0.000 1.057 6.496096 [002] kworker/u16:1[31169] 0.000 0.004 0.011 worker_thread <- kthread <- ret_from_fork 6.496096 [003] <idle> 0.003 0.000 0.996 6.496169 [002] <idle> 0.011 0.000 0.072 6.496171 [000] ls[520] 0.008 0.000 1.049 do_exit <- do_group_exit <- [unknown] 6.496172 [003] gnome-terminal-[4391] 0.000 0.003 0.076 schedule_hrtimeout_range_clock <- schedule_hrtimeout_range <- poll_schedule_ [root@jouet experimental]# Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161124011114.7102-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf callchain: Add option to skip ignore symbol when printing callchainsNamhyung Kim3-2/+9
For tracepoint events, callchains always contain certain functions. Sometimes it'd be better to skip those functions as they have no value. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161124011114.7102-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf annotate: Initial PowerPC supportRavi Bangoria2-0/+63
Support the PowerPC architecture using the ins_ops association method. Committer notes: Testing it with a perf.data file collected on a PowerPC machine and cross-annotated on a x86_64 workstation, using the associated vmlinux file: $ perf report -i perf.data.f22vm.powerdev --vmlinux vmlinux.powerpc .ktime_get vmlinux.powerpc │ clrldi r9,r28,63 8.57 │ ┌──bne e0 <- TUI cursor positioned here │54:│ lwsync 2.86 │ │ std r2,40(r1) │ │ ld r9,144(r31) │ │ ld r3,136(r31) │ │ ld r30,184(r31) │ │ ld r10,0(r9) │ │ mtctr r10 │ │ ld r2,8(r9) 8.57 │ │→ bctrl │ │ ld r2,40(r1) │ │ ld r10,160(r31) │ │ ld r5,152(r31) │ │ lwz r7,168(r31) │ │ ld r9,176(r31) 8.57 │ │ lwz r6,172(r31) │ │ lwsync 2.86 │ │ lwz r8,128(r31) │ │ cmpw cr7,r8,r28 2.86 │ │↑ bne 48 │ │ subf r10,r10,r3 │ │ mr r3,r29 │ │ and r10,r10,r5 2.86 │ │ mulld r10,r10,r7 │ │ add r9,r10,r9 │ │ srd r9,r9,r6 │ │ add r9,r9,r30 │ │ std r9,0(r29) │ │ addi r1,r1,144 │ │ ld r0,16(r1) │ │ ld r28,-32(r1) │ │ ld r29,-24(r1) │ │ ld r30,-16(r1) │ │ mtlr r0 │ │ ld r31,-8(r1) │ │← blr 5.71 │e0:└─→mr r1,r1 11.43 │ mr r2,r2 11.43 │ lwz r28,128(r31) Press 'h' for help on key bindings $ perf report -i perf.data.f22vm.powerdev --header-only # ======== # captured on: Thu Nov 24 12:40:38 2016 # hostname : pdev-f22-qemu # os release : 4.4.10-200.fc22.ppc64 # perf version : 4.9.rc1.g6298ce # arch : ppc64 # nrcpus online : 48 # nrcpus avail : 48 # cpudesc : POWER7 (architected), altivec supported # cpuid : 74,513 # total memory : 4158976 kB # cmdline : /home/ravi/Workspace/linux/tools/perf/perf record -a # event : name = cycles:ppp, , size = 112, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|CPU|PERIOD, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, task = 1, precise_ip = 3, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1 # HEADER_CPU_TOPOLOGY info available, use -I to display # HEADER_NUMA_TOPOLOGY info available, use -I to display # pmu mappings: cpu = 4, software = 1, tracepoint = 2, breakpoint = 5 # missing features: HEADER_TRACING_DATA HEADER_BRANCH_STACK HEADER_GROUP_DESC HEADER_AUXTRACE HEADER_STAT HEADER_CACHE # ======== # $ Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kim Phillips <kim.phillips@arm.com> Link: http://lkml.kernel.org/n/tip-tbjnp40ddoxxl474uvhwi6g4@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf annotate: Improve support for ARMArnaldo Carvalho de Melo2-96/+60
By using arch->init() to set up some regular expressions to associate ins_ops to ARM instructions, ditching that old table that has instructions not present on ARM. Take advantage of having an arch->init() to hide more arm specific stuff from the common code, like the objdump details. The regular expressions comes from a patch written by Kim Phillips. Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Chris Riyder <chris.ryder@arm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kim Phillips <kim.phillips@arm.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Taeung Song <treeze.taeung@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-77m7lufz9ajjimkrebtg5ead@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf annotate: Allow arches to have a init routine and a priv areaArnaldo Carvalho de Melo1-0/+11
Arches like ARM will want to use regular expressions when deciding what instructions to associate with what ins_ops, provide infrastructure for that. Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Chris Riyder <chris.ryder@arm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kim Phillips <kim.phillips@arm.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Taeung Song <treeze.taeung@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-7dmnk9el2ipu3nxog092k9z5@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf annotate: Introduce alternative method of keeping instructions tableArnaldo Carvalho de Melo1-1/+62
Some arches may want to dynamically populate the table using regular expressions on the instruction names to associate them with a set of parsing/formatting/etc functions (struct ins_ops), so provide a fallback for when the ins__find() method fails. That fall back will be able to resize the arch->instructions, setting arch->nr_instructions appropriately, helper functions to associate an ins_ops to an instruction name, growing the arch->instructions if needed and resorting it are provided, all the arch specific callback needs to do is to decide if the missing instruction should be added to arch->instructions with a ins_ops association. Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Chris Riyder <chris.ryder@arm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kim Phillips <kim.phillips@arm.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Taeung Song <treeze.taeung@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-auu13yradxf7g5dgtpnzt97a@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-25perf annotate: Remove duplicate 'name' field from disasm_lineArnaldo Carvalho de Melo3-57/+51
The disasm_line::name field is always equal to ins::name, being used just to locate the instruction's ins_ops from the per-arch instructions table. Eliminate this duplication, nuking that field and instead make ins__find() return an ins_ops, store it in disasm_line::ins.ops, and keep just in disasm_line::ins.name what was in disasm_line::name, this way we end up not keeping a reference to entries in the per-arch instructions table. This in turn will help supporting multiple ways to manage the per-arch instructions table, allowing resorting that array, for instance, when the entries will move after references to its addresses were made. The same problem is avoided when one grows the array with realloc. So architectures simply keeping a constant array will work as well as architectures building the table using regular expressions or other logic that involves resorting the table. Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Chris Riyder <chris.ryder@arm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kim Phillips <kim.phillips@arm.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Taeung Song <treeze.taeung@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-vr899azvabnw9gtuepuqfd9t@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23tile: avoid using clocksource_cyc2ns with absolute cycle countChris Metcalf1-2/+2
For large values of "mult" and long uptimes, the intermediate result of "cycles * mult" can overflow 64 bits. For example, the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock; we have mult = 853, and after 208.5 days, we overflow 64 bits. Since clocksource_cyc2ns() is intended to be used for relative cycle counts, not absolute cycle counts, performance is more importance than accepting a wider range of cycle values. So, just use mult_frac() directly in tile's sched_clock(). Commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow in sched_clock") by Salman Qazi results in essentially the same generated code for x86 as this change does for tile. In fact, a follow-on change by Salman introduced mult_frac() and switched to using it, so the C code was largely identical at that point too. Peter Zijlstra then added mul_u64_u32_shr() and switched x86 to use it. This is, in principle, better; by optimizing the 64x64->64 multiplies to be 32x32->64 multiplies we can potentially save some time. However, the compiler piplines the 64x64->64 multiplies pretty well, and the conditional branch in the generic mul_u64_u32_shr() causes some bubbles in execution, with the result that it's pretty much a wash. If tilegx provided its own implementation of mul_u64_u32_shr() without the conditional branch, we could potentially save 3 cycles, but that seems like small gain for a fair amount of additional build scaffolding; no other platform currently provides a mul_u64_u32_shr() override, and tile doesn't currently have an <asm/div64.h> header to put the override in. Additionally, gcc currently has an optimization bug that prevents it from recognizing the opportunity to use a 32x32->64 multiply, and so the result would be no better than the existing mult_frac() until such time as the compiler is fixed. For now, just using mult_frac() seems like the right answer. Cc: stable@kernel.org [v3.4+] Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
2016-11-23perf sched timehist: Add -V/--cpu-visual optionDavid Ahern2-2/+47
The -V option provides a visual aid for sched switches by cpu: $ perf sched timehist -V time cpu 0123456789abc task name b/n time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ ------------- -------------------- --------- --------- --------- ... 2412598.429696 [0009] i <idle> 0.000 0.000 0.000 2412598.429767 [0002] s perf[7219] 0.000 0.000 0.000 2412598.429783 [0009] s perf[7220] 0.000 0.006 0.087 2412598.429794 [0010] i <idle> 0.000 0.000 0.000 2412598.429795 [0009] s migration/9[53] 0.000 0.003 0.011 2412598.430370 [0010] s sleep[7220] 0.011 0.000 0.576 2412598.432584 [0003] i <idle> 0.000 0.000 0.000 ... Committer notes: 'i' marks idle time, 's' are scheduler events. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-8-namhyung@kernel.org [ Add documentation based on above commit message ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf sched timehist: Add call graph optionsDavid Ahern2-6/+89
If callchains were recorded they are appended to the line with a default stack depth of 5: 1.874569 [0011] gcc[31949] 0.014 0.000 1.148 wait_for_completion_killable <- do_fork <- sys_vfork <- stub_vfork <- __vfork 1.874591 [0010] gcc[31951] 0.000 0.000 0.024 __cond_resched <- _cond_resched <- wait_for_completion <- stop_one_cpu <- sched_exec 1.874603 [0010] migration/10[59] 3.350 0.004 0.011 smpboot_thread_fn <- kthread <- ret_from_fork 1.874604 [0011] <idle> 1.148 0.000 0.035 cpu_startup_entry <- start_secondary 1.874723 [0005] <idle> 0.016 0.000 1.383 cpu_startup_entry <- start_secondary 1.874746 [0005] gcc[31949] 0.153 0.078 0.022 do_wait sys_wait4 <- system_call_fastpath <- __GI___waitpid --no-call-graph can be used to not show the callchains. --max-stack is used to control the number of frames shown (default of 5). -x/--excl options can be used to collapse redundant callchains to get more relevant data on screen. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-7-namhyung@kernel.org [ Add documentation based on above commit message ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf sched timehist: Add -w/--wakeups optionDavid Ahern2-4/+58
The -w option is to show wakeup events with timehist. $ perf sched timehist -w time cpu task name b/n time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ -------------------- --------- --------- --------- 2412598.429689 [0002] perf[7219] awakened: perf[7220] 2412598.429696 [0009] <idle> 0.000 0.000 0.000 2412598.429767 [0002] perf[7219] 0.000 0.000 0.000 2412598.429780 [0009] perf[7220] awakened: migration/9[53] ... Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-6-namhyung@kernel.org [ Add documentation based on above commit message ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf sched timehist: Add summary optionsDavid Ahern1-6/+160
The -s/--summary option is to show process runtime statistics. And the -S/--with-summary option is to show the stats with the normal output. $ perf sched timehist -s Runtime summary comm parent sched-in run-time min-run avg-run max-run stddev (count) (msec) (msec) (msec) (msec) % --------------------------------------------------------------------------------------------------------- ksoftirqd/0[3] 2 2 0.011 0.004 0.005 0.006 14.87 rcu_preempt[7] 2 11 0.071 0.002 0.006 0.017 20.23 watchdog/0[11] 2 1 0.002 0.002 0.002 0.002 0.00 watchdog/1[12] 2 1 0.004 0.004 0.004 0.004 0.00 ... Terminated tasks: sleep[7220] 7219 3 0.770 0.087 0.256 0.576 62.28 Idle stats: CPU 0 idle for 2352.006 msec CPU 1 idle for 2764.497 msec CPU 2 idle for 2998.229 msec CPU 3 idle for 2967.800 msec Total number of unique tasks: 52 Total number of context switches: 2532 Total run time (msec): 218.036 Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-5-namhyung@kernel.org [ Add documentation from last commit, so that docs comes with the cset that introduces the feature ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf sched timehist: Introduce timehist commandDavid Ahern2-7/+637
'perf sched timehist' provides an analysis of scheduling events. Example usage: perf sched record -- sleep 1 perf sched timehist By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task: time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) -------------- ------ -------------------- --------- --------- --------- 79371.874569 [0011] gcc[31949] 0.014 0.000 1.148 79371.874591 [0010] gcc[31951] 0.000 0.000 0.024 79371.874603 [0010] migration/10[59] 3.350 0.004 0.011 79371.874604 [0011] <idle> 1.148 0.000 0.035 79371.874723 [0005] <idle> 0.016 0.000 1.383 79371.874746 [0005] gcc[31949] 0.153 0.078 0.022 ... Times are in msec.usec. Committer note: Add above explanation as the 'perf sched timehist' entry for 'man perf-sched'. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf evsel: Support printing callchains with arrowsNamhyung Kim2-0/+7
The EVSEL__PRINT_CALLCHAIN_ARROW options can be used to print callchains with arrows for readability. It will be used 'sched timehist' command like below: __schedule <- schedule <- schedule_timeout <- rcu_gp_kthread <- kthread <- ret_from_fork __schedule <- schedule <- schedule_timeout <- rcu_gp_kthread <- kthread <- ret_from_fork __schedule <- schedule <- worker_thread <- kthread <- ret_from_fork Suggested-and-Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf symbols: Print symbol offsets conditionallyNamhyung Kim3-8/+12
The __symbol__fprintf_symname_offs() always shows symbol offsets. So there's no difference between 'perf script -F ip,sym' and 'perf script -F ip,sym,symoff'. I don't think it's a desired behavior.. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20161116060634.28477-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf c2c: Support cascading optionsJiri Olsa1-12/+10
Adding support for cascading options added by Namhyung in: commit 369a2478973a ("tools lib subcmd: Support cascading options") This way the report and record command share options with with c2c command and can save some option duplicates. For now it's the 'v' option. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-7-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf c2c report: Display total HITMs on defaultJiri Olsa2-7/+36
Currently we display the cacheline list sorted on remote HITMs by default. The problem is that they might not be always counted and 'perf c2c report' displays an empty output. Thus it's more convenient to display and sort the cacheline list based on the total of HITMs and have the best change to see data in the default report run. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-6-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf c2c report: Add struct c2c_stats::tot_hitm fieldJiri Olsa2-2/+11
Count total number of HITMs in a special field. This will ease up addition of total HITM sorting into c2c report in the following patch. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-5-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf c2c report: Add -f/--force optionJiri Olsa2-1/+7
Adding -f/--force option to go through ownership validation: $ sudo perf c2c report File perf.data not owned by current user or root (use -f to override) $ $ sudo perf c2c report -f < c2c report output > $ Signed-off-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-4-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf c2c report: Setup browser after opening perf.dataJiri Olsa1-7/+8
Because of the early browser switch we won't get possible error messages, as it will clear the screen right after showing the message, e.g.: Before: $ sudo perf c2c report -d lcl $ After: $ sudo perf c2c report -d lcl File perf.data not owned by current user or root (use -f to override) $ $ ls -la perf.data -rw-------. 1 acme acme 26648 Nov 22 15:11 perf.data $ Signed-off-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-3-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23perf tools: Show event fd in debug outputJiri Olsa1-2/+4
It is useful for debug to see file descriptors for each event. Before: $ perf stat -vvv -e cycles,cache-misses ls ... sys_perf_event_open: pid 12146 cpu -1 group_fd -1 flags 0x8 ... sys_perf_event_open: pid 12146 cpu -1 group_fd 3 flags 0x8 sys_perf_event_open failed, error -13 Now: $ perf stat -vvv -e cycles,cache-misses ls ... sys_perf_event_open: pid 12858 cpu -1 group_fd -1 flags 0x8 = 3 ... sys_perf_event_open: pid 12858 cpu -1 group_fd 3 flags 0x8 sys_perf_event_open failed, error -13 Signed-off-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1479764011-10732-2-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23tools lib traceevent: Add retrieval of preempt count and latency flagsSteven Rostedt2-2/+30
Add a way to retrieve the preempt count as well as the latency flags from a pevent_record. int pevent_data_preempt_count(pevent, record); returns the preempt count of a record. int pevent_data_flags(pevent, record); returns the latency flags for a record. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/20161122113158.03a010a8@gandalf.local.home Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-23tools lib traceevent: Use USECS_PER_SEC instead of hardcoded numberSteven Rostedt2-8/+6
Instead of using 1000000, use the define in time64.h instead. Also remove the the duplicate defines for NSECS_PER_SEC. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@kernel.org> Link: http://lkml.kernel.org/r/20161121114149.67111981@gandalf.local.home Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-11-22NFSv4.x: hide array-bounds warningArnd Bergmann1-1/+1
A correct bugfix introduced a harmless warning that shows up with gcc-7: fs/nfs/callback.c: In function 'nfs_callback_up': fs/nfs/callback.c:214:14: error: array subscript is outside array bounds [-Werror=array-bounds] What happens here is that the 'minorversion == 0' check tells the compiler that we assume minorversion can be something other than 0, but when CONFIG_NFS_V4_1 is disabled that would be invalid and result in an out-of-bounds access. The added check for IS_ENABLED(CONFIG_NFS_V4_1) tells gcc that this really can't happen, which makes the code slightly smaller and also avoids the warning. The bugfix that introduced the warning is marked for stable backports, we want this one backported to the same releases. Fixes: 98b0f80c2396 ("NFSv4.x: Fix a refcount leak in nfs_callback_up_net") Cc: stable@vger.kernel.org # v3.7+ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-22perf/x86/intel/uncore: Allow only a single PMU/box within an events groupPeter Zijlstra1-4/+4
Group validation expects all events to be of the same PMU; however is_uncore_pmu() is too wide, it matches _all_ uncore events, even across PMUs. This triggers failure when we group different events from different uncore PMUs, like: perf stat -vv -e '{uncore_cbox_0/config=0x0334/,uncore_qpi_0/event=1/}' -a sleep 1 Fix is_uncore_pmu() by only matching events to the box at hand. Note that generic code; ran after this step; will disallow this mixture of PMU events. Reported-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vince@deater.net> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20161118125354.GQ3117@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22perf/x86/intel: Cure bogus unwind from PEBS entriesPeter Zijlstra2-13/+24
Vince Weaver reported that perf_fuzzer + KASAN detects that PEBS event unwinds sometimes do 'weird' things. In particular, we seemed to be ending up unwinding from random places on the NMI stack. While it was somewhat expected that the event record BP,SP would not match the interrupt BP,SP in that the interrupt is strictly later than the record event, it was overlooked that it could be on an already overwritten stack. Therefore, don't copy the recorded BP,SP over the interrupted BP,SP when we need stack unwinds. Note that its still possible the unwind doesn't full match the actual event, as its entirely possible to have done an (I)RET between record and interrupt, but on average it should still point in the general direction of where the event came from. Also, it's the best we can do, considering. The particular scenario that triggered the bogus NMI stack unwind was a PEBS event with very short period, upon enabling the event at the tail of the PMI handler (FREEZE_ON_PMI is not used), it instantly triggers a record (while still on the NMI stack) which in turn triggers the next PMI. This then causes back-to-back NMIs and we'll try and unwind the stack-frame from the last NMI, which obviously is now overwritten by our own. Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davej@codemonkey.org.uk <davej@codemonkey.org.uk> Cc: dvyukov@google.com <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: ca037701a025 ("perf, x86: Add PEBS infrastructure") Link: http://lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22perf/x86: Restore TASK_SIZE check on frame pointerJohannes Weiner1-8/+2
The following commit: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual access_ok() check. Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE, whereas the access_ok() uses whatever the current address limit of the task is. We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and then see vmalloc faults when we access what looks like pointers into vmalloc space: [] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290 [] CPU: 3 PID: 3685731 Comm: sh Tainted: G W 4.6.0-5_fbk1_223_gdbf0f40 #1 [] Call Trace: [] <NMI> [<ffffffff814717d1>] dump_stack+0x4d/0x6c [] [<ffffffff81076e43>] __warn+0xd3/0xf0 [] [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20 [] [<ffffffff8104a899>] vmalloc_fault+0x289/0x290 [] [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490 [] [<ffffffff8104b70c>] do_page_fault+0xc/0x10 [] [<ffffffff81794e82>] page_fault+0x22/0x30 [] [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0 [] [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190 [] [<ffffffff811512c7>] perf_callchain+0x67/0x80 [] [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370 [] [<ffffffff8114e840>] perf_event_output+0x20/0x60 [] [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130 [] [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0 [] [<ffffffff8114f484>] perf_event_overflow+0x14/0x20 [] [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0 [] [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20 [] [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0 [] [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20 [] [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50 [] [<ffffffff8101ea31>] nmi_handle+0x61/0x110 [] [<ffffffff8101ef94>] default_do_nmi+0x44/0x110 [] [<ffffffff8101f13b>] do_nmi+0xdb/0x150 [] [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] <<EOE>> <IRQ> [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0 Fix this by moving the valid_user_frame() check to before the uaccess that loads the return address and the pointer to the next frame. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Fixes: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/autogroup: Do not use autogroup->tg in zombie threadsOleg Nesterov3-0/+22
Exactly because for_each_thread() in autogroup_move_group() can't see it and update its ->sched_task_group before _put() and possibly free(). So the exiting task needs another sched_move_task() before exit_notify() and we need to re-introduce the PF_EXITING (or similar) check removed by the previous change for another reason. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Cc: vlovejoy@redhat.com Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()Oleg Nesterov1-11/+12
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove it, but see the next patch. However the comment is correct in that autogroup_move_group() must always change task_group() for every thread so the sysctl_ check is very wrong; we can race with cgroups and even sys_setsid() is not safe because a task running with task_group() == ag->tg must participate in refcounting: int main(void) { int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY); assert(sctl > 0); if (fork()) { wait(NULL); // destroy the child's ag/tg pause(); } assert(pwrite(sctl, "1\n", 2, 0) == 2); assert(setsid() > 0); if (fork()) pause(); kill(getppid(), SIGKILL); sleep(1); // The child has gone, the grandchild runs with kref == 1 assert(pwrite(sctl, "0\n", 2, 0) == 2); assert(setsid() > 0); // runs with the freed ag/tg for (;;) sleep(1); return 0; } crashes the kernel. It doesn't really need sleep(1), it doesn't matter if autogroup_move_group() actually frees the task_group or this happens later. Reported-by: Vern Lovejoy <vlovejoy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22crypto: scatterwalk - Remove unnecessary aliasing check in map_and_copyHerbert Xu1-4/+0
The aliasing check in map_and_copy is no longer necessary because the IPsec ESP code no longer provides an IV that points into the actual request data. As this check is now triggering BUG checks due to the vmalloced stack code, I'm removing it. Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-11-22crypto: algif_hash - Fix result clobbering in recvmsgHerbert Xu1-1/+1
Recently an init call was added to hash_recvmsg so as to reset the hash state in case a sendmsg call was never made. Unfortunately this ended up clobbering the result if the previous sendmsg was done with a MSG_MORE flag. This patch fixes it by excluding that case when we make the init call. Fixes: a8348bca2944 ("algif_hash - Fix NULL hash crash with shash") Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-11-21tcp: zero ca_priv area when switching cc algorithmsFlorian Westphal1-1/+3
We need to zero out the private data area when application switches connection to different algorithm (TCP_CONGESTION setsockopt). When congestion ops get assigned at connect time everything is already zeroed because sk_alloc uses GFP_ZERO flag. But in the setsockopt case this contains whatever previous cc placed there. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmitGao Feng1-1/+1
The tc could return NET_XMIT_CN as one congestion notification, but it does not mean the packe is lost. Other modules like ipvlan, macvlan, and others treat NET_XMIT_CN as success too. So l2tp_eth_dev_xmit should add the NET_XMIT_CN check. Signed-off-by: Gao Feng <gfree.wind@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21NFSv4.1: Keep a reference on lock states while checkingBenjamin Coddington1-3/+15
While walking the list of lock_states, keep a reference on each nfs4_lock_state to be checked, otherwise the lock state could be removed while the check performs TEST_STATEID and possible FREE_STATEID. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-21ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoCPeter Robinson1-1/+1
There's not much point, except compile test, enabling the stmmac platform drivers unless the STM32 SoC is enabled. It's not useful without it. Signed-off-by: Peter Robinson <pbrobinson@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21thermal/powerclamp: add back module device tableJacob Pan1-1/+8
Commit 3105f234e0aba43e44e277c20f9b32ee8add43d4 replaced module cpu id table with a cpu feature check, which is logically correct. But we need the module device table to allow module auto loading. Cc: stable@vger.kernel.org # 4.8 Fixes:3105f234 thermal/powerclamp: correct cpu support check Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2016-11-21perf/core: Fix address filter parserAlexander Shishkin1-0/+2
The token table passed into match_token() must be null-terminated, which it currently is not in the perf's address filter string parser, as caught by Vince's perf_fuzzer and KASAN. It doesn't blow up otherwise because of the alignment padding of the table to the next element in the .rodata, which is luck. Fixing by adding a null-terminator to the token table. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: stable@vger.kernel.org # v4.7+ Fixes: 375637bc524 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/platform/intel-mid: Rename platform_wdt to platform_mrfld_wdtAndy Shevchenko2-2/+2
Rename the watchdog platform library file to explicitly show that is used only on Intel Merrifield platforms. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161118172723.179761-1-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/build: Build compressed x86 kernels as PIE when !CONFIG_RELOCATABLE as wellH.J. Lu1-3/+2
Since the bootloader may load the compressed x86 kernel at any address, it should always be built as PIE, not just when CONFIG_RELOCATABLE=y. Otherwise, linker in binutils 2.27 will optimize GOT load into the absolute address when building the compressed x86 kernel as a non-PIE executable. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org [ Small wording changes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/platform/intel-mid: Register watchdog device after SCUAndy Shevchenko1-5/+27
Watchdog device in Intel Tangier relies on SCU to be present. It uses the SCU IPC channel to send commands and receive responses. If watchdog driver is initialized quite before SCU and a command has been sent the result is always an error like the following: intel_mid_wdt: Error stopping watchdog: 0xffffffed Register watchdog device whne SCU is ready to avoid described issue. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161118165224.175514-1-andriy.shevchenko@linux.intel.com [ Small cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/fpu: Fix invalid FPU ptrace state after execve()Yu-cheng Yu1-8/+8
Robert O'Callahan reported that after an execve PTRACE_GETREGSET NT_X86_XSTATE continues to return the pre-exec register values until the exec'ed task modifies FPU state. The test code is at: https://bugzilla.redhat.com/attachment.cgi?id=1164286. What is happening is fpu__clear() does not properly clear fpstate. Fix it by doing just that. Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: David Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1479402695-6553-1-git-send-email-yu-cheng.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/boot: Fail the boot if !M486 and CPUID is missingAndy Lutomirski1-0/+6
Linux will have all kinds of sporadic problems on systems that don't have the CPUID instruction unless CONFIG_M486=y. In particular, sync_core() will explode. I believe that these kernels had a better chance of working before commit 05fb3c199bb0 ("x86/boot: Initialize FPU and X86_FEATURE_ALWAYS even if we don't have CPUID"). That commit inadvertently fixed a serious bug: we used to fail to detect the FPU if CPUID wasn't present. Because we also used to forget to set X86_FEATURE_ALWAYS, we end up with no cpu feature bits set at all. This meant that alternative patching didn't do anything and, if paravirt was disabled, we could plausibly finish the entire boot process without calling sync_core(). Rather than trying to work around these issues, just have the kernel fail loudly if it's running on a CPUID-less 486, doesn't have CPUID, and doesn't have CONFIG_M486 set. Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/70eac6639f23df8be5fe03fa1984aedd5d40077a.1479598603.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21x86/traps: Ignore high word of regs->cs in early_fixup_exception()Andy Lutomirski1-1/+6
On the 80486 DX, it seems that some exceptions may leave garbage in the high bits of CS. This causes sporadic failures in which early_fixup_exception() refuses to fix up an exception. As far as I can tell, this has been buggy for a long time, but the problem seems to have been exacerbated by commits: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines") This appears to have broken for as long as we've had early exception handling. [ Note to stable maintainers: This patch is needed all the way back to 3.4, but it will only apply to 4.6 and up, as it depends on commit: 0e861fbb5bda ("x86/head: Move early exception panic code into early_fixup_exception()") If you want to backport to kernels before 4.6, please don't backport the prerequisites (there was a big chain of them that rewrote a lot of the early exception machinery); instead, ask me and I can send you a one-liner that will apply. ] Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 4c5023a3fa2e ("x86-32: Handle exception table entries during early boot") Link: http://lkml.kernel.org/r/cb32c69920e58a1a58e7b5cad975038a69c0ce7d.1479609510.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21apparmor: fix change_hat not finding hat after policy replacementJohn Johansen1-2/+4
After a policy replacement, the task cred may be out of date and need to be updated. However change_hat is using the stale profiles from the out of date cred resulting in either: a stale profile being applied or, incorrect failure when searching for a hat profile as it has been migrated to the new parent profile. Fixes: 01e2b670aa898a39259bc85c78e3d74820f4d3b6 (failure to find hat) Fixes: 898127c34ec03291c86f4ff3856d79e9e18952bc (stale policy being applied) Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000287 Cc: stable@vger.kernel.org Signed-off-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2016-11-20Linux 4.9-rc6Linus Torvalds1-1/+1
2016-11-19tipc: eliminate obsolete socket locking policy descriptionJon Paul Maloy1-47/+1
The comment block in socket.c describing the locking policy is obsolete, and does not reflect current reality. We remove it in this commit. Since the current locking policy is much simpler and follows a mainstream approach, we see no need to add a new description. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19rtnl: fix the loop index update error in rtnl_dump_ifinfo()Zhang Shengju1-1/+1
If the link is filtered out, loop index should also be updated. If not, loop index will not be correct. Fixes: dc599f76c22b0 ("net: Add support for filtering link dump by master device and kind") Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>