aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2013-07-10Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+2
Pull perf fixes from Ingo Molnar: "Two small fixlets" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix interrupt handler timing harness perf/x86/amd: Do not print an error when the device is not present
2013-07-10Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linuxLinus Torvalds2-35/+44
Pull module updates from Rusty Russell: "Nothing interesting. Except the most embarrassing bugfix ever. But let's ignore that" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: cleanup call chain. module: do percpu allocation after uniqueness check. No, really! modules: don't fail to load on unknown parameters. ABI: Clarify when /sys/module/MODULENAME is created There is no /sys/parameters module: don't modify argument of module_kallsyms_lookup_name()
2013-07-10Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/urgentIngo Molnar3-61/+71
Pull nohz updates/fixes from Frederic Weisbecker: ' Note that "watchdog: Boot-disable by default on full dynticks" is a temporary solution to solve the issue with the watchdog that prevents the tick from stopping. This is to make sure that 3.11 doesn't have that problem as several people complained about it. A proper and longer term solution has been proposed by Peterz: http://lkml.kernel.org/r/20130618103632.GO3204@twins.programming.kicks-ass.net ' Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-1/+0
Pull networking updates from David Miller: "This is a re-do of the net-next pull request for the current merge window. The only difference from the one I made the other day is that this has Eliezer's interface renames and the timeout handling changes made based upon your feedback, as well as a few bug fixes that have trickeled in. Highlights: 1) Low latency device polling, eliminating the cost of interrupt handling and context switches. Allows direct polling of a network device from socket operations, such as recvmsg() and poll(). Currently ixgbe, mlx4, and bnx2x support this feature. Full high level description, performance numbers, and design in commit 0a4db187a999 ("Merge branch 'll_poll'") From Eliezer Tamir. 2) With the routing cache removed, ip_check_mc_rcu() gets exercised more than ever before in the case where we have lots of multicast addresses. Use a hash table instead of a simple linked list, from Eric Dumazet. 3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski, Marek Puzyniak, Michal Kazior, and Sujith Manoharan. 4) Support reporting the TUN device persist flag to userspace, from Pavel Emelyanov. 5) Allow controlling network device VF link state using netlink, from Rony Efraim. 6) Support GRE tunneling in openvswitch, from Pravin B Shelar. 7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from Daniel Borkmann and Eric Dumazet. 8) Allow controlling of TCP quickack behavior on a per-route basis, from Cong Wang. 9) Several bug fixes and improvements to vxlan from Stephen Hemminger, Pravin B Shelar, and Mike Rapoport. In particular, support receiving on multiple UDP ports. 10) Major cleanups, particular in the area of debugging and cookie lifetime handline, to the SCTP protocol code. From Daniel Borkmann. 11) Allow packets to cross network namespaces when traversing tunnel devices. From Nicolas Dichtel. 12) Allow monitoring netlink traffic via AF_PACKET sockets, in a manner akin to how we monitor real network traffic via ptype_all. From Daniel Borkmann. 13) Several bug fixes and improvements for the new alx device driver, from Johannes Berg. 14) Fix scalability issues in the netem packet scheduler's time queue, by using an rbtree. From Eric Dumazet. 15) Several bug fixes in TCP loss recovery handling, from Yuchung Cheng. 16) Add support for GSO segmentation of MPLS packets, from Simon Horman. 17) Make network notifiers have a real data type for the opaque pointer that's passed into them. Use this to properly handle network device flag changes in arp_netdev_event(). From Jiri Pirko and Timo Teräs. 18) Convert several drivers over to module_pci_driver(), from Peter Huewe. 19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a O(1) calculation instead. From Eric Dumazet. 20) Support setting of explicit tunnel peer addresses in ipv6, just like ipv4. From Nicolas Dichtel. 21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet. 22) Prevent a single high rate flow from overruning an individual cpu during RX packet processing via selective flow shedding. From Willem de Bruijn. 23) Don't use spinlocks in TCP md5 signing fast paths, from Eric Dumazet. 24) Don't just drop GSO packets which are above the TBF scheduler's burst limit, chop them up so they are in-bounds instead. Also from Eric Dumazet. 25) VLAN offloads are missed when configured on top of a bridge, fix from Vlad Yasevich. 26) Support IPV6 in ping sockets. From Lorenzo Colitti. 27) Receive flow steering targets should be updated at poll() time too, from David Majnemer. 28) Fix several corner case regressions in PMTU/redirect handling due to the routing cache removal, from Timo Teräs. 29) We have to be mindful of ipv4 mapped ipv6 sockets in upd_v6_push_pending_frames(). From Hannes Frederic Sowa. 30) Fix L2TP sequence number handling bugs, from James Chapman." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits) drivers/net: caif: fix wrong rtnl_is_locked() usage drivers/net: enic: release rtnl_lock on error-path vhost-net: fix use-after-free in vhost_net_flush net: mv643xx_eth: do not use port number as platform device id net: sctp: confirm route during forward progress virtio_net: fix race in RX VQ processing virtio: support unlocked queue poll net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit Documentation: Fix references to defunct linux-net@vger.kernel.org net/fs: change busy poll time accounting net: rename low latency sockets functions to busy poll bridge: fix some kernel warning in multicast timer sfc: Fix memory leak when discarding scattered packets sit: fix tunnel update via netlink dt:net:stmmac: Add dt specific phy reset callback support. dt:net:stmmac: Add support to dwmac version 3.610 and 3.710 dt:net:stmmac: Allocate platform data only if its NULL. net:stmmac: fix memleak in the open method ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available net: ipv6: fix wrong ping_v6_sendmsg return value ...
2013-07-09reboot: move arch/x86 reboot= handling to generic kernelRobin Holt1-1/+75
Merge together the unicore32, arm, and x86 reboot= command line parameter handling. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09reboot: checkpatch.pl the new kernel/reboot.c fileRobin Holt1-14/+13
Get the new file to pass scripts/checkpatch.pl Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09reboot: move shutdown/reboot related functions to kernel/reboot.cRobin Holt3-332/+347
This patch is preparatory. It moves reboot related syscall, etc functions from kernel/sys.c to kernel/reboot.c. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09reboot: remove -stable friendly PF_THREAD_BOUND defineRobin Holt1-5/+0
Remove the prior patch's #define for easier backporting to the stable releases. Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ptrace: PTRACE_DETACH should do flush_ptrace_hw_breakpoint(child)Oleg Nesterov1-0/+1
Change ptrace_detach() to call flush_ptrace_hw_breakpoint(child). This frees the slots for non-ptrace PERF_TYPE_BREAKPOINT users, and this ensures that the tracee won't be killed by SIGTRAP triggered by the active breakpoints. Test-case: unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len) { unsigned long dr7; dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE); if (enable) dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE)); return dr7; } int write_dr(int pid, int dr, unsigned long val) { return ptrace(PTRACE_POKEUSER, pid, offsetof (struct user, u_debugreg[dr]), val); } void func(void) { } int main(void) { int pid, stat; unsigned long dr7; pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); kill(getpid(), SIGHUP); func(); return 0x13; } assert(pid == waitpid(-1, &stat, 0)); assert(WSTOPSIG(stat) == SIGHUP); assert(write_dr(pid, 0, (long)func) == 0); dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1); assert(write_dr(pid, 7, dr7) == 0); assert(ptrace(PTRACE_DETACH, pid, 0,0) == 0); assert(pid == waitpid(-1, &stat, 0)); assert(stat == 0x1300); return 0; } Before this patch the child is killed after PTRACE_DETACH. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ptrace: revert "Prepare to fix racy accesses on task breakpoints"Oleg Nesterov2-17/+1
This reverts commit bf26c018490c ("Prepare to fix racy accesses on task breakpoints"). The patch was fine but we can no longer race with SIGKILL after commit 9899d11f6544 ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL"), the __TASK_TRACED tracee can't be woken up and ->ptrace_bps[] can't go away. Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers, we can kill them and remove task->ptrace_bp_refcnt. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Michael Neuling <mikey@neuling.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09panic: add cpu/pid to warn_slowpath_common in WARNING printk()sAlex Thorlton1-2/+3
Add the cpu/pid that called WARN() so that the stack traces can be matched up with the WARNING messages. [akpm@linux-foundation.org: remove stray quote] Signed-off-by: Alex Thorlton <athorlton@sgi.com> Reviewed-by: Robin Holt <holt@sgi.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09audit: Fix decimal constant descriptionMichal Simek1-1/+1
Use proper decimal type for comparison with u32. Compilation warning was introduced by 780a7654 ("audit: Make testing for a valid loginuid explicit.") kernel/auditfilter.c: In function 'audit_data_to_entry': kernel/auditfilter.c:426:3: warning: this decimal constant is unsigned only in ISO C90 [enabled by default] if ((f->type == AUDIT_LOGINUID) && (f->val == 4294967295)) { Signed-off-by: Michal Simek <michal.simek@xilinx.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09kernel/auditfilter.c: fix leak in audit_add_rule() error pathChen Gang1-0/+6
If both 'tree' and 'watch' are valid we must call audit_put_tree(), just like the preceding code within audit_add_rule(). Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09kernel/auditfilter.c: fixing build warningRaphael S. Carvalho1-1/+1
kernel/auditfilter.c:426: warning: this decimal constant is unsigned only in ISO C90 Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names recordJeff Layton2-3/+10
The old audit PATH records for mq_open looked like this: type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777 dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732 dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023 ...with the audit related changes that went into 3.7, they now look like this: type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655 dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926 dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq" Both of these look wrong to me. As Steve Grubb pointed out: "What we need is 1 PATH record that identifies the MQ. The other PATH records probably should not be there." Fix it to record the mq root as a parent, and flag it such that it should be hidden from view when the names are logged, since the root of the mq filesystem isn't terribly interesting. With this change, we get a single PATH record that looks more like this: type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914 dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmpfs_t:s0 In order to do this, a new audit_inode_parent_hidden() function is added. If we do it this way, then we avoid having the existing callers of audit_inode needing to do any sort of flag conversion if auditing is inactive. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reported-by: Jiri Jaburek <jjaburek@redhat.com> Cc: Steve Grubb <sgrubb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-06Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds15-540/+1223
Pull timer core updates from Thomas Gleixner: "The timer changes contain: - posix timer code consolidation and fixes for odd corner cases - sched_clock implementation moved from ARM to core code to avoid duplication by other architectures - alarm timer updates - clocksource and clockevents unregistration facilities - clocksource/events support for new hardware - precise nanoseconds RTC readout (Xen feature) - generic support for Xen suspend/resume oddities - the usual lot of fixes and cleanups all over the place The parts which touch other areas (ARM/XEN) have been coordinated with the relevant maintainers. Though this results in an handful of trivial to solve merge conflicts, which we preferred over nasty cross tree merge dependencies. The patches which have been committed in the last few days are bug fixes plus the posix timer lot. The latter was in akpms queue and next for quite some time; they just got forgotten and Frederic collected them last minute." * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits) hrtimer: Remove unused variable hrtimers: Move SMP function call to thread context clocksource: Reselect clocksource when watchdog validated high-res capability posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting posix_timers: fix racy timer delta caching on task exit posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule() selftests: add basic posix timers selftests posix_cpu_timers: consolidate expired timers check posix_cpu_timers: consolidate timer list cleanups posix_cpu_timer: consolidate expiry time type tick: Sanitize broadcast control logic tick: Prevent uncontrolled switch to oneshot mode tick: Make oneshot broadcast robust vs. CPU offlining x86: xen: Sync the CMOS RTC as well as the Xen wallclock x86: xen: Sync the wallclock when the system time is set timekeeping: Indicate that clock was set in the pvclock gtod notifier timekeeping: Pass flags instead of multiple bools to timekeeping_update() xen: Remove clock_was_set() call in the resume path hrtimers: Support resuming with two or more CPUs online (but stopped) timer: Fix jiffies wrap behavior of round_jiffies_common() ...
2013-07-06Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linuxLinus Torvalds3-401/+186
Pull irqdomain refactoring from Grant Likely: "This is the long awaited simplification of irqdomain. It gets rid of the different types of irq domains and instead both linear and tree mappings can be supported in a single domain. Doing this removes a lot of special case code and makes irq domains simpler to understand overall" * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux: irq: fix checkpatch error irqdomain: Include hwirq number in /proc/interrupts irqdomain: make irq_linear_revmap() a fast path again irqdomain: remove irq_domain_generate_simple() irqdomain: Refactor irq_domain_associate_many() irqdomain: Beef up debugfs output irqdomain: Clean up aftermath of irq_domain refactoring irqdomain: Eliminate revmap type irqdomain: merge linear and tree reverse mappings. irqdomain: Add a name field irqdomain: Replace LEGACY mapping with LINEAR irqdomain: Relax failure path on setting up mappings
2013-07-06hrtimer: Remove unused variableThomas Gleixner1-2/+0
Sigh, should have noticed myself. Reported-by: fengguang.wu@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-05hrtimers: Move SMP function call to thread contextThomas Gleixner1-22/+15
smp_call_function_* must not be called from softirq context. But clock_was_set() which calls on_each_cpu() is called from softirq context to implement a delayed clock_was_set() for the timer interrupt handler. Though that almost never gets invoked. A recent change in the resume code uses the softirq based delayed clock_was_set to support Xens resume mechanism. linux-next contains a new warning which warns if smp_call_function_* is called from softirq context which gets triggered by that Xen change. Fix this by moving the delayed clock_was_set() call to a work context. Reported-and-tested-by: Artem Savkov <artem.savkov@gmail.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com>, Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: John Stultz <john.stultz@linaro.org> Cc: xen-devel@lists.xen.org Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-05genirq: generic chip: Use DIV_ROUND_UP to calculate numchipsAxel Lin1-1/+1
The number of interrupts in a domain may be not divisible by the number of interrupts each chip handles. Integer division may truncate the result, thus use DIV_ROUND_UP to count numchips. Seems all users of irq_alloc_domain_generic_chips() in current code do not have this issue. I just found the issue while reading the code. Signed-off-by: Axel Lin <axel.lin@ingics.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Tony Lindgren <tony@atomide.com> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/1373015592.18252.2.camel@phoenix Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-05clocksource: Reselect clocksource when watchdog validated high-res capabilityThomas Gleixner1-15/+42
Up to commit 5d33b883a (clocksource: Always verify highres capability) we had no sanity check when selecting a clocksource, which prevented that a non highres capable clocksource is used when the system already switched to highres/nohz mode. The new sanity check works as Alex and Tim found out. It prevents the TSC from being used. This happens because on x86 the boot process looks like this: tsc_start_freqency_validation(TSC); clocksource_register(HPET); clocksource_done_booting(); clocksource_select() Selects HPET which is valid for high-res switch_to_highres(); clocksource_register(TSC); TSC is not selected, because it is not yet flagged as VALID_HIGH_RES clocksource_watchdog() Validates TSC for highres, but that does not make TSC the current clocksource. Before the sanity check was added, we installed TSC unvalidated which worked most of the time. If the TSC was really detected as unstable, then the unstable logic removed it and installed HPET again. The sanity check is correct and needed. So the watchdog needs to kick a reselection of the clocksource, when it qualifies TSC as a valid high res clocksource. To solve this, we mark the clocksource which got the flag CLOCK_SOURCE_VALID_FOR_HRES set by the watchdog with an new flag CLOCK_SOURCE_RESELECT and trigger the watchdog thread. The watchdog thread evaluates the flag and invokes clocksource_select() when set. To avoid that the clocksource_done_booting() code, which is about to install the first real clocksource anyway, needs to go through clocksource_select and tick_oneshot_notify() pointlessly, split out the clocksource_watchdog_kthread() list walk code and invoke the select/notify only when called from clocksource_watchdog_kthread(). So clocksource_done_booting() can utilize the same splitout code without the select/notify invocation and the clocksource_mutex unlock/relock dance. Reported-and-tested-by: Alex Shi <alex.shi@intel.com> Cc: Hans Peter Anvin <hpa@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@kernel.org> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307042239150.11637@ionos.tec.linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-05perf: Fix interrupt handler timing harnessStephane Eranian1-2/+2
This patch fixes a serious bug in: 14c63f17b1fd perf: Drop sample rate when sampling is too slow There was an misunderstanding on the API of the do_div() macro. It returns the remainder of the division and this was not what the function expected leading to disabling the interrupt latency watchdog. This patch also remove a duplicate assignment in perf_sample_event_took(). Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: dave.hansen@linux.intel.com Cc: ak@linux.intel.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-04Merge branch 'timers/posix-cpu-timers-for-tglx' ofThomas Gleixner38-527/+691
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core Frederic sayed: "Most of these patches have been hanging around for several month now, in -mmotm for a significant chunk. They already missed a few releases." Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-04Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds1-1/+1
Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...
2013-07-04posix-cpu-timers: don't account cpu timer after stopped thread runtime accountingKOSAKI Motohiro1-3/+36
When tsk->signal->cputimer->running is 1, signal->cputimer (i.e. per process timer account) and tsk->sum_sched_runtime (i.e. per thread timer account) increase at the same pace because update_curr() increases both accounting. However, there is one exception. When thread exiting, __exit_signal() turns over task's sum_shced_runtime to sig->sum_sched_runtime, but it doesn't stop signal->cputimer accounting. This inconsistency makes POSIX timer wake up too early. This patch fixes it. Original-patch-by: Olivier Langlois <olivier@trillion01.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-07-03Merge branch 'akpm' (updates from Andrew Morton)Linus Torvalds10-67/+107
Merge first patch-bomb from Andrew Morton: - various misc bits - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been distracted. There has been quite a bit of activity. - About half the MM queue - Some backlight bits - Various lib/ updates - checkpatch updates - zillions more little rtc patches - ptrace - signals - exec - procfs - rapidio - nbd - aoe - pps - memstick - tools/testing/selftests updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits) tools/testing/selftests: don't assume the x bit is set on scripts selftests: add .gitignore for kcmp selftests: fix clean target in kcmp Makefile selftests: add .gitignore for vm selftests: add hugetlbfstest self-test: fix make clean selftests: exit 1 on failure kernel/resource.c: remove the unneeded assignment in function __find_resource aio: fix wrong comment in aio_complete() drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode drivers/memstick/host/r592.c: convert to module_pci_driver drivers/memstick/host/jmb38x_ms: convert to module_pci_driver pps-gpio: add device-tree binding and support drivers/pps/clients/pps-gpio.c: convert to module_platform_driver drivers/pps/clients/pps-gpio.c: convert to devm_* helpers drivers/parport/share.c: use kzalloc Documentation/accounting/getdelays.c: avoid strncpy in accounting tool aoe: update internal version number to v83 aoe: update copyright date aoe: perform I/O completions in parallel ...
2013-07-03kernel/resource.c: remove the unneeded assignment in function __find_resourceKevin Hao1-1/+0
This line was introduced by fcb11918 ("resources: add arch hook for preventing allocation in reserved areas"). But the struct tmp was already assigned to *new in the above line, so this seems superfluous. Just remove it. Signed-off-by: Kevin Hao <haokexin@gmail.com> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/pid.c: move statementRaphael S. Carvalho1-1/+1
Move statement to static initilization of init_pid_ns. Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/fork.c:copy_process(): consolidate the lockless CLONE_THREAD checksOleg Nesterov1-17/+16
copy_process() does a lot of "chaotic" initializations and checks CLONE_THREAD twice before it takes tasklist. In particular it sets "p->group_leader = p" and then changes it again under tasklist if !thread_group_leader(p). This looks a bit confusing, lets create a single "if (CLONE_THREAD)" block which initializes ->exit_signal, ->group_leader, and ->tgid. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/fork.c:copy_process(): don't add the uninitialized child to thread/task/pid listsOleg Nesterov2-11/+17
copy_process() adds the new child to thread_group/init_task.tasks list and then does attach_pid(child, PIDTYPE_PID). This means that the lockless next_thread() or next_task() can see this thread with the wrong pid. Say, "ls /proc/pid/task" can list the same inode twice. We could move attach_pid(child, PIDTYPE_PID) up, but in this case find_task_by_vpid() can find the new thread before it was fully initialized. And this is already true for PIDTYPE_PGID/PIDTYPE_SID, With this patch copy_process() initializes child->pids[*].pid first, then calls attach_pid() to insert the task into the pid->tasks list. attach_pid() no longer need the "struct pid*" argument, it is always called after pid_link->pid was already set. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader codeOleg Nesterov1-8/+7
Cleanup and preparation for the next changes. Move the "if (clone_flags & CLONE_THREAD)" code down under "if (likely(p->pid))" and turn it into into the "else" branch. This makes the process/thread initialization more symmetrical and removes one check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fork: reorder permissions when violating number of processes limitsEric Paris1-2/+2
When a task is attempting to violate the RLIMIT_NPROC limit we have a check to see if the task is sufficiently priviledged. The check first looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then if the task is uid=0. A result is that tasks which are allowed by the uid=0 check are first checked against the security subsystem. This results in the security subsystem auditting a denial for sys_admin and sys_resource and then the task passing the uid=0 check. This patch rearranges the code to first check uid=0, since if we pass that we shouldn't hit the security system at all. We then check sys_resource, since it is the smallest capability which will solve the problem. Lastly we check the fallback everything cap_sysadmin. We don't want to give this capability many places since it is so powerful. This will eliminate many of the false positive/needless denial messages we get when a root task tries to violate the nproc limit. (note that kthreads count against root, so on a sufficiently large machine we can actually get past the default limits before any userspace tasks are launched.) Signed-off-by: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03exit.c: unexport __set_special_pids()Oleg Nesterov2-12/+12
Move __set_special_pids() from exit.c to sys.c close to its single caller and make it static. And rename it to set_special_pids(), another helper with this name has gone away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03usermodehelper: kill the sub_info->path[0] checkOleg Nesterov1-8/+3
call_usermodehelper_exec() does nothing but returns success if path[0] == 0. The only user which needs this strange feature is request_module(), it can check modprobe_path[0] itself like other users do if they want to detect the "disabled by admin" case. Kill it. Not only it looks strange, it can confuse other callers. And this allows us to revert 264b83c0 ("usermodehelper: check subprocess_info->path != NULL"), do_execve(NULL) is safe. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ptrace: add ability to get/set signal-blocked maskAndrey Vagin1-2/+42
crtools uses a parasite code for dumping processes. The parasite code is injected into a process with help PTRACE_SEIZE. Currently crtools blocks signals from a parasite code. If a process has pending signals, crtools wait while a process handles these signals. This method is not suitable for stopped tasks. A stopped task can have a few pending signals, when we will try to execute a parasite code, we will need to drop SIGSTOP, but all other signals must remain pending, because a state of processes must not be changed during checkpointing. This patch adds two ptrace commands to set/get signal-blocked mask. I think gdb can use this commands too. [akpm@linux-foundation.org: be consistent with brace layout] Signed-off-by: Andrey Vagin <avagin@openvz.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kprobes: handle empty/invalid input to debugfs "enabled" fileMathias Krause1-0/+3
When writing invalid input to 'debug/kprobes/enabled' it'll silently be ignored. Even worse, when writing an empty string to this file, the outcome is purely random as the switch statement will make its decision based on the value of an uninitialized stack variable. Fix this by handling invalid/empty input as error returning -EINVAL. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/sys.c:do_sysinfo(): use get_monotonic_boottime()Oleg Nesterov1-2/+1
Change do_sysinfo() to use get_monotonic_boottime() instead of do_posix_clock_monotonic_gettime() + monotonic_to_bootbased(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03kernel/sys.c: sys_reboot(): fix malformed panic messageliguang1-1/+1
If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will stay on the same line with the next message, so append a '\n'. Signed-off-by: liguang <lig.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03drivers: avoid parsing names as kthread_run() format stringsKees Cook1-1/+1
Calling kthread_run with a single name parameter causes it to be handled as a format string. Many callers are passing potentially dynamic string content, so use "%s" in those cases to avoid any potential accidents. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: use totalram_pages instead of num_physpages at runtimeJiang Liu1-2/+2
The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller12-120/+294
Conflicts: drivers/net/ethernet/freescale/fec_main.c drivers/net/ethernet/renesas/sh_eth.c net/ipv4/gre.c The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list) and the splitting of the gre.c code into seperate files. The FEC conflict was two sets of changes adding ethtool support code in an "!CONFIG_M5272" CPP protected block. Finally the sh_eth.c conflict was between one commit add bits set in the .eesr_err_check mask whilst another commit removed the .tx_error_check member and assignments. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03Merge tag 'pm+acpi-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds10-29/+60
Pull power management and ACPI updates from Rafael Wysocki: "This time the total number of ACPI commits is slightly greater than the number of cpufreq commits, but Viresh Kumar (who works on cpufreq) remains the most active patch submitter. To me, the most significant change is the addition of offline/online device operations to the driver core (with the Greg's blessing) and the related modifications of the ACPI core hotplug code. Next are the freezer updates from Colin Cross that should make the freezing of tasks a bit less heavy weight. We also have a couple of regression fixes, a number of fixes for issues that have not been identified as regressions, two new drivers and a bunch of cleanups all over. Highlights: - Hotplug changes to support graceful hot-removal failures. It sometimes is necessary to fail device hot-removal operations gracefully if they cannot be carried out completely. For example, if memory from a memory module being hot-removed has been allocated for the kernel's own use and cannot be moved elsewhere, it's desirable to fail the hot-removal operation in a graceful way rather than to crash the kernel, but currenty a success or a kernel crash are the only possible outcomes of an attempted memory hot-removal. Needless to say, that is not a very attractive alternative and it had to be addressed. However, in order to make it work for memory, I first had to make it work for CPUs and for this purpose I needed to modify the ACPI processor driver. It's been split into two parts, a resident one handling the low-level initialization/cleanup and a modular one playing the actual driver's role (but it binds to the CPU system device objects rather than to the ACPI device objects representing processors). That's been sort of like a live brain surgery on a patient who's riding a bike. So this is a little scary, but since we found and fixed a couple of regressions it caused to happen during the early linux-next testing (a month ago), nobody has complained. As a bonus we remove some duplicated ACPI hotplug code, because the ACPI-based CPU hotplug is now going to use the common ACPI hotplug code. - Lighter weight freezing of tasks. These changes from Colin Cross and Mandeep Singh Baines are targeted at making the freezing of tasks a bit less heavy weight operation. They reduce the number of tasks woken up every time during the freezing, by using the observation that the freezer simply doesn't need to wake up some of them and wait for them all to call refrigerator(). The time needed for the freezer to decide to report a failure is reduced too. Also reintroduced is the check causing a lockdep warining to trigger when try_to_freeze() is called with locks held (which is generally unsafe and shouldn't happen). - cpufreq updates First off, a commit from Srivatsa S Bhat fixes a resume regression introduced during the 3.10 cycle causing some cpufreq sysfs attributes to return wrong values to user space after resume. The fix is kind of fresh, but also it's pretty obvious once Srivatsa has identified the root cause. Second, we have a new freqdomain_cpus sysfs attribute for the acpi-cpufreq driver to provide information previously available via related_cpus. From Lan Tianyu. Finally, we fix a number of issues, mostly related to the CPUFREQ_POSTCHANGE notifier and cpufreq Kconfig options and clean up some code. The majority of changes from Viresh Kumar with bits from Jacob Shin, Heiko Stübner, Xiaoguang Chen, Ezequiel Garcia, Arnd Bergmann, and Tang Yuantian. - ACPICA update A usual bunch of updates from the ACPICA upstream. During the 3.4 cycle we introduced support for ACPI 5 extended sleep registers, but they are only supposed to be used if the HW-reduced mode bit is set in the FADT flags and the code attempted to use them without checking that bit. That caused suspend/resume regressions to happen on some systems. Fix from Lv Zheng causes those registers to be used only if the HW-reduced mode bit is set. Apart from this some other ACPICA bugs are fixed and code cleanups are made by Bob Moore, Tomasz Nowicki, Lv Zheng, Chao Guan, and Zhang Rui. - cpuidle updates New driver for Xilinx Zynq processors is added by Michal Simek. Multidriver support simplification, addition of some missing kerneldoc comments and Kconfig-related fixes come from Daniel Lezcano. - ACPI power management updates Changes to make suspend/resume work correctly in Xen guests from Konrad Rzeszutek Wilk, sparse warning fix from Fengguang Wu and cleanups and fixes of the ACPI device power state selection routine. - ACPI documentation updates Some previously missing pieces of ACPI documentation are added by Lv Zheng and Aaron Lu (hopefully, that will help people to uderstand how the ACPI subsystem works) and one outdated doc is updated by Hanjun Guo. - Assorted ACPI updates We finally nailed down the IA-64 issue that was the reason for reverting commit 9f29ab11ddbf ("ACPI / scan: do not match drivers against objects having scan handlers"), so we can fix it and move the ACPI scan handler check added to the ACPI video driver back to the core. A mechanism for adding CMOS RTC address space handlers is introduced by Lan Tianyu to allow some EC-related breakage to be fixed on some systems. A spec-compliant implementation of acpi_os_get_timer() is added by Mika Westerberg. The evaluation of _STA is added to do_acpi_find_child() to avoid situations in which a pointer to a disabled device object is returned instead of an enabled one with the same _ADR value. From Jeff Wu. Intel BayTrail PCH (Platform Controller Hub) support is added to the ACPI driver for Intel Low-Power Subsystems (LPSS) and that driver is modified to work around a couple of known BIOS issues. Changes from Mika Westerberg and Heikki Krogerus. The EC driver is fixed by Vasiliy Kulikov to use get_user() and put_user() instead of dereferencing user space pointers blindly. Code cleanups are made by Bjorn Helgaas, Nicholas Mazzuca and Toshi Kani. - Assorted power management updates The "runtime idle" helper routine is changed to take the return values of the callbacks executed by it into account and to call rpm_suspend() if they return 0, which allows us to reduce the overall code bloat a bit (by dropping some code that's not necessary any more after that modification). The runtime PM documentation is updated by Alan Stern (to reflect the "runtime idle" behavior change). New trace points for PM QoS are added by Sahara (<keun-o.park@windriver.com>). PM QoS documentation is updated by Lan Tianyu. Code cleanups are made and minor issues are addressed by Bernie Thompson, Bjorn Helgaas, Julius Werner, and Shuah Khan. - devfreq updates New driver for the Exynos5-bus device from Abhilash Kesavan. Minor cleanups, fixes and MAINTAINERS update from MyungJoo Ham, Abhilash Kesavan, Paul Bolle, Rajagopal Venkat, and Wei Yongjun. - OMAP power management updates Adaptive Voltage Scaling (AVS) SmartReflex voltage control driver updates from Andrii Tseglytskyi and Nishanth Menon." * tag 'pm+acpi-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (162 commits) cpufreq: Fix cpufreq regression after suspend/resume ACPI / PM: Fix possible NULL pointer deref in acpi_pm_device_sleep_state() PM / Sleep: Warn about system time after resume with pm_trace cpufreq: don't leave stale policy pointer in cdbs->cur_policy acpi-cpufreq: Add new sysfs attribute freqdomain_cpus cpufreq: make sure frequency transitions are serialized ACPI: implement acpi_os_get_timer() according the spec ACPI / EC: Add HP Folio 13 to ec_dmi_table in order to skip DSDT scan ACPI: Add CMOS RTC Operation Region handler support ACPI / processor: Drop unused variable from processor_perflib.c cpufreq: tegra: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: s3c64xx: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: omap: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: imx6q: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: exynos: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: dbx500: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: davinci: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: arm-big-little: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: powernow-k8: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: pcc: call CPUFREQ_POSTCHANGE notfier in error cases ...
2013-07-03posix_timers: fix racy timer delta caching on task exitFrederic Weisbecker1-11/+11
When a task exits, we perform a caching of the remaining cputime delta before expiring of its timers. This is done from the following places: * When the task is reaped. We iterate through its list of posix cpu timers and store the remaining timer delta to the timer struct instead of the absolute value. (See posix_cpu_timers_exit() / posix_cpu_timers_exit_group() ) * When we call posix_cpu_timer_get() or posix_cpu_timer_schedule(). If the timer's task is considered dying when watched from these places, the same conversion from absolute to relative expiry time is performed. Then the given task's reference is released. (See clear_dead_task() ). The relevance of this caching is questionable but this is another and deeper debate. The big issue here is that these two sources of caching don't mix up very well together. More specifically, the caching can easily be done twice, resulting in a wrong delta as it gets spuriously substracted a second time by the elapsed clock. This can happen in the following scenario: 1) The task exits and gets reaped: we call posix_cpu_timers_exit() and the absolute timer expiry values are converted to a relative delta. 2) timer_gettime() -> posix_cpu_timer_get() is called and relies on clear_dead_task() because tsk->exit_state == EXIT_DEAD. The delta gets substracted again by the elapsed clock and we return a wrong result. To fix this, just remove the caching done on task reaping time. It doesn't bring much value on its own. The caching done from posix_cpu_timer_get/schedule is enough. And it would also be hard to get it really right: we could make it put and clear the target task in the timer struct so that readers know if they are dealing with a relative cached of absolute value. But it would be racy. The only safe way to do it would be to lock the itimer->it_lock so that we know nobody reads the cputime expiry value while we modify it and its target task reference. Doing so would involve some funny workarounds to avoid circular lock against the sighand lock. There is just no reason to maintain this. The user visible effect of this patch can be observed by running the following code: it creates a subthread that launches a posix cputimer which expires after 10 seconds. But then the subthread only busy loops for 2 seconds and exits. The parent reaps the subthread and read the timer value. Its expected value should the be the initial timer's expiration value minus the cputime elapsed in the subthread. Roughly 10 - 2 = 8 seconds: #include <sys/time.h> #include <stdio.h> #include <unistd.h> #include <time.h> #include <pthread.h> static timer_t id; static struct itimerspec val = { .it_value.tv_sec = 10, }, new; static void *thread(void *unused) { int err; struct timeval start, end, diff; timer_create(CLOCK_THREAD_CPUTIME_ID, NULL, &id); if (err < 0) { perror("Can't create timer\n"); return NULL; } /* Arm 10 sec timer */ err = timer_settime(id, 0, &val, NULL); if (err < 0) { perror("Can't set timer\n"); return NULL; } /* Exit after 2 seconds of execution */ gettimeofday(&start, NULL); do { gettimeofday(&end, NULL); timersub(&end, &start, &diff); } while (diff.tv_sec < 2); return NULL; } int main(int argc, char **argv) { pthread_t pthread; int err; err = pthread_create(&pthread, NULL, thread, NULL); if (err) { perror("Can't create thread\n"); return -1; } pthread_join(pthread, NULL); /* Just wait a little bit to make sure the child got reaped */ sleep(1); err = timer_gettime(id, &new); if (err) perror("Can't get timer value\n"); printf("%d %ld\n", new.it_value.tv_sec, new.it_value.tv_nsec); return 0; } Before the patch: $ ./posix_cpu_timers 6 2278074 After the patch: $ ./posix_cpu_timers 8 1158766 Before the patch, the elapsed time got two more seconds spuriously accounted. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-07-03posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()Frederic Weisbecker1-0/+1
In order to re-arm a timer after it fired, we take a sample of the current process or thread cputime. If the task is dying though, we don't arm anything but we cache the remaining timer expiration delta for further reads. Something similar is performed in posix_cpu_timer_get() but here we forget to take the process wide cputime sample before caching it. As a result we are storing random stack content, leading every further reads of that timer to return junk values. Fix this by taking the appropriate sample in the case of process wide timers. This probably doesn't matter much in practice because, at this stage, the thread is the last one in the group and we reached exit_notify(). This implies that we called exit_itimers() and there should be no more timers to handle for that task. So this is likely dead code anyway but let's fix the current logic and the warning that came along: kernel/posix-cpu-timers.c: In function 'posix_cpu_timer_schedule': kernel/posix-cpu-timers.c:1127: warning: 'now' may be used uninitialized in this function Then we can start to think further about cleaning up that code. Reported-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Chen Gang <gang.chen@asianux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-07-03posix_cpu_timers: consolidate expired timers checkFrederic Weisbecker1-85/+33
Consolidate the common code amongst per thread and per process timers list on tick time. List traversal, expiry check and subsequent updates can be shared in a common helper. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-07-03posix_cpu_timers: consolidate timer list cleanupsFrederic Weisbecker1-29/+19
Cleaning up the posix cpu timers on task exit shares some common code among timer list types, most notably the list traversal and expiry time update. Unify this in a common helper. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-07-03posix_cpu_timer: consolidate expiry time typeFrederic Weisbecker1-160/+106
The posix cpu timer expiry time is stored in a union of two types: a 64 bits field if we rely on scheduler precise accounting, or a cputime_t if we rely on jiffies. This results in quite some duplicate code and special cases to handle the two types. Just unify this into a single 64 bits field. cputime_t can always fit into it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Olivier Langlois <olivier@trillion01.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-07-02Merge branch 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-191/+287
Pull cpuset changes from Tejun Heo: "cpuset has always been rather odd about its configurations - a cgroup right after creation didn't allow any task executions before configuration, changing configuration in the parent modifies the descendants irreversibly and so on. These behaviors are inherently nasty and almost hostile against sharing the hierarchy with other controllers making it very difficult to use in unified hierarchy. Li is currently in the process of updating the behaviors for __DEVEL__sane_behavior which is the bulk of changes in this pull request. It isn't complete yet and the behaviors will change further but all changes are gated behind sane_behavior. In the process, the rather hairy work-item punting which was used to work around the limitations of cgroup descendant iterator was simplified." * 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: rename @cont to @cgrp cpuset: fix to migrate mm correctly in a corner case cpuset: allow to move tasks to empty cpusets cpuset: allow to keep tasks in empty cpusets cpuset: introduce effective_{cpumask|nodemask}_cpuset() cpuset: record old_mems_allowed in struct cpuset cpuset: remove async hotplug propagation work cpuset: let hotplug propagation work wait for task attaching cpuset: re-structure update_cpumask() a bit cpuset: remove cpuset_test_cpumask() cpuset: remove unnecessary variable in cpuset_attach() cpuset: cleanup guarantee_online_{cpus|mems}() cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
2013-07-02Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-656/+880
Pull cgroup changes from Tejun Heo: "This pull request contains the following changes. - cgroup_subsys_state (css) reference counting has been converted to percpu-ref. css is what each resource controller embeds into its own control structure and perform reference count against. It may be used in hot paths of various subsystems and is similar to module refcnt in that aspect. For example, block-cgroup's css refcnting was showing up a lot in Mikulaus's device-mapper scalability work and this should alleviate it. - cgroup subtree iterator has been updated so that RCU read lock can be released after grabbing reference. This allows simplifying its users which requires blocking which used to build iteration list under RCU read lock and then traverse it outside. This pull request contains simplification of cgroup core and device-cgroup. A separate pull request will update cpuset. - Fixes for various bugs including corner race conditions and RCU usage bugs. - A lot of cleanups and some prepartory work for the planned unified hierarchy support." * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (48 commits) cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts() cgroup: always use RCU accessors for protected accesses cgroup: fix RCU accesses around task->cgroups cgroup: fix RCU accesses to task->cgroups cgroup: grab cgroup_mutex in drop_parsed_module_refcounts() cgroup: fix cgroupfs_root early destruction path cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy cgroup: implement for_each_[builtin_]subsys() cgroup: move init_css_set initialization inside cgroup_mutex cgroup: s/for_each_subsys()/for_each_root_subsys()/ cgroup: clean up find_css_set() and friends cgroup: remove cgroup->actual_subsys_mask cgroup: prefix global variables with "cgroup_" cgroup: convert CFTYPE_* flags to enums cgroup: rename cont to cgrp cgroup: clean up cgroup_serial_nr_cursor cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre() cgroup: make serial_nr_cursor available throughout cgroup.c ...
2013-07-02Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2-1/+45
Pull workqueue changes from Tejun Heo: "Surprisingly, Lai and I didn't break too many things implementing custom pools and stuff last time around and there aren't any follow-up changes necessary at this point. The only change in this pull request is Viresh's patches to make some per-cpu workqueues to behave as unbound workqueues dependent on a boot param whose default can be configured via a config option. This leads to higher processing overhead / lower bandwidth as more work items are bounced across CPUs; however, it can lead to noticeable powersave in certain configurations - ~10% w/ idlish constant workload on a big.LITTLE configuration according to Viresh. This is because per-cpu workqueues interfere with how the scheduler perceives whether or not each CPU is idle by forcing pinned tasks on them, which makes the scheduler's power-aware scheduling decisions less effective. Its effectiveness is likely less pronounced on homogenous configurations and this type of optimization can probably be made automatic; however, the changes are pretty minimal and the affected workqueues are clearly marked, so it's an easy gain for some configurations for the time being with pretty unintrusive changes." * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: fbcon: queue work on power efficient wq block: queue work on power efficient wq PHYLIB: queue work on system_power_efficient_wq workqueue: Add system wide power_efficient workqueues workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues