aboutsummaryrefslogtreecommitdiffstats
path: root/kernel (follow)
AgeCommit message (Collapse)AuthorFilesLines
2012-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-3/+19
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-26futex: avoid wake_futex() for a PI futex_qDarren Hart1-1/+17
Dave Jones reported a bug with futex_lock_pi() that his trinity test exposed. Sometime between queue_me() and taking the q.lock_ptr, the lock_ptr became NULL, resulting in a crash. While futex_wake() is careful to not call wake_futex() on futex_q's with a pi_state or an rt_waiter (which are either waiting for a futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and futex_requeue() do not perform the same test. Update futex_wake_op() and futex_requeue() to test for q.pi_state and q.rt_waiter and abort with -EINVAL if detected. To ensure any future breakage is caught, add a WARN() to wake_futex() if the same condition is true. This fix has seen 3 hours of testing with "trinity -c futex" on an x86_64 VM with 4 CPUS. [akpm@linux-foundation.org: tidy up the WARN()] Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-by: Dave Jones <davej@redat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-26watchdog: using u64 in get_sample_period()Chuansheng Liu1-2/+2
In get_sample_period(), unsigned long is not enough: watchdog_thresh * 2 * (NSEC_PER_SEC / 5) case1: watchdog_thresh is 10 by default, the sample value will be: 0xEE6B2800 case2: set watchdog_thresh is 20, the sample value will be: 0x1 DCD6 5000 In case2, we need use u64 to express the sample period. Otherwise, changing the threshold thru proc often can not be successful. Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-18userns: make each net (net_ns) belong to a user_nsEric W. Biederman1-1/+1
The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-30/+38
Minor line offset auto-merges. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-12Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-19/+22
Pull futex fix from Thomas Gleixner: "Single fix for a long standing futex race when taking over a futex whose owner died. You can end up with two owners, which violates quite some rules." * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Handle futex_pi OWNER_DIED take over correctly
2012-11-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller10-260/+236
Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c Minor conflict between the BCM_CNIC define removal in net-next and a bug fix added to net. Based upon a conflict resolution patch posted by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01time: remove the timecompare code.Richard Cochran2-194/+1
This patch removes the timecompare code from the kernel. The top five reasons to do this are: 1. There are no more users of this code. 2. The original idea was a bit weak. 3. The original author has disappeared. 4. The code was not general purpose but tuned to a particular hardware, 5. There are better ways to accomplish clock synchronization. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Acked-by: John Stultz <john.stultz@linaro.org> Tested-by: Bob Liu <lliubbo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01futex: Handle futex_pi OWNER_DIED take over correctlyThomas Gleixner1-19/+22
Siddhesh analyzed a failure in the take over of pi futexes in case the owner died and provided a workaround. See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076 The detailed problem analysis shows: Futex F is initialized with PTHREAD_PRIO_INHERIT and PTHREAD_MUTEX_ROBUST_NP attributes. T1 lock_futex_pi(F); T2 lock_futex_pi(F); --> T2 blocks on the futex and creates pi_state which is associated to T1. T1 exits --> exit_robust_list() runs --> Futex F userspace value TID field is set to 0 and FUTEX_OWNER_DIED bit is set. T3 lock_futex_pi(F); --> Succeeds due to the check for F's userspace TID field == 0 --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space T1 --> exit_pi_state_list() --> Transfers pi_state to waiter T2 and wakes T2 via rt_mutex_unlock(&pi_state->mutex) T2 --> acquires pi_state->mutex and gains real ownership of the pi_state --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space T3 --> observes inconsistent state This problem is independent of UP/SMP, preemptible/non preemptible kernels, or process shared vs. private. The only difference is that certain configurations are more likely to expose it. So as Siddhesh correctly analyzed the following check in futex_lock_pi_atomic() is the culprit: if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) { We check the userspace value for a TID value of 0 and take over the futex unconditionally if that's true. AFAICT this check is there as it is correct for a different corner case of futexes: the WAITERS bit became stale. Now the proposed change - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) { + if (unlikely(ownerdied || + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) { solves the problem, but it's not obvious why and it wreckages the "stale WAITERS bit" case. What happens is, that due to the WAITERS bit being set (T2 is blocked on that futex) it enforces T3 to go through lookup_pi_state(), which in the above case returns an existing pi_state and therefor forces T3 to legitimately fight with T2 over the ownership of the pi_state (via pi_state->mutex). Probelm solved! Though that does not work for the "WAITERS bit is stale" problem because if lookup_pi_state() does not find existing pi_state it returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to return -ESRCH to user space because the OWNER_DIED bit is not set. Now there is a different solution to that problem. Do not look at the user space value at all and enforce a lookup of possibly available pi_state. If pi_state can be found, then the new incoming locker T3 blocks on that pi_state and legitimately races with T2 to acquire the rt_mutex and the pi_state and therefor the proper ownership of the user space futex. lookup_pi_state() has the correct order of checks. It first tries to find a pi_state associated with the user space futex and only if that fails it checks for futex TID value = 0. If no pi_state is available nothing can create new state at that point because this happens with the hash bucket lock held. So the above scenario changes to: T1 lock_futex_pi(F); T2 lock_futex_pi(F); --> T2 blocks on the futex and creates pi_state which is associated to T1. T1 exits --> exit_robust_list() runs --> Futex F userspace value TID field is set to 0 and FUTEX_OWNER_DIED bit is set. T3 lock_futex_pi(F); --> Finds pi_state and blocks on pi_state->rt_mutex T1 --> exit_pi_state_list() --> Transfers pi_state to waiter T2 and wakes it via rt_mutex_unlock(&pi_state->mutex) T2 --> acquires pi_state->mutex and gains ownership of the pi_state --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space This covers all gazillion points on which T3 might come in between T1's exit_robust_list() clearing the TID field and T2 fixing it up. It also solves the "WAITERS bit stale" problem by forcing the take over. Another benefit of changing the code this way is that it makes it less dependent on untrusted user space values and therefor minimizes the possible wreckage which might be inflicted. As usual after staring for too long at the futex code my brain hurts so much that I really want to ditch that whole optimization of avoiding the syscall for the non contended case for PI futexes and rip out the maze of corner case handling code. Unfortunately we can't as user space relies on that existing behaviour, but at least thinking about it helps me to preserve my mental sanity. Maybe we should nevertheless :) Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos Acked-by: Darren Hart <dvhart@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-10-31module: fix out-by-one error in kallsymsRusty Russell1-11/+16
Masaki found and patched a kallsyms issue: the last symbol in a module's symtab wasn't transferred. This is because we manually copy the zero'th entry (which is always empty) then copy the rest in a loop starting at 1, though from src[0]. His fix was minimal, I prefer to rewrite the loops in more standard form. There are two loops: one to get the size, and one to copy. Make these identical: always count entry 0 and any defined symbol in an allocated non-init section. This bug exists since the following commit was introduced. module: reduce symbol table for loaded modules (v2) commit: 4a4962263f07d14660849ec134ee42b63e95ea9a LKML: http://lkml.org/lkml/2012/10/24/27 Reported-by: Masaki Kimura <masaki.kimura.kz@hitachi.com> Cc: stable@kernel.org
2012-10-25Merge branch 'akpm' (Andrew's fixes)Linus Torvalds1-1/+11
Merge misc fixes from Andrew Morton: "18 total. 15 fixes and some updates to a device_cgroup patchset which bring it up to date with the version which I should have merged in the first place." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (18 patches) fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check gen_init_cpio: avoid stack overflow when expanding drivers/rtc/rtc-imxdi.c: add missing spin lock initialization mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant pidns: limit the nesting depth of pid namespaces drivers/dma/dw_dmac: make driver's endianness configurable mm/mmu_notifier: allocate mmu_notifier in advance tools/testing/selftests/epoll/test_epoll.c: fix build UAPI: fix tools/vm/page-types.c mm/page_alloc.c:alloc_contig_range(): return early for err path rbtree: include linux/compiler.h for definition of __always_inline genalloc: stop crashing the system when destroying a pool backlight: ili9320: add missing SPI dependency device_cgroup: add proper checking when changing default behavior device_cgroup: stop using simple_strtoul() device_cgroup: rename deny_all to behavior cgroup: fix invalid rcu dereference mm: fix XFS oops due to dirty pages without buffers on s390
2012-10-25Makefile: Documentation for external tool should be correctH. Peter Anvin1-4/+2
If one includes documentation for an external tool, it should be correct. This is not: 1. Overriding the input to rngd should typically be neither necessary nor desired. This is especially so since newer versions of rngd support a number of different *types* of sources. 2. The default kernel-exported device is called /dev/hwrng not /dev/hwrandom nor /dev/hw_random (both of which were used in the past; however, kernel and udev seem to have converged on /dev/hwrng.) Overall it is better if the documentation for rngd is kept with rngd rather than in a kernel Makefile. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25pidns: limit the nesting depth of pid namespacesAndrew Vagin1-1/+11
'struct pid' is a "variable sized struct" - a header with an array of upids at the end. The size of the array depends on a level (depth) of pid namespaces. Now a level of pidns is not limited, so 'struct pid' can be more than one page. Looks reasonable, that it should be less than a page. MAX_PIS_NS_LEVEL is not calculated from PAGE_SIZE, because in this case it depends on architectures, config options and it will be reduced, if someone adds a new fields in struct pid or struct upid. I suggest to set MAX_PIS_NS_LEVEL = 32, because it saves ability to expand "struct pid" and it's more than enough for all known for me use-cases. When someone finds a reasonable use case, we can add a config option or a sysctl parameter. In addition it will reduce the effect of another problem, when we have many nested namespaces and the oldest one starts dying. zap_pid_ns_processe will be called for each namespace and find_vpid will be called for each process in a namespace. find_vpid will be called minimum max_level^2 / 2 times. The reason of that is that when we found a bit in pidmap, we can't determine this pidns is top for this process or it isn't. vpid is a heavy operation, so a fork bomb, which create many nested namespace, can make a system inaccessible for a long time. For example my system becomes inaccessible for a few minutes with 4000 processes. [akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM] Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-24Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds1-31/+10
Pull cgroup fixes from Tejun Heo: "This pull request contains three fixes. Two are reverts of task_lock() removal in cgroup fork path. The optimizations incorrectly assumed that threadgroup_lock can protect process forks (as opposed to thread creations) too. Further cleanup of cgroup fork path is scheduled. The third fixes cgroup emptiness notification loss." * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: Remove task_lock() from cgroup_post_fork()" Revert "cgroup: Drop task_lock(parent) on cgroup_fork()" cgroup: notify_on_release may not be triggered in some cases
2012-10-24Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-1/+1
Pull workqueue fix from Tejun Heo: "This pull request contains one patch from Dan Magenheimer to fix cancel_delayed_work() regression introduced by its reimplementation using try_to_grab_pending(). The reimplementation made it incorrectly return %true when the work item is idle. There aren't too many consumers of the return value but it broke at least ramster." * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: cancel_delayed_work() should return %false if work item is idle
2012-10-24workqueue: cancel_delayed_work() should return %false if work item is idleDan Magenheimer1-1/+1
57b30ae77b ("workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()") made cancel_delayed_work() always return %true unless someone else is also trying to cancel the work item, which is broken - if the target work item is idle, the return value should be %false. try_to_grab_pending() indicates that the target work item was idle by zero return value. Use it for return. Note that this brings cancel_delayed_work() in line with __cancel_work_timer() in return value handling. Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>
2012-10-24Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-183/+166
Pull perf fixes from Ingo Molnar: "Most of these are uprobes race fixes from Oleg, and their preparatory cleanups. (It's larger than what I'd normally send for an -rc kernel, but they looked significant enough to not delay them.) There's also an oprofile fix and an uncore PMU fix." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) perf/x86: Disable uncore on virtualized CPUs oprofile, x86: Fix wrapping bug in op_x86_get_ctrl() ring-buffer: Check for uninitialized cpu buffer before resizing uprobes: Fix the racy uprobe->flags manipulation uprobes: Fix prepare_uprobe() race with itself uprobes: Introduce prepare_uprobe() uprobes: Fix handle_swbp() vs unregister() + register() race uprobes: Do not delete uprobe if uprobe_unregister() fails uprobes: Don't return success if alloc_uprobe() fails uprobes/x86: Only rep+nop can be emulated correctly uprobes: Simplify is_swbp_at_addr(), remove stale comments uprobes: Kill set_orig_insn()->is_swbp_at_addr() uprobes: Introduce copy_opcode(), kill read_opcode() uprobes: Kill set_swbp()->is_swbp_at_addr() uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC uprobes: Change write_opcode() to use FOLL_FORCE uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume() uprobes: Kill UTASK_BP_HIT state uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp() ...
2012-10-22module_signing: fix printk format warningRandy Dunlap1-1/+1
Fix the warning: kernel/module_signing.c:195:2: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t' by using the proper 'z' modifier for printing a size_t. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-21Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgentIngo Molnar1-0/+4
Pull ftrace ring-buffer resizing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/urgentIngo Molnar1-183/+162
Pull various uprobes bugfixes from Oleg Nesterov - mostly race and failure path fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-19use clamp_t in UNAME26 fixKees Cook1-1/+1
The min/max call needed to have explicit types on some architectures (e.g. mn10300). Use clamp_t instead to avoid the warning: kernel/sys.c: In function 'override_release': kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default] Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19MODSIGN: Move the magic string to the end of a module and eliminate the searchDavid Howells3-28/+25
Emit the magic string that indicates a module has a signature after the signature data instead of before it. This allows module_sig_check() to be made simpler and faster by the elimination of the search for the magic string. Instead we just need to do a single memcmp(). This works because at the end of the signature data there is the fixed-length signature information block. This block then falls immediately prior to the magic number. From the contents of the information block, it is trivial to calculate the size of the signature data and thus the size of the actual module data. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19Revert "cgroup: Remove task_lock() from cgroup_post_fork()"Tejun Heo1-12/+3
This reverts commit 7e3aa30ac8c904a706518b725c451bb486daaae9. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org
2012-10-19Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"Tejun Heo1-17/+6
This reverts commit 7e381b0eb1e1a9805c37335562e8dc02e7d7848c. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org
2012-10-19pidns: remove recursion from free_pid_ns()Cyrill Gorcunov1-7/+14
free_pid_ns() operates in a recursive fashion: free_pid_ns(parent) put_pid_ns(parent) kref_put(&ns->kref, free_pid_ns); free_pid_ns thus if there was a huge nesting of namespaces the userspace may trigger avalanche calling of free_pid_ns leading to kernel stack exhausting and a panic eventually. This patch turns the recursion into an iterative loop. Based on a patch by Andrew Vagin. [akpm@linux-foundation.org: export put_pid_ns() to modules] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19kernel/sys.c: fix stack memory content leak via UNAME26Kees Cook1-5/+7
Calling uname() with the UNAME26 personality set allows a leak of kernel stack contents. This fixes it by defensively calculating the length of copy_to_user() call, making the len argument unsigned, and initializing the stack buffer to zero (now technically unneeded, but hey, overkill). CVE-2012-0957 Reported-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Brad Spengler <spender@grsecurity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16printk: Fix scheduling-while-atomic problem in console_cpu_notify()Paul E. McKenney1-1/+0
The console_cpu_notify() function runs with interrupts disabled in the CPU_DYING case. It therefore cannot block, for example, as will happen when it calls console_lock(). Therefore, remove the CPU_DYING leg of the switch statement to avoid this problem. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16cgroup: notify_on_release may not be triggered in some casesDaisuke Nishimura1-2/+1
notify_on_release must be triggered when the last process in a cgroup is move to another. But if the first(and only) process in a cgroup is moved to another, notify_on_release is not triggered. # mkdir /cgroup/cpu/SRC # mkdir /cgroup/cpu/DST # # echo 1 >/cgroup/cpu/SRC/notify_on_release # echo 1 >/cgroup/cpu/DST/notify_on_release # # sleep 300 & [1] 8629 # # echo 8629 >/cgroup/cpu/SRC/tasks # echo 8629 >/cgroup/cpu/DST/tasks -> notify_on_release for /SRC must be triggered at this point, but it isn't. This is because put_css_set() is called before setting CGRP_RELEASABLE in cgroup_task_migrate(), and is a regression introduce by the commit:74a1166d(cgroups: make procs file writable), which was merged into v3.0. Cc: Ben Blum <bblum@andrew.cmu.edu> Cc: <stable@vger.kernel.org> # v3.0.x and later Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-14Merge branch 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linuxLinus Torvalds5-27/+578
Pull module signing support from Rusty Russell: "module signing is the highlight, but it's an all-over David Howells frenzy..." Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG. * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits) X.509: Fix indefinite length element skip error handling X.509: Convert some printk calls to pr_devel asymmetric keys: fix printk format warning MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking MODSIGN: Make mrproper should remove generated files. MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs MODSIGN: Use the same digest for the autogen key sig as for the module sig MODSIGN: Sign modules during the build process MODSIGN: Provide a script for generating a key ID from an X.509 cert MODSIGN: Implement module signature checking MODSIGN: Provide module signing public keys to the kernel MODSIGN: Automatically generate module signing keys if missing MODSIGN: Provide Kconfig options MODSIGN: Provide gitignore and make clean rules for extra files MODSIGN: Add FIPS policy module: signature checking hook X.509: Add a crypto key parser for binary (DER) X.509 certificates MPILIB: Provide a function to read raw data into an MPI X.509: Add an ASN.1 decoder X.509: Add simple ASN.1 grammar compiler ...
2012-10-13Merge tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdbLinus Torvalds4-5/+50
Pull KGDB/KDB fixes and cleanups from Jason Wessel: "Cleanups - Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c - Add module event hooks for simplified debugging with gdb Fixes - Fix kdb to stop paging with 'q' on bta and dmesg - Fix for data that scrolls off the vga console due to line wrapping when using the kdb pager New - The debug core registers for kernel module events which allows a kernel aware gdb to automatically load symbols and break on entry to a kernel module - Allow kgdboc=kdb to setup kdb on the vga console" * tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: tty/console: fix warnings in drivers/tty/serial/kgdboc.c kdb,vt_console: Fix missed data due to pager overruns kdb: Fix dmesg/bta scroll to quit with 'q' kgdboc: Accept either kbd or kdb to activate the vga + keyboard kdb shell kgdb,x86: fix warning about unused variable mips,kgdb: fix recursive page fault with CONFIG_KPROBES kgdb: Add module event hooks
2012-10-13Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-9/+12
Pull perf updates from Ingo Molnar: "This tree includes some late late perf items that missed the first round: tools: - Bash auto completion improvements, now we can auto complete the tools long options, tracepoint event names, etc, from Namhyung Kim. - Look up thread using tid instead of pid in 'perf sched'. - Move global variables into a perf_kvm struct, from David Ahern. - Hists refactorings, preparatory for improved 'diff' command, from Jiri Olsa. - Hists refactorings, preparatory for event group viewieng work, from Namhyung Kim. - Remove double negation on optional feature macro definitions, from Namhyung Kim. - Remove several cases of needless global variables, on most builtins. - misc fixes kernel: - sysfs support for IBS on AMD CPUs, from Robert Richter. - Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner HPC blade PMU, from Vince Weaver. - misc fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) perf: Fix perf_cgroup_switch for sw-events perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu perf/AMD/IBS: Add sysfs support perf hists: Add more helpers for hist entry stat perf hists: Move he->stat.nr_events initialization to a template perf hists: Introduce struct he_stat perf diff: Removing the total_period argument from output code perf tool: Add hpp interface to enable/disable hpp column perf tools: Removing hists pair argument from output path perf hists: Separate overhead and baseline columns perf diff: Refactor diff displacement possition info perf hists: Add struct hists pointer to struct hist_entry perf tools: Complete tracepoint event names perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU perf evlist: Remove some unused methods perf evlist: Introduce add_newtp method perf kvm: Move global variables into a perf_kvm struct perf tools: Convert to BACKTRACE_SUPPORT perf tools: Long option completion support for each subcommands perf tools: Complete long option names of perf command ...
2012-10-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signalLinus Torvalds2-2/+6
Pull third pile of kernel_execve() patches from Al Viro: "The last bits of infrastructure for kernel_thread() et.al., with alpha/arm/x86 use of those. Plus sanitizing the asm glue and do_notify_resume() on alpha, fixing the "disabled irq while running task_work stuff" breakage there. At that point the rest of kernel_thread/kernel_execve/sys_execve work can be done independently for different architectures. The only pending bits that do depend on having all architectures converted are restrictred to fs/* and kernel/* - that'll obviously have to wait for the next cycle. I thought we'd have to wait for all of them done before we start eliminating the longjump-style insanity in kernel_execve(), but it turned out there's a very simple way to do that without flagday-style changes." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: alpha: switch to saner kernel_execve() semantics arm: switch to saner kernel_execve() semantics x86, um: convert to saner kernel_execve() semantics infrastructure for saner ret_from_kernel_thread semantics make sure that kernel_thread() callbacks call do_exit() themselves make sure that we always have a return path from kernel_execve() ppc: eeh_event should just use kthread_run() don't bother with kernel_thread/kernel_execve for launching linuxrc alpha: get rid of switch_stack argument of do_work_pending() alpha: don't bother passing switch_stack separately from regs alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c alpha: simplify TIF_NEED_RESCHED handling
2012-10-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds5-99/+199
Pull third pile of VFS updates from Al Viro: "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit and namei, removing a lot of insanity from audit_inode() mess and getting things ready for his ESTALE patchset." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: procfs: don't need a PATH_MAX allocation to hold a string representation of an int vfs: embed struct filename inside of names_cache allocation if possible audit: make audit_inode take struct filename vfs: make path_openat take a struct filename pointer vfs: turn do_path_lookup into wrapper around struct filename variant audit: allow audit code to satisfy getname requests from its names_list vfs: define struct filename and have getname() return it vfs: unexport getname and putname symbols acct: constify the name arg to acct_on vfs: allocate page instead of names_cache buffer in mount_block_root audit: overhaul __audit_inode_child to accomodate retrying audit: optimize audit_compare_dname_path audit: make audit_compare_dname_path use parent_len helper audit: remove dirlen argument to audit_compare_dname_path audit: set the name_len in audit_inode for parent lookups audit: add a new "type" field to audit_names struct audit: reverse arguments to audit_inode_child audit: no need to walk list in audit_inode if name is NULL audit: pass in dentry to audit_copy_inode wherever possible audit: remove unnecessary NULL ptr checks from do_path_lookup
2012-10-12audit: make audit_inode take struct filenameJeff Layton1-2/+23
Keep a pointer to the audit_names "slot" in struct filename. Have all of the audit_inode callers pass a struct filename ponter to audit_inode instead of a string pointer. If the aname field is already populated, then we can skip walking the list altogether and just use it directly. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: make path_openat take a struct filename pointerJeff Layton1-3/+3
...and fix up the callers. For do_file_open_root, just declare a struct filename on the stack and fill out the .name field. For do_filp_open, make it also take a struct filename pointer, and fix up its callers to call it appropriately. For filp_open, add a variant that takes a struct filename pointer and turn filp_open into a wrapper around it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: allow audit code to satisfy getname requests from its names_listJeff Layton1-0/+23
Currently, if we call getname() on a userland string more than once, we'll get multiple copies of the string and multiple audit_names records. Add a function that will allow the audit_names code to satisfy getname requests using info from the audit_names list, avoiding a new allocation and audit_names records. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: define struct filename and have getname() return itJeff Layton2-31/+37
getname() is intended to copy pathname strings from userspace into a kernel buffer. The result is just a string in kernel space. It would however be quite helpful to be able to attach some ancillary info to the string. For instance, we could attach some audit-related info to reduce the amount of audit-related processing needed. When auditing is enabled, we could also call getname() on the string more than once and not need to recopy it from userspace. This patchset converts the getname()/putname() interfaces to return a struct instead of a string. For now, the struct just tracks the string in kernel space and the original userland pointer for it. Later, we'll add other information to the struct as it becomes convenient. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12infrastructure for saner ret_from_kernel_thread semanticsAl Viro2-0/+4
* allow kernel_execve() leave the actual return to userland to caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers updated accordingly. * architecture that does select GENERIC_KERNEL_EXECVE in its Kconfig should have its ret_from_kernel_thread() do this: call schedule_tail call the callback left for it by copy_thread(); if it ever returns, that's because it has just done successful kernel_execve() jump to return from syscall IOW, its only difference from ret_from_fork() is that it does call the callback. * such an architecture should also get rid of ret_from_kernel_execve() and __ARCH_WANT_KERNEL_EXECVE This is the last part of infrastructure patches in that area - from that point on work on different architectures can live independently. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds6-164/+119
Pull timer core update from Thomas Gleixner: - Bug fixes (one for a longstanding dead loop issue) - Rework of time related vsyscalls - Alarm timer updates - Jiffies updates to remove compile time dependencies * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Cast raw_interval to u64 to avoid shift overflow timers: Fix endless looping between cascade() and internal_add_timer() time/jiffies: bring back unconditional LATCH definition time: Convert x86_64 to using new update_vsyscall time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems time: Introduce new GENERIC_TIME_VSYSCALL time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD time: Move update_vsyscall definitions to timekeeper_internal.h time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes jiffies: Remove compile time assumptions about CLOCK_TICK_RATE jiffies: Kill unused TICK_USEC_TO_NSEC alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue alarmtimer: Remove unused helpers & defines alarmtimer: Use hrtimer per-alarm instead of per-base alarmtimer: Implement minimum alarm interval for allowing suspend
2012-10-12Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-1/+70
Pull scheduler fixes from Ingo Molnar: "A CPU hotplug related crash fix and a nohz accounting fixlet." * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Update sched_domains_numa_masks[][] when new cpus are onlined sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions nohz: Fix one jiffy count too far in idle cputime
2012-10-12Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-11/+16
Pull RCU fixes from Ingo Molnar: "This tree includes a shutdown/cpu-hotplug deadlock fix and a documentation fix." * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Advise most users not to enable RCU user mode rcu: Grace-period initialization excludes only RCU notifier
2012-10-12kdb,vt_console: Fix missed data due to pager overrunsJason Wessel1-5/+28
It is possible to miss data when using the kdb pager. The kdb pager does not pay attention to the maximum column constraint of the screen or serial terminal. This result is not incrementing the shown lines correctly and the pager will print more lines that fit on the screen. Obviously that is less than useful when using a VGA console where you cannot scroll back. The pager will now look at the kdb_buffer string to see how many characters are printed. It might not be perfect considering you can output ASCII that might move the cursor position, but it is a substantially better approximation for viewing dmesg and trace logs. This also means that the vt screen needs to set the kdb COLUMNS variable. Cc: <stable@vger.kernel.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12kdb: Fix dmesg/bta scroll to quit with 'q'Jason Wessel2-0/+4
If you press 'q' the pager should exit instead of printing everything from dmesg which can really bog down a 9600 baud serial link. The same is true for the bta command. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12kgdb: Add module event hooksJason Wessel1-0/+18
Allow gdb to auto load kernel modules when it is attached, which makes it trivially easy to debug module init functions or pre-set breakpoints in a kernel module that has not loaded yet. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-10-12acct: constify the name arg to acct_onJeff Layton1-1/+1
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: overhaul __audit_inode_child to accomodate retryingJeff Layton1-27/+30
In order to accomodate retrying path-based syscalls, we need to add a new "type" argument to audit_inode_child. This will tell us whether we're looking for a child entry that represents a create or a delete. If we find a parent, don't automatically assume that we need to create a new entry. Instead, use the information we have to try to find an existing entry first. Update it if one is found and create a new one if not. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: optimize audit_compare_dname_pathJeff Layton4-12/+20
In the cases where we already know the length of the parent, pass it as a parm so we don't need to recompute it. In the cases where we don't know the length, pass in AUDIT_NAME_FULL (-1) to indicate that it should be determined. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: make audit_compare_dname_path use parent_len helperEric Paris1-20/+7
Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: remove dirlen argument to audit_compare_dname_pathJeff Layton4-10/+5
All the callers set this to NULL now. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: set the name_len in audit_inode for parent lookupsJeff Layton3-12/+60
Currently, this gets set mostly by happenstance when we call into audit_inode_child. While that might be a little more efficient, it seems wrong. If the syscall ends up failing before audit_inode_child ever gets called, then you'll have an audit_names record that shows the full path but has the parent inode info attached. Fix this by passing in a parent flag when we call audit_inode that gets set to the value of LOOKUP_PARENT. We can then fix up the pathname for the audit entry correctly from the get-go. While we're at it, clean up the no-op macro for audit_inode in the !CONFIG_AUDITSYSCALL case. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>