aboutsummaryrefslogtreecommitdiffstats
path: root/init (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2015-09-11sys_membarrier(): system-wide memory barrier (generic, x86)Mathieu Desnoyers1-0/+12
Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows: * Through Userspace RCU library (http://urcu.so) - DNS server (Knot DNS) https://www.knot-dns.cz/ - Network sniffer (http://netsniff-ng.org/) - Distributed object storage (https://sheepdog.github.io/sheepdog/) - User-space tracing (http://lttng.org) - Network storage system (https://www.gluster.org/) - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf) - Financial software (https://lkml.org/lkml/2015/3/23/189) Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu(). * Direct users of sys_membarrier - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198) Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads: Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock()) In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pairs: Thread A Thread B previous mem accesses previous mem accesses smp_mb() smp_mb() following mem accesses following mem accesses After the change, these pairs become: Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2). 1) Non-concurrent Thread A vs Thread B accesses: Thread A Thread B prev mem accesses sys_membarrier() follow mem accesses prev mem accesses barrier() follow mem accesses In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses. 2) Concurrent Thread A vs Thread B accesses Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched(). * Benchmarks On Intel Xeon E5405 (8 cores) (one thread is calling sys_membarrier, the other 7 threads are busy looping) 1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call. * User-space user of this system call: Userspace RCU library Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. Results in liburcu: Operations in 10s, 6 readers, 2 writers: memory barriers in reader: 1701557485 reads, 2202847 writes signal-based scheme: 9830061167 reads, 6700 writes sys_membarrier: 9952759104 reads, 425 writes sys_membarrier (dyn. check): 7970328887 reads, 425 writes The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so membarrier(2) man page: MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2) NAME membarrier - issue memory barriers on a set of threads SYNOPSIS #include <linux/membarrier.h> int membarrier(int cmd, int flags); DESCRIPTION The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. It returns a bitmask of supported commands. MEMBARRIER_CMD_SHARED Execute a memory barrier on all threads running on the system. Upon return from system call, the caller thread is ensured that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This covers threads from all pro=E2=80=90 cesses running on the system. This command returns 0. The flags argument needs to be 0. For future extensions. All memory accesses performed in program order from each targeted thread is guaranteed to be ordered with respect to sys_membarrier(). If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(): The pair ordering is detailed as (O: ordered, X: not ordered): barrier() smp_mb() sys_membarrier() barrier() X X O smp_mb() X O O sys_membarrier() O O O RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. For a given command, with flags argument set to 0, this system call is guaranteed to always return the same value until reboot. ERRORS ENOSYS System call is not implemented. EINVAL Invalid arguments. Linux 2015-04-15 MEMBARRIER(2) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10kexec: split kexec_load syscall from kexec core codeDave Young1-2/+2
There are two kexec load syscalls, kexec_load another and kexec_file_load. kexec_file_load has been splited as kernel/kexec_file.c. In this patch I split kexec_load syscall code to kernel/kexec.c. And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and use kexec_file_load only, or vice verse. The original requirement is from Ted Ts'o, he want kexec kernel signature being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use kexec_load syscall can bypass the checking. Vivek Goyal proposed to create a common kconfig option so user can compile in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work. Because there's general code need CONFIG_KEXEC_CORE, so I updated all the architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects KEXEC_CORE in arch Kconfig. Also updated general kernel code with to kexec_load syscall. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10kmod: use system_unbound_wq instead of khelperFrederic Weisbecker1-1/+0
We need to launch the usermodehelper kernel threads with the widest affinity and this is partly why we use khelper. This workqueue has unbound properties and thus a wide affinity inherited by all its children. Now khelper also has special properties that we aren't much interested in: ordered and singlethread. There is really no need about ordering as all we do is creating kernel threads. This can be done concurrently. And singlethread is a useless limitation as well. The workqueue engine already proposes generic unbound workqueues that don't share these useless properties and handle well parallel jobs. The only worrysome specific is their affinity to the node of the current CPU. It's fine for creating the usermodehelper kernel threads but those inherit this affinity for longer jobs such as requesting modules. This patch proposes to use these node affine unbound workqueues assuming that a node is sufficient to handle several parallel usermodehelper requests. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04mm: send one IPI per CPU to TLB flush all entries after unmapping pagesMel Gorman1-0/+10
An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accesssed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries. Architectures wishing to do this must give the following guarantee. If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that linear address from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%) Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%) Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 581.00 611.43 System 5804.93 4111.76 Elapsed 161.03 122.12 This is showing that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is lower on a single socket machine. 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%) Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%) Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 58.09 57.64 System 111.82 76.56 Elapsed 27.29 22.55 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all. [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: buildsystem activationAndrea Arcangeli1-0/+11
This allows to select the userfaultfd during configuration to build it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15percpu-rwsem: kill CONFIG_PERCPU_RWSEMOleg Nesterov1-1/+0
Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional user of percpu_rw_semaphore. Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-14Move certificate handling to its own directoryDavid Howells1-39/+0
Move certificate handling out of the kernel/ directory and into a certs/ directory to get all the weird stuff in one place and move the generated signing keys into this directory. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-12sign-file: Document dependency on OpenSSL devel librariesDavid Howells1-0/+4
The revised sign-file program is no longer a script that wraps the openssl program, but now rather a program that makes use of OpenSSL's crypto library. This means that to build the sign-file program, the kernel build process now has a dependency on the OpenSSL development packages in addition to OpenSSL itself. Document this in Kconfig and in module-signing.txt. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-07modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS optionDavid Woodhouse1-0/+13
Let the user explicitly provide a file containing trusted keys, instead of just automatically finding files matching *.x509 in the build tree and trusting whatever we find. This really ought to be an *explicit* configuration, and the build rules for dealing with the files were fairly painful too. Fix applied from James Morris that removes an '=' from a macro definition in kernel/Makefile as this is a feature that only exists from GNU make 3.82 onwards. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07modsign: Use single PEM file for autogenerated keyDavid Woodhouse1-2/+2
The current rule for generating signing_key.priv and signing_key.x509 is a classic example of a bad rule which has a tendency to break parallel make. When invoked to create *either* target, it generates the other target as a side-effect that make didn't predict. So let's switch to using a single file signing_key.pem which contains both key and certificate. That matches what we do in the case of an external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also slightly cleaner. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if neededDavid Woodhouse1-4/+4
Where an external PEM file or PKCS#11 URI is given, we can get the cert from it for ourselves instead of making the user drop signing_key.x509 in place for us. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07modsign: Allow external signing key to be specifiedDavid Woodhouse1-0/+14
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07MODSIGN: Extract the blob PKCS#7 signature verifier from module signingDavid Howells1-10/+19
Extract the function that drives the PKCS#7 signature verification given a data blob and a PKCS#7 blob out from the module signing code and lump it with the system keyring code as it's generic. This makes it independent of module config options and opens it to use by the firmware loader. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ming Lei <ming.lei@canonical.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Kyle McMartin <kyle@kernel.org>
2015-08-07MODSIGN: Use PKCS#7 messages as module signaturesDavid Howells1-0/+1
Move to using PKCS#7 messages as module signatures because: (1) We have to be able to support the use of X.509 certificates that don't have a subjKeyId set. We're currently relying on this to look up the X.509 certificate in the trusted keyring list. (2) PKCS#7 message signed information blocks have a field that supplies the data required to match with the X.509 certificate that signed it. (3) The PKCS#7 certificate carries fields that specify the digest algorithm used to generate the signature in a standardised way and the X.509 certificates specify the public key algorithm in a standardised way - so we don't need our own methods of specifying these. (4) We now have PKCS#7 message support in the kernel for signed kexec purposes and we can make use of this. To make this work, the old sign-file script has been replaced with a program that needs compiling in a previous patch. The rules to build it are added here. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Vivek Goyal <vgoyal@redhat.com>
2015-08-07fs, file table: reinit files_stat.max_files after deferred memory initialisationMel Gorman1-1/+1
Dave Hansen reported the following; My laptop has been behaving strangely with 4.2-rc2. Once I log in to my X session, I start getting all kinds of strange errors from applications and see this in my dmesg: VFS: file-max limit 8192 reached The problem is that the file-max is calculated before memory is fully initialised and miscalculates how much memory the kernel is using. This patch recalculates file-max after deferred memory initialisation. Note that using memory hotplug infrastructure would not have avoided this problem as the value is not recalculated after memory hot-add. 4.1: files_stat.max_files = 6582781 4.2-rc2: files_stat.max_files = 8192 4.2-rc2 patched: files_stat.max_files = 6562467 Small differences with the patch applied and 4.1 but not enough to matter. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Hansen <dave.hansen@intel.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Alex Ng <alexng@microsoft.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-22rcu: Hide RCU_NOCB_CPU behind RCU_EXPERTPaul E. McKenney1-0/+1
This commit prevents Kconfig from asking the user about RCU_NOCB_CPU unless the user really wants to be asked. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de>
2015-07-14cgroup: implement the PIDs subsystemAleksa Sarai1-0/+16
Adds a new single-purpose PIDs subsystem to limit the number of tasks that can be forked inside a cgroup. Essentially this is an implementation of RLIMIT_NPROC that applies to a cgroup rather than a process tree. However, it should be noted that organisational operations (adding and removing tasks from a PIDs hierarchy) will *not* be prevented. Rather, the number of tasks in the hierarchy cannot exceed the limit through forking. This is due to the fact that, in the unified hierarchy, attach cannot fail (and it is not possible for a task to overcome its PIDs cgroup policy limit by attaching to a child cgroup -- even if migrating mid-fork it must be able to fork in the parent first). PIDs are fundamentally a global resource, and it is possible to reach PID exhaustion inside a cgroup without hitting any reasonable kmemcg policy. Once you've hit PID exhaustion, you're only in a marginally better state than OOM. This subsystem allows PID exhaustion inside a cgroup to be prevented. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-07-06rcu: Drop RCU_USER_QS in favor of NO_HZ_FULLPaul E. McKenney1-9/+0
The RCU_USER_QS Kconfig parameter is now just a synonym for NO_HZ_FULL, so this commit eliminates RCU_USER_QS, replacing all uses with NO_HZ_FULL. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
2015-07-04sched/stat: Simplify the sched_info accounting dependencyNaveen N. Rao1-0/+1
Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task sched_info, which results in ugly #if clauses. Simplify the code by introducing a synthethic CONFIG_SCHED_INFO switch, selected by both. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: a.p.zijlstra@chello.nl Cc: ricklind@us.ibm.com Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-01printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25Ingo Molnar1-1/+1
So I tried to some kernel debugging that produced a ton of kernel messages on a big box, and wanted to save them all: but CONFIG_LOG_BUF_SHIFT maxes out at 21 (2 MB). Increase it to 25 (32 MB). This does not affect any existing config or defaults. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-30mm: meminit: finish initialisation of struct pages before basic setupMel Gorman1-0/+2
Waiman Long reported that 24TB machines hit OOM during basic setup when struct page initialisation was deferred. One approach is to initialise memory on demand but it interferes with page allocator paths. This patch creates dedicated threads to initialise memory before basic setup. It then blocks on a rw_semaphore until completion as a wait_queue and counter is overkill. This may be slower to boot but it's simplier overall and also gets rid of a section mangling which existed so kswapd could do the initialisation. [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Waiman Long <waiman.long@hp.com Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Scott Norton <scott.norton@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25init/do_mounts.c: add create_dev() failure logVishnu Pratap Singh1-2/+7
If create_dev() function fails to create the root mount device (/dev/root), then it goes to panic as root device not found but there is no printk in this case. So I have added the log in case it fails to create the root device. It will help in debugging. [akpm@linux-foundation.org: simplify printk(), use pr_emerg(), display errno] Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dan Ehrenberg <dehrenberg@chromium.org> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25fs, proc: introduce CONFIG_PROC_CHILDRENIago López Galeiras1-0/+1
Commit 818411616baf ("fs, proc: introduce /proc/<pid>/task/<tid>/children entry") introduced the children entry for checkpoint restore and the file is only available on kernels configured with CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS) because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE. But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE. However, the children proc file is useful outside of checkpoint restore. I would like to use it in rkt. The rkt process exec() another program it does not control, and that other program will fork()+exec() a child process. I would like to find the pid of the child process from an external tool without iterating in /proc over all processes to find which one has a parent pid equal to rkt. This commit introduces CONFIG_PROC_CHILDREN and makes CONFIG_CHECKPOINT_RESTORE select it. This allows enabling /proc/<pid>/task/<tid>/children without needing to enable CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT. Alban tested that /proc/<pid>/task/<tid>/children is present when the kernel is configured with CONFIG_PROC_CHILDREN=y but without CONFIG_CHECKPOINT_RESTORE Signed-off-by: Iago López Galeiras <iago@endocode.com> Tested-by: Alban Crequy <alban@endocode.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Djalal Harouni <djalal@endocode.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-23modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.Rusty Russell1-15/+10
Andreas turned this option on, only to find out Debian (and Ubuntu!) don't enable support in their kmod builds. Shorten the text, and suggest N at the bottom (at least for now). Reported-by: Andreas Mohr <andim2@users.sf.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-06-10ACPI / init: Switch over platform to the ACPI mode laterRafael J. Wysocki1-0/+1
Commit 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()" moved the ACPI subsystem initialization, including the ACPI mode enabling, to an earlier point in the initialization sequence, to allow the timekeeping subsystem use ACPI early. Unfortunately, that resulted in boot regressions on some systems and the early ACPI initialization was moved toward its original position in the kernel initialization code by commit c4e1acbb35e4 "ACPI / init: Invoke early ACPI initialization later". However, that turns out to be insufficient, as boot is still broken on the Tyan S8812 mainboard. To fix that issue, split the ACPI early initialization code into two pieces so the majority of it still located in acpi_early_init() and the part switching over the platform into the ACPI mode goes into a new function, acpi_subsystem_init(), executed at the original early ACPI initialization spot. That fixes the Tyan S8812 boot problem, but still allows ACPI tables to be loaded earlier which is useful to the EFI code in efi_enter_virtual_mode(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141 Fixes: 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()" Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
2015-06-02writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACKTejun Heo1-0/+5
cgroup writeback requires support from both bdi and filesystem sides. Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if both MEMCG and BLK_CGROUP are enabled. inode_cgwb_enabled() which determines whether a given inode's both bdi and fs support cgroup writeback is added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-28module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACINGPeter Zijlstra1-0/+4
Andrew worried about the overhead on small systems; only use the fancy code when either perf or tracing is enabled. Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Requested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-05-27rcu: Remove prompt for RCU implementationPranith Kumar1-12/+6
The RCU implementation is chosen based on PREEMPT and SMP config options and is not really a user-selectable choice. This commit removes the menu entry, given that there is not much point in calling something a choice when there is in fact no choice.. The TINY_RCU, TREE_RCU, and PREEMPT_RCU Kconfig options continue to be selected based solely on the values of the PREEMPT and SMP options. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIOPaul E. McKenney1-0/+1
This commit updates the initialization of the kthread_prio boot parameter so that RCU will build even when CONFIG_RCU_KTHREAD_PRIO is undefined. The kthread_prio boot parameter is set to CONFIG_RCU_KTHREAD_PRIO if that is defined, otherwise to 1 if CONFIG_RCU_BOOST is defined and to zero otherwise. This commit then makes CONFIG_RCU_KTHREAD_PRIO depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about CONFIG_RCU_KTHREAD_PRIO unless they want to be. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAFPaul E. McKenney1-1/+1
This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined. The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF when defined, otherwise it is set to 32 for 32-bit systems and 64 for 64-bit systems. This commit then makes CONFIG_RCU_FANOUT_LEAF depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about CONFIG_RCU_FANOUT_LEAF unless they want to be. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUTPaul E. McKenney1-1/+1
This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will build even when CONFIG_RCU_FANOUT is undefined. The RCU_FANOUT macro is set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set to 32 for 32-bit systems and 64 for 64-bit systems. This commit then makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about CONFIG_RCU_FANOUT unless they want to be. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Break dependency of RCU_FANOUT_LEAF on RCU_FANOUTPaul E. McKenney1-2/+2
RCU_FANOUT_LEAF's range and default values depend on the value of RCU_FANOUT, which at the time seemed like a cute way to save two lines of Kconfig code. However, adding a dependency from both of these Kconfig parameters on RCU_EXPERT requires that RCU_FANOUT_LEAF operate correctly even if RCU_FANOUT is undefined. This commit therefore allows RCU_FANOUT_LEAF to take on the full range of permitted values, even in cases where RCU_FANOUT is undefined. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Eliminate redundant "default" as suggested by Pranith Kumar. ] Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Create RCU_EXPERT Kconfig and hide booleans behind itPaul E. McKenney1-2/+17
This commit creates an RCU_EXPERT Kconfig and hides the independent boolean RCU-related user-visible Kconfig parameters behind it, namely RCU_FAST_NO_HZ and RCU_BOOST. This prevents Kconfig from asking about these parameters unless the user really wants to be asked. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Convert CONFIG_RCU_FANOUT_EXACT to boot parameterPaul E. McKenney1-14/+0
The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and perhaps only) by rcutorture to verify that RCU works correctly in specific rcu_node combining-tree configurations. It therefore does not make much sense have this as a question to people attempting to configure their kernels. So this commit creates an rcutree.rcu_fanout_exact= boot parameter that rcutorture can use, and eliminates the original CONFIG_RCU_FANOUT_EXACT Kconfig parameter. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27rcu: Directly drive RCU_USER_QS from KconfigPaul E. McKenney1-9/+1
Currently, Kconfig will ask the user whether RCU_USER_QS should be set. This is silly because Kconfig already has all the information that it needs to set this parameter. This commit therefore directly drives the value of RCU_USER_QS via NO_HZ_FULL's "select" statement. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
2015-05-27rcu: Directly drive TASKS_RCU from KconfigPaul E. McKenney1-3/+1
Currently, Kconfig will ask the user whether TASKS_RCU should be set. This is silly because Kconfig already has all the information that it needs to set this parameter. This commit therefore directly drives the value of TASKS_RCU via "select" statements. Which means that as subsystems require TASKS_RCU, those subsystems will need to add "select" statements of their own. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-26sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsemTejun Heo1-0/+1
The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds small overhead to thread creation, exit and exec paths, forces cgroup code paths to do lock-verify-unlock-retry dance in a couple places and makes it impossible to atomically perform operations across multiple processes. This patch replaces signal_struct->group_rwsem with a global percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one. This does make writer side heavier and lower the granularity; however, cgroup process migration is a fairly cold path, we do want to optimize thread operations over it and cgroup migration operations don't take enough time for the lower granularity to matter. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2015-05-20module: add extra argument for parse_params() callbackLuis R. Rodriguez1-10/+15
This adds an extra argument onto parse_params() to be used as a way to make the unused callback a bit more useful and generic by allowing the caller to pass on a data structure of its choice. An example use case is to allow us to easily make module parameters for every module which we will do next. @ parse @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ extern char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )); @ parse_mod @ identifier name, args, params, num, level_min, level_max; identifier unknown, param, val, doing; type s16; @@ char *parse_args(const char *name, char *args, const struct kernel_param *params, unsigned num, s16 level_min, s16 level_max, + void *arg, int (*unknown)(char *param, char *val, const char *doing + , void *arg )) { ... } @ parse_args_found @ expression R, E1, E2, E3, E4, E5, E6; identifier func; @@ ( R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | R = parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, &func); | parse_args(E1, E2, E3, E4, E5, E6, + NULL, NULL); ) @ parse_args_unused depends on parse_args_found @ identifier parse_args_found.func; @@ int func(char *param, char *val, const char *unused + , void *arg ) { ... } @ mod_unused depends on parse_args_found @ identifier parse_args_found.func; expression A1, A2, A3; @@ - func(A1, A2, A3); + func(A1, A2, A3, NULL); Generated-by: Coccinelle SmPL Cc: cocci@systeme.lip6.fr Cc: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Felipe Contreras <felipe.contreras@gmail.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jean Delvare <jdelvare@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jani Nikula <jani.nikula@intel.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-08perf_event: Don't allow vmalloc() backed perf on powerpcMichael Ellerman1-1/+1
On powerpc the perf event interrupt is not masked when interrupts are disabled, allowing it to function as an NMI. This causes problems if perf is using vmalloc. If we take a page fault on the vmalloc region the fault handler will fail the page fault because it detects we are coming in from an NMI (see do_hash_page()). We don't actually need or want vmalloc backed perf so just disable it on powerpc. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <linuxppc-dev@ozlabs.org> Cc: Andrew Morton <akpm@osdl.org> Cc: Anton Blanchard <anton@samba.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@ghostprotocols.net Cc: sukadev@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1430720799-18426-1-git-send-email-mpe@ellerman.id.au Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-05init: fix regression by supporting devices with major:minor:offset formatChen Yu1-2/+3
Commit 283e7ad02 ("init: stricter checking of major:minor root= values") was so strict that it exposed the fact that a previously unknown device format was being used. Distributions like Ubuntu uses klibc (rather than uswsusp) to resume system from hibernation. klibc expressed the swap partition/file in the form of major:minor:offset. For example, 8:3:0 represents a swap partition in klibc, and klibc's resume process in initrd will finally echo 8:3:0 to /sys/power/resume for manually resuming. However, due to commit 283e7ad02's stricter checking, 8:3:0 will be treated as an invalid device format, and manual resuming from hibernation will fail. Fix this by adding support for devices with major:minor:offset format when resuming from hibernation. Reported-by: Prigent, Christophe <christophe.prigent@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-17kernel/fork.c: new function for max_threadsHeinrich Schuchardt1-2/+2
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the THREAD_SIZE. E.g. architecture hexagon may have page size 1M and thread size 4096. This would lead to a division by zero in the calculation of max_threads. With this patch the buggy code is moved to a separate function set_max_threads. The error is not fixed. After fixing the problem in a separate patch the new function can be reused to adjust max_threads after adding or removing memory. Argument mempages of function fork_init() is removed as totalram_pages is an exported symbol. The creation of separate patches for refactoring to a new function and for fixing the logic was suggested by Ingo Molnar. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15kernel: conditionally support non-root users, groups and capabilitiesIulia Manda1-1/+18
There are a lot of embedded systems that run most or all of their functionality in init, running as root:root. For these systems, supporting multiple users is not necessary. This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for non-root users, non-root groups, and capabilities optional. It is enabled under CONFIG_EXPERT menu. When this symbol is not defined, UID and GID are zero in any possible case and processes always have all capabilities. The following syscalls are compiled out: setuid, setregid, setgid, setreuid, setresuid, getresuid, setresgid, getresgid, setgroups, getgroups, setfsuid, setfsgid, capget, capset. Also, groups.c is compiled out completely. In kernel/capability.c, capable function was moved in order to avoid adding two ifdef blocks. This change saves about 25 KB on a defconfig build. The most minimal kernels have total text sizes in the high hundreds of kB rather than low MB. (The 25k goes down a bit with allnoconfig, but not that much. The kernel was booted in Qemu. All the common functionalities work. Adding users/groups is not possible, failing with -ENOSYS. Bloat-o-meter output: add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15init: stricter checking of major:minor root= valuesDan Ehrenberg1-1/+2
In the kernel command-line, previously, root=1:2jakshflaksjdhfa would be accepted and interpreted just like root=1:2. This patch adds stricter checking so that additional characters after major:minor are rejected by root=. The goal of this change is to help in unifying DM's interpretation of its block device argument by using existing kernel code (name_to_dev_t). But DM rejects malformed major:minor pairs, it seems reasonable for root= to reject them as well. Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15init: export name_to_dev_t and mark name argument as constDan Ehrenberg1-1/+2
DM will switch its device lookup code to using name_to_dev_t() so it must be exported. Also, the @name argument should be marked const. Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-14lib/ioremap.c: add huge I/O map capability interfacesToshi Kani1-0/+2
Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which return 1 when I/O mappings with pud/pmd are enabled on the kernel. ioremap_huge_init() calls arch_ioremap_pud_supported() and arch_ioremap_pmd_supported() to initialize the capabilities at boot-time. A new kernel option "nohugeiomap" is also added, so that user can disable the huge I/O map capabilities when necessary. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-13cpu: Defer smpboot kthread unparking until CPU known to schedulerPaul E. McKenney1-0/+1
Currently, smpboot_unpark_threads() is invoked before the incoming CPU has been added to the scheduler's runqueue structures. This might potentially cause the unparked kthread to run on the wrong CPU, since the correct CPU isn't fully set up yet. That causes a sporadic, hard to debug boot crash triggering on some systems, reported by Borislav Petkov, and bisected down to: 2a442c9c6453 ("x86: Use common outgoing-CPU-notification code") This patch places smpboot_unpark_threads() in a CPU hotplug notifier with priority set so that these kthreads are unparked just after the CPU has been added to the runqueues. Reported-and-tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-11Documentation/memcg: update memcg/kmem statusVladimir Davydov1-6/+0
Memcg/kmem reclaim support has been finally merged. Reflect this in the documentation. Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-04-02bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurableIngo Molnar1-1/+1
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only dependency, it also depends on the tracing infrastructure and on kprobes, without which it will fail to build with: In file included from kernel/trace/bpf_trace.c:14:0: kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’: kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’ unsigned int val = current->trace_recursion; [...] It took quite some time to trigger this build failure, because right now BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL more configurable, not just under CONFIG_EXPERT. If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing gateway as well. We might want to make this an interactive option later on, although I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of an indicator that the user wants BPF support. Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-06init/main: fix reset_device commentFrans Klaver1-1/+1
Fix incorrect spelling of situation. Signed-off-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-05cpuset: initialize cpuset a bit earlyZefan Li1-1/+1
Now we call ss->bind() in cgroup_init(), so cgroup_init() will call cpuset_bind() and then the latter will access top_cpuset's cpumask, which is NULL, because cpuset_init() is called after cgroup_init() The simplest fix is to swap cgroup_init() and cpuset_init(). Cc: Vladimir Davydov <vdavydov@parallels.com> Fixes: 295458e67284 ("cgroup: call cgroup_subsys->bind on cgroup subsys initialization") Reported by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov@parallels.com>