aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/pps (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2009-12-15hugetlb: handle memory hot-plug eventsLee Schermerhorn2-6/+50
Register per node hstate attributes only for nodes with memory. As suggested by David Rientjes. With Memory Hotplug, memory can be added to a memoryless node and a node with memory can become memoryless. Therefore, add a memory on/off-line notifier callback to [un]register a node's attributes on transition to/from memoryless state. N.B., Only tested build, boot, libhugetlbfs regression. i.e., no memory hotplug testing. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlinedDavid Rientjes3-6/+27
When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if there are no present pages left. In such a situation, kswapd must also be stopped since it has nothing left to do. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: use only nodes with memory for huge pagesLee Schermerhorn2-23/+24
Register per node hstate sysfs attributes only for nodes with memory. Global replacement of 'all online nodes" with "all nodes with memory" in mm/hugetlb.c. Suggested by David Rientjes. A subsequent patch will handle adding/removing of per node hstate sysfs attributes when nodes transition to/from memoryless state via memory hotplug. NOTE: this patch has not been tested with memoryless nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: update hugetlb documentation for NUMA controlsLee Schermerhorn1-85/+176
Update the kernel huge tlb documentation to describe the numa memory policy based huge page management. Additionaly, the patch includes a fair amount of rework to improve consistency, eliminate duplication and set the context for documenting the memory policy interaction. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: add per node hstate attributesLee Schermerhorn3-26/+298
Add the per huge page size control/query attributes to the per node sysdevs: /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/ nr_hugepages - r/w free_huge_pages - r/o surplus_huge_pages - r/o The patch attempts to re-use/share as much of the existing global hstate attribute initialization and handling, and the "nodes_allowed" constraint processing as possible. Calling set_max_huge_pages() with no node indicates a change to global hstate parameters. In this case, any non-default task mempolicy will be used to generate the nodes_allowed mask. A valid node id indicates an update to that node's hstate parameters, and the count argument specifies the target count for the specified node. From this info, we compute the target global count for the hstate and construct a nodes_allowed node mask contain only the specified node. Setting the node specific nr_hugepages via the per node attribute effectively ignores any task mempolicy or cpuset constraints. With this patch: (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB ./ ../ free_hugepages nr_hugepages surplus_hugepages Starting from: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 0 Node 2 HugePages_Free: 0 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 0 Allocate 16 persistent huge pages on node 2: (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages [Note that this is equivalent to: numactl -m 2 hugeadmin --pool-pages-min 2M:+16 ] Yields: Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 16 Node 2 HugePages_Free: 16 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 vm.nr_hugepages = 16 Global controls work as expected--reduce pool to 8 persistent huge pages: (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages Node 0 HugePages_Total: 0 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 0 Node 1 HugePages_Free: 0 Node 1 HugePages_Surp: 0 Node 2 HugePages_Total: 8 Node 2 HugePages_Free: 8 Node 2 HugePages_Surp: 0 Node 3 HugePages_Total: 0 Node 3 HugePages_Free: 0 Node 3 HugePages_Surp: 0 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: add generic definition of NUMA_NO_NODELee Schermerhorn3-4/+9
Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers to generic header 'linux/numa.h' for use in generic code. NUMA_NO_NODE replaces bare '-1' where it's used in this series to indicate "no node id specified". Ultimately, it can be used to replace the -1 elsewhere where it is used similarly. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: derive huge pages nodes allowed from task mempolicyLee Schermerhorn5-15/+153
This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages when the pool page count is modified via the new sysctl or sysfs attribute "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows: * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch. * For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages. * For "bind" and "interleave" policy, the mempolicy's nodemask will be used. * Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback". See the updated documentation [next patch] for more information about the implications of this patch. Examples: Starting with: Node 0 HugePages_Total: 0 Node 1 HugePages_Total: 0 Node 2 HugePages_Total: 0 Node 3 HugePages_Total: 0 Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]: sysctl vm.nr_hugepages[_mempolicy]=32 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 8 Node 3 HugePages_Total: 8 Of course, we only have nr_hugepages_mempolicy with the patch, but with default mempolicy, nr_hugepages_mempolicy behaves the same as nr_hugepages. Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node, add 8 more to node 2 using: numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40 This yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The incremental 8 huge pages were restricted to node 2 by the specified mempolicy. Similarly, we can use mempolicy to free persistent huge pages from specified nodes: numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32 yields: Node 0 HugePages_Total: 4 Node 1 HugePages_Total: 4 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The 8 huge pages freed were balanced over nodes 0 and 1. [rientjes@google.com: accomodate reworked NODEMASK_ALLOC] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: factor init_nodemask_of_node()Lee Schermerhorn1-3/+8
Factor init_nodemask_of_node() out of the nodemask_of_node() macro. This will be used to populate the huge pages "nodes_allowed" nodemask for a single node when basing nodes_allowed on a preferred/local mempolicy or when a persistent huge page pool page count is modified via a per node sysfs attribute. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functionsLee Schermerhorn1-53/+72
In preparation for constraining huge page allocation and freeing by the controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer to the allocate, free and surplus adjustment functions. For now, pass NULL to indicate default behavior--i.e., use node_online_map. A subsqeuent patch will derive a non-default mask from the controlling task's numa mempolicy. Note that this method of updating the global hstate nr_hugepages under the constraint of a nodemask simplifies keeping the global state consistent--especially the number of persistent and surplus pages relative to reservations and overcommit limits. There are undoubtedly other ways to do this, but this works for both interfaces: mempolicy and per node attributes. [rientjes@google.com: fix HIGHMEM compile error] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hugetlb: rework hstate_next_node_* functionsLee Schermerhorn1-25/+45
Modify the hstate_next_node* functions to allow them to be called to obtain the "start_nid". Then, whereas prior to this patch we unconditionally called hstate_next_node_to_{alloc|free}(), whether or not we successfully allocated/freed a huge page on the node, now we only call these functions on failure to alloc/free to advance to next allowed node. Factor out the next_node_allowed() function to handle wrap at end of node_online_map. In this version, the allowed nodes include all of the online nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15nodemask: make NODEMASK_ALLOC more generalDavid Rientjes1-7/+8
This is a series of patches to provide control over the location of the allocation and freeing of persistent huge pages on a NUMA platform. Please consider for merging into mmotm. This series uses two mechanisms to constrain the nodes from which persistent huge pages are allocated: 1) the task NUMA mempolicy of the task modifying a new sysctl "nr_hugepages_mempolicy", based on a suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs attributes have been added [in V4] to each node system device under: /sys/devices/node/node[0-9]*/hugepages The per node attibutes allow direct assignment of a huge page count on a specific node, regardless of the task's mempolicy or cpuset constraints. This patch: NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary. It's perfectly reasonable to use this macro to allocate a nodemask_t, which is anonymous, either dynamically or on the stack depending on NODES_SHIFT. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15mm: move inc_zone_page_state(NR_ISOLATED) to just isolated placeKOSAKI Motohiro3-8/+11
Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed in right after isolate_page(). This patch does it. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: remove redundant parameter from do_write_kmem()Wu Fengguang1-8/+6
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: remove the "written" variable in write_kmem()Wu Fengguang1-27/+22
Also rename "len" to "sz". No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: make size_inside_page() logic straightWu Fengguang1-22/+12
Also convert more size_inside_page() users. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: cleanup unxlate_dev_mem_ptr() callsWu Fengguang1-8/+6
No behaviour change. [akpm@linux-foundation.org: cleanuplets] [akpm@linux-foundation.org: remove unused `ret'] Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: introduce size_inside_page()Wu Fengguang1-41/+19
Introduce size_inside_page() to replace duplicate /dev/mem code. Also apply it to /dev/kmem, whose alignment logic was buggy. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15/dev/mem: remove redundant test on lenWu Fengguang1-8/+6
The len test in write_kmem() is always true, so can be reduced. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()KOSAKI Motohiro1-8/+28
On ia64, the following test program exit abnormally, because glibc thread library called abort(). ======================================================== (gdb) bt #0 0xa000000000010620 in __kernel_syscall_via_break () #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1 #2 0x2000000000324090 in abort () from /lib/libc.so.6.1 #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0 #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0 #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1 ======================================================== The fact is, glibc call munmap() when thread exitng time for freeing stack, and it assume munlock() never fail. However, munmap() often make vma splitting and it with many mapcount make -ENOMEM. Oh well, that's crazy, because stack unmapping never increase mapcount. The maxcount exceeding is only temporary. internal temporary exceeding shouldn't make ENOMEM. This patch does it. test_max_mapcount.c ================================================================== #include<stdio.h> #include<stdlib.h> #include<string.h> #include<pthread.h> #include<errno.h> #include<unistd.h> #define THREAD_NUM 30000 #define MAL_SIZE (8*1024*1024) void *wait_thread(void *args) { void *addr; addr = malloc(MAL_SIZE); sleep(10); return NULL; } void *wait_thread2(void *args) { sleep(60); return NULL; } int main(int argc, char *argv[]) { int i; pthread_t thread[THREAD_NUM], th; int ret, count = 0; pthread_attr_t attr; ret = pthread_attr_init(&attr); if(ret) { perror("pthread_attr_init"); } ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); if(ret) { perror("pthread_attr_setdetachstate"); } for (i = 0; i < THREAD_NUM; i++) { ret = pthread_create(&th, &attr, wait_thread, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; ret = pthread_create(&thread[i], &attr, wait_thread2, NULL); if(ret) { fprintf(stderr, "[%d] ", count); perror("pthread_create"); } else { printf("[%d] create OK.\n", count); } count++; } sleep(3600); return 0; } ================================================================== [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: exit early when invoked with -d|--describeAlex Chiang1-2/+1
On a system with large amount of memory (256GB), invoking page-types can take quite a long time, which is unreasonable considering the user only wants a description of the flags: # time ./page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty real 0m34.285s user 0m1.966s sys 0m32.313s This is because we still walk the entire address range. Exiting early seems like a reasonble solution: # time ./page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty real 0m0.007s user 0m0.001s sys 0m0.005s Signed-off-by: Alex Chiang <achiang@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Haicheng Li <haicheng.li@intel.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: whitespace alignmentAlex Chiang1-23/+23
Align the output when page-type -h is invoked. Signed-off-by: Alex Chiang <achiang@hp.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: learn to describe flags directly from command lineAlex Chiang1-1/+20
Teach page-types to describe page flags directly from the command line. Why is this useful? For instance, if you're using memory hotplug and see this in /var/log/messages: kernel: removing from LRU failed 3836dd0/1/1e00000000000010 It would be nice to decode those page flags without staring at the source. Example usage and output: # Documentation/vm/page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty # Documentation/vm/page-types -d anon 0x0000000000001000 ____________a_____________________ anonymous # Documentation/vm/page-types -d anon,0x10 0x0000000000001010 ____D_______a_____________________ dirty,anonymous [achiang@hp.com: documentation] Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Haicheng Li <haicheng.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: unsigned cannot be less than 0 in add_page()Roel Kluin1-1/+1
If not signed, testing of the read() return value in this function will not work. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15page-types: constify read only arraysTommi Rantala1-3/+3
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15oom: dump stack and VM state when oom killer panicsDavid Rientjes1-16/+24
The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15MAINTAINERS: new kbuild maintainerMichal Marek1-1/+4
Sam was fine with handing over kbuild maintainership to me. The git trees are already in linux-next, a merge request will follow shortly. Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15hfs: fix a potential buffer overflowAmerigo Wang3-1/+21
A specially-crafted Hierarchical File System (HFS) filesystem could cause a buffer overflow to occur in a process's kernel stack during a memcpy() call within the hfs_bnode_read() function (at fs/hfs/bnode.c:24). The attacker can provide the source buffer and length, and the destination buffer is a local variable of a fixed length. This local variable (passed as "&entry" from fs/hfs/dir.c:112 and allocated on line 60) is stored in the stack frame of hfs_bnode_read()'s caller, which is hfs_readdir(). Because the hfs_readdir() function executes upon any attempt to read a directory on the filesystem, it gets called whenever a user attempts to inspect any filesystem contents. [amwang@redhat.com: modify this patch and fix coding style problems] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eugene Teo <eteo@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Anderson <anderson@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15bsdacct: fix uid/gid misreportingAlexey Dobriyan1-1/+2
commit d8e180dcd5bbbab9cd3ff2e779efcf70692ef541 "bsdacct: switch credentials for writing to the accounting file" introduced credential switching during final acct data collecting. However, uid/gid pair continued to be collected from current which became credentials of who created acct file, not who exits. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Juho K. Juopperi <jkj@kapsi.fi> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14udf: Avoid IO in udf_clear_inodeJan Kara3-22/+41
It is not very good to do IO in udf_clear_inode. First, VFS does not really expect inode to become dirty there and thus we have to write it ourselves, second, memory reclaim gets blocked waiting for IO when it does not really expect it, third, the IO pattern (e.g. on umount) resulting from writes in udf_clear_inode is bad and it slows down writing a lot. The reason why UDF needed to do IO in udf_clear_inode is that UDF standard mandates extent length to exactly match inode size. But when we allocate extents to a file or directory, we don't really know what exactly the final file size will be and thus temporarily set it to block boundary and later truncate it to exact length in udf_clear_inode. Now, this is changed to truncate to final file size in udf_release_file for regular files. For directories and symlinks, we do the truncation at the moment when learn what the final file size will be. Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-14udf: Try harder when looking for VAT inodeJan Kara1-8/+24
Some disks do not contain VAT inode in the last recorded block as required by the standard but a few blocks earlier (or the number of recorded blocks is wrong). So look for the VAT inode a bit before the end of the media. Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-14udf: Fix compilation with UDFFS_DEBUG enabledJan Kara1-1/+1
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-14i2c-core: i2c bus should support PM entries in struct dev_pm_opssonic zhang1-0/+35
Struct dev_pm_ops is not configured in current i2c bus type. i2c drivers only depends on suspend/resume entries in struct dev_pm_ops are not informed of PM suspend and resume events by i2c framework. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2009-12-14i2c: Get rid of I2C_CLIENT_MODULE_PARMJean Delvare1-35/+0
There is no user left of I2C_CLIENT_MODULE_PARM, so we can finally get rid of this ugly macro. Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Wolfram Sang <w.sang@pengutronix.de>
2009-12-14i2c: Drop I2C_CLIENT_INSMOD_2 to 8Jean Delvare19-75/+26
These macros simply declare an enum, so drivers might as well declare it themselves. This puts an end to the arbitrary limit of 8 chip types per i2c driver. Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Wolfram Sang <w.sang@pengutronix.de>
2009-12-14i2c: Drop I2C_CLIENT_INSMOD_1Jean Delvare30-121/+30
This macro simply declares an enum, so drivers might as well declare it themselves. Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Wolfram Sang <w.sang@pengutronix.de>
2009-12-14i2c: Get rid of struct i2c_client_address_dataJean Delvare50-89/+66
Struct i2c_client_address_data only contains one field at this point, which makes its usefulness questionable. Get rid of it and pass simple address lists around instead. Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Wolfram Sang <w.sang@pengutronix.de>
2009-12-14i2c: Drop the kind parameter from detect callbacksJean Delvare49-100/+84
The "kind" parameter always has value -1, and nobody is using it any longer, so we can remove it. Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Wolfram Sang <w.sang@pengutronix.de>
2009-12-14PCI: Global variable decls must match the defs in section attributesDavid Howells1-1/+1
Global variable declarations must match the definitions in section attributes as the compiler is at liberty to vary the method it uses to access a variable, depending on the section it is in. When building the FRV arch, I now see: drivers/built-in.o: In function `pci_apply_final_quirks': drivers/pci/quirks.c:2606: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o drivers/pci/quirks.c:2623: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o drivers/pci/quirks.c:2630: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o because the declaration of pci_dfl_cache_line_size in linux/pci.h does not match the definition in drivers/pci/pci.c. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14FRV: Fix no-hardware-breakpoint caseDavid Howells1-1/+1
If there is no hardware breakpoint support, modify_user_hw_breakpoint() tries to return a NULL pointer through as an 'int' return value: In file included from kernel/exit.c:53: include/linux/hw_breakpoint.h: In function 'modify_user_hw_breakpoint': include/linux/hw_breakpoint.h:96: warning: return makes integer from pointer without a cast Return 0 instead. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14Documentation: rw_lock lessons learnedWilliam Allen Simpson1-100/+84
In recent months, two different network projects erroneously strayed down the rw_lock path. Update the Documentation based upon comments by Eric Dumazet and Paul E. McKenney in those threads. Further updates await somebody else with more expertise. Changes: - Merged with extensive content by Stephen Hemminger. - Fix one of the comments by Linus Torvalds. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14x86, mce: Clean up thermal init by introducing intel_thermal_supported()Hidetoshi Seto1-5/+12
It looks better to have a common function. No change in functionality. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <4B25FDDC.407@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
2009-12-14x86, mce: Thermal monitoring depends on APIC being enabledCyrill Gorcunov1-2/+3
Add check if APIC is not disabled since thermal monitoring depends on it. As only apic gets disabled we should not try to install "thermal monitor" vector, print out that thermal monitoring is enabled and etc... Note that "Intel Correct Machine Check Interrupts" already has such a check. Also I decided to not add cpu_has_apic check into mcheck_intel_therm_init since even if it'll call apic_read on disabled apic -- it's safe here and allow us to save a few code bytes. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> LKML-Reference: <4B25FDC2.3020401@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14perf sched: Fix build failure on sparcDavid Miller2-4/+8
Here, tvec->tv_usec is "unsigned int" not "unsigned long". Since the type is different on every platform, it's probably best to just use long printf formats and cast. Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091213.235622.53363059.davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14x86: Gart: fix breakage due to IOMMU initialization cleanupYinghai Lu2-6/+8
This fixes the following breakage of the commit 75f1cdf1dda92cae037ec848ae63690d91913eac: - GART systems that don't AGP with broken BIOS and more than 4GB memory are forced to use swiotlb. They can allocate aperture by hand and use GART. - GART systems without GAP must disable GART on shutdown. - swiotlb usage is forced by the boot option, gart_iommu_hole_init() is not called, so we disable GART early_gart_iommu_check(). Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> LKML-Reference: <1260759135-6450-3-git-send-email-fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14x86: Move swiotlb initialization before dma32_free_bootmemFUJITA Tomonori1-1/+4
The commit 75f1cdf1dda92cae037ec848ae63690d91913eac introduced a bug that we initialize SWIOTLB right after dma32_free_bootmem so we wrongly steal memory area allocated for GART with broken BIOS earlier. This moves swiotlb initialization before dma32_free_bootmem(). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: yinghai@kernel.org LKML-Reference: <1260759135-6450-2-git-send-email-fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14x86: Fix build warning in arch/x86/mm/mmio-mod.cJoe Perches1-1/+1
Stephen Rothwell reported these warnings: arch/x86/mm/mmio-mod.c: In function 'print_pte': arch/x86/mm/mmio-mod.c:100: warning: too many arguments for format arch/x86/mm/mmio-mod.c:106: warning: too many arguments for format The 'fmt' was left out accidentally. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Joe Perches <joe@perches.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus <torvalds@linux-foundation.org> LKML-Reference: <1260775443.18538.16.camel@Joe-Laptop.home> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14x86: Remove usedac in feature-removal-schedule.txtFUJITA Tomonori1-7/+0
The reason of removal, "replaced by allowdac and no dac combination" is incorrect. There is no way to do the same thing with "allowdac" and "nodac" combination. The usedac option enables us to stop via_no_dac() setting forbid_dac to 1. That is, someone who uses VIA bridges can use DAC with this option even if some of VIA bridges seem to be broken about DAC. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: WANG Cong <amwang@redhat.com> Cc: gcosta@redhat.com LKML-Reference: <20091214104423X.fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14perf bench: Add "all" pseudo subsystem and "all" pseudo suiteHitoshi Mitake3-7/+57
This patch adds a new "all" pseudo subsystem and an "all" pseudo suite. These are for testing all subsystem and its all suite, or all suite of one subsystem. (This patch also contains a few trivial comment fixes for bench/* and output style fixes. I judged that there are no necessity to make them into individual patch.) Example of use: | % ./perf bench sched all # Test all suites of sched subsystem | # Running sched/messaging benchmark... | # 20 sender and receiver processes per group | # 10 groups == 400 processes run | | Total time: 0.414 [sec] | | # Running sched/pipe benchmark... | # Extecuted 1000000 pipe operations between two tasks | | Total time: 10.999 [sec] | | 10.999317 usecs/op | 90914 ops/sec | | % ./perf bench all # Test all suites of all subsystems | # Running sched/messaging benchmark... | # 20 sender and receiver processes per group | # 10 groups == 400 processes run | | Total time: 0.420 [sec] | | # Running sched/pipe benchmark... | # Extecuted 1000000 pipe operations between two tasks | | Total time: 11.741 [sec] | | 11.741346 usecs/op | 85169 ops/sec | | # Running mem/memcpy benchmark... | # Copying 1MB Bytes from 0x7ff33e920010 to 0x7ff3401ae010 ... | | 808.407437 MB/Sec Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1260691319-4683-1-git-send-email-mitake@dcl.info.waseda.ac.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14microblaze: Remove rt_sigsuspend wrapperMichal Simek3-13/+1
Generic rt_sigsuspend syscalls doesn't need any asm wrapper. Signed-off-by: Michal Simek <monstr@monstr.eu>
2009-12-14microblaze: nommu: Don't clobber R11 on syscallssteve@digidescorp.com1-2/+0
The noMMU syscall trap has a bug that causes R11 to be zero on return to userland. Remove the extra "save" of R11 responsible for the bug. Remove reloading of mode indicator because r11 already contains it. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Signed-off-by: Michal Simek <monstr@monstr.eu>