Age | Commit message | Author | Files | Lines
2010-05-27 | fix fs/sysv s_dirt handling | Al Viro | 1 | -0/+1
got broken on ->sync_fs() conversion a year ago, nobody noticed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | fat: convert to use the new truncate convention. | npiggin@suse.de | 3 | -15/+57
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | ext2: convert to use the new truncate convention. | npiggin@suse.de | 3 | -36/+119
I also have commented a possible bug in existing ext2 code, marked with XXX. Cc: linux-ext4@vger.kernel.org Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | tmpfs: convert to use the new truncate convention | npiggin@suse.de | 1 | -21/+22
Cc: Christoph Hellwig <hch@lst.de> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | fs: convert simple fs to new truncate | Nick Piggin | 5 | -19/+15
Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate sequence. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | kill spurious reference to vmtruncate | npiggin@suse.de | 10 | -22/+37
Lots of filesystems call vmtruncate despite not implementing the old ->truncate method. Switch them to use simple_setsize and add some comments about the truncate code where it seems fitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | fs: introduce new truncate sequence | npiggin@suse.de | 8 | -63/+300
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation).
Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
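The point about error handling in the message above can be illustrated with a small userspace model. This is a sketch only, not kernel code: `struct model_inode`, `model_newsize_ok`, `model_free_blocks` and `model_setattr_size` are hypothetical names, and the ordering shown (filesystem work before the size update) is a simplification chosen to show that the filesystem can see the old size and can fail the call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the few inode fields the sequence touches. */
struct model_inode {
    long long i_size;      /* current file size in bytes */
    long long max_size;    /* largest size this model filesystem accepts */
    int fail_free_blocks;  /* set to make block freeing fail, for testing */
    long long freed_from;  /* old size observed by the filesystem step */
};

/* Size check done up front, in the spirit of inode_newsize_ok. */
int model_newsize_ok(const struct model_inode *inode, long long newsize)
{
    if (newsize < 0)
        return -EINVAL;
    if (newsize > inode->max_size)
        return -EFBIG;
    return 0;
}

/* Filesystem-specific step: runs while the old i_size is still visible. */
int model_free_blocks(struct model_inode *inode, long long newsize)
{
    if (inode->fail_free_blocks)
        return -EIO;
    if (newsize < inode->i_size)
        inode->freed_from = inode->i_size;
    return 0;
}

/* ATTR_SIZE handling driven from the filesystem's own setattr. */
int model_setattr_size(struct model_inode *inode, long long newsize)
{
    int err;

    assert(inode != NULL);
    err = model_newsize_ok(inode, newsize);
    if (err)
        return err;
    err = model_free_blocks(inode, newsize);
    if (err)
        return err;           /* i_size untouched, error reaches the caller */
    inode->i_size = newsize;  /* the size changes only after every check passed */
    return 0;
}
```

In this model a failing `model_free_blocks` leaves `i_size` unchanged and its error code is returned to the caller, which is the behaviour the old setattr > vmtruncate > truncate ordering could not offer.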
2010-05-27 | fs/super: fix kernel-doc warning | Randy Dunlap | 1 | -2/+2
Fix fs/super.c kernel-doc warning and function notation: Warning(fs/super.c:957): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | fs/minix: bugfix, number of indirect block ptrs per block depends on block size | Erik van der Kouwe | 1 | -12/+15
The MINIX filesystem driver used a constant number of indirect block pointers in an indirect block. This worked only for filesystems with 1kb block, while the MINIX default block size is now 4kb. As a consequence, large files were read incorrectly on such filesystems and writing a large file would cause the filesystem to become corrupted. This patch computes the number of indirect block pointers based on the block size, making the driver work for each block size. I would like to thank Feiran Zheng ('Fam') for pointing out the cause of the corruption. Signed-off-by: Erik van der Kouwe <vdkouwe@cs.vu.nl> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
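The arithmetic behind this fix can be sketched as follows. The function names are hypothetical, and the 4-byte pointer width used in the usage note is an assumption made for illustration; the commit message does not state the pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of block pointers that fit in one indirect block. */
size_t indirect_ptrs_per_block(size_t block_size, size_t ptr_size)
{
    assert(ptr_size != 0 && block_size % ptr_size == 0);
    return block_size / ptr_size;
}

/* Data blocks reachable through one double-indirect block. */
uint64_t double_indirect_blocks(size_t block_size, size_t ptr_size)
{
    uint64_t n = indirect_ptrs_per_block(block_size, ptr_size);

    return n * n;
}
```

With assumed 4-byte pointers, a 1 KiB block holds 256 pointers and a 4 KiB block holds 1024, so a driver that hard-codes the 1 KiB figure would index the wrong slots on a 4 KiB filesystem.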
2010-05-27 | rename the generic fsync implementations | Christoph Hellwig | 23 | -32/+44
We don't name our generic fsync implementations very well currently. The no-op implementation for in-memory filesystems currently is called simple_sync_file which doesn't make too much sense to start with, and the generic one for simple filesystems is called simple_fsync which can lead to some confusion. This patch renames the generic file fsync method to generic_file_fsync to match the other generic_file_* routines it is supposed to be used with, and the no-op implementation to noop_fsync to make it obvious what to expect. In addition add some documentation for both methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | drop unused dentry argument to ->fsync | Christoph Hellwig | 69 | -157/+129
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | fs: Add missing mutex_unlock | Julia Lawall | 1 | -4/+9
Add a mutex_unlock missing on the error path. At other exits from the function that return an error flag, the mutex is unlocked, so do the same here.
The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/)
    // <smpl>
    @@
    expression E1;
    @@
    * mutex_lock(E1,...);
      <+... when != E1
      if (...) {
        ... when != E1
    *   return ...;
      }
      ...+>
    * mutex_unlock(E1,...);
    // </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
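The class of bug fixed here is commonly avoided by routing every exit through a single unlock. The sketch below models that pattern in userspace; `struct model_mutex` and the `model_*` functions are hypothetical stand-ins for the kernel mutex API and are not taken from the patched file.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a kernel mutex so the example runs in userspace. */
struct model_mutex {
    int locked;
};

void model_mutex_lock(struct model_mutex *m)
{
    assert(!m->locked);  /* the model lock is not recursive */
    m->locked = 1;
}

void model_mutex_unlock(struct model_mutex *m)
{
    assert(m->locked);
    m->locked = 0;
}

/* Every return after the lock is taken goes through the unlock at "out". */
int model_update(struct model_mutex *m, int *value, int new_value)
{
    int err = 0;

    model_mutex_lock(m);
    if (new_value < 0) {
        err = -EINVAL;   /* error path: jump to the shared unlock */
        goto out;
    }
    *value = new_value;
out:
    model_mutex_unlock(m);
    return err;
}
```

A direct `return -EINVAL;` inside the `if` would be exactly the shape the semantic match above flags: a return between the lock and the unlock with no unlock on that path.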
2010-05-27 | Fix racy use of anon_inode_getfd() in perf_event.c | Al Viro | 1 | -18/+22
once anon_inode_getfd() is called, you can't expect *anything* about struct file that descriptor points to - another thread might be doing whatever it likes with descriptor table at that point. Cc: stable <stable@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | get rid of the magic around f_count in aio | Al Viro | 4 | -15/+14
__aio_put_req() plays sick games with file refcount. What it wants is fput() from atomic context; it's almost always done with f_count > 1, so they only have to deal with delayed work in rare cases when their reference happens to be the last one. Current code decrements f_count and if it hasn't hit 0, everything is fine. Otherwise it keeps a pointer to struct file (with zero f_count!) around and has delayed work do __fput() on it. Better way to do it: use atomic_long_add_unless( , -1, 1) instead of !atomic_long_dec_and_test(). IOW, decrement it only if it's not the last reference, leave refcount alone if it was. And use normal fput() in delayed work. I've made that atomic_long_add_unless call a new helper - fput_atomic(). Drops a reference to file if it's safe to do in atomic (i.e. if that's not the last one), tells if it had been able to do that. aio.c converted to it, __fput() use is gone. req->ki_file *always* contributes to refcount now. And __fput() became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
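The "decrement only if it is not the last reference" idea can be sketched with C11 atomics. This is a userspace analogue under stated assumptions: `model_add_unless` imitates the semantics the message attributes to atomic_long_add_unless, and `model_fput_atomic` is a hypothetical name, not the kernel helper itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogue of an add-unless operation: add a to *v unless
 * *v == u. Returns true if the addition happened.
 */
bool model_add_unless(atomic_long *v, long a, long u)
{
    long cur = atomic_load(v);

    while (cur != u) {
        /* On failure cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* Drop one reference only if it is not the last one. */
bool model_fput_atomic(atomic_long *f_count)
{
    return model_add_unless(f_count, -1, 1);
}
```

When the call returns false the count is left at 1, so the caller still holds a real reference and can hand the final release to a context that is allowed to sleep.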
2010-05-27 | VFS: fix recent breakage of FS_REVAL_DOT | Neil Brown | 1 | -1/+1
Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics. In particular, before this patch, the command ls -l in an NFS mounted directory would always check if the directory on the server had changed and if so would flush and refill the pagecache for the dir. After this patch, the same "ls -l" will repeatedly return stale data until the cached attributes for the directory time out. The following patch fixes this by ensuring the d_revalidate is called by do_last when "." is being looked-up. link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN is not set so nfs_lookup_verify_inode chooses not to do any validation. The following patch restores the original behaviour. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 | Revert "anon_inode: set S_IFREG on the anon_inode" | Al Viro | 1 | -1/+1
This reverts commit a7cf4145bb86aaf85d4d4d29a69b50b688e2e49d.
2010-05-27 | [IA64] Fix build breakage | Tony Luck | 2 | -0/+21
In commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b
    cpusets: randomize node rotor used in cpuset_mem_spread_node()
Jack Steiner fixed a problem with too many small tasks being assigned to node 0. Copy his code to ia64 to avoid build error.
    arch/ia64/kernel/smpboot.c:641: error: ‘cpu_to_node_map’ undeclared (first use in this function)
In commit 3bccd996276b108c138e8176793a26ecef54d573
    numa: ia64: use generic percpu var numa_node_id() implementation
Lee Schermerhorn added some set_numa_node() calls - but these only work on CONFIG_NUMA=y configurations. Surround the calls with #ifdef CONFIG_NUMA
Signed-off-by: Tony Luck <tony.luck@intel.com>
2010-05-27 | posix_timer: Fix error path in timer_create | Andrey Vagin | 1 | -7/+4
Move CLOCK_DISPATCH(which_clock, timer_create, (new_timer)) after all possible EFAULT errors. *_timer_create may allocate/get resources (for example, posix_cpu_timer_create does get_task_struct). [ tglx: fold the remove crappy comment patch into this ] Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: <stable@kernel.org> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-27 | hwmon: (lm75) Add support for the Texas Instruments TMP105 | Shubhrajyoti Datta | 2 | -1/+4
Add support for the Texas Instruments TMP105 temperature sensor device. Signed-off-by: Shubhrajyoti Datta <shubhrajyoti@ti.com> Acked-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (ltc4245) Read only one GPIO pin | Ira W. Snyder | 2 | -16/+6
Read only one of the GPIO pins as an analog voltage. The ADC can be switched to a different GPIO pin at runtime, but this is not supported. Previously, this driver would report the analog voltage of the currently selected GPIO pin as all three GPIO voltages: in9_input, in10_input and in11_input. Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: stable@kernel.org
2010-05-27 | hwmon: (dme1737) Add SCH5127 support | Juerg Haefliger | 2 | -113/+266
Add support for the hardware monitoring capabilities of the SCH5127 chip to the dme1737 driver. Signed-off-by: Juerg Haefliger <juergh@gmail.com> Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Jeff Rickman <jrickman@myamigos.us>
2010-05-27 | hwmon: (tmp102) Don't always stop chip at exit | Jean Delvare | 1 | -10/+28
Only stop the chip at driver exit if it was stopped when driver was loaded. Leave it running otherwise. Also restore the device configuration if probe failed, to not leave the system in a dangling state. Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Steven King <sfking@fdwdc.com>
2010-05-27 | hwmon: (tmp102) Fix suspend and resume functions | Jean Delvare | 1 | -4/+12
Suspend and resume functions shouldn't overwrite the configuration register. They should only alter the one bit they have to touch. Also don't assume that register reads and writes always succeed. Handle errors properly, should they happen. Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Steven King <sfking@fdwdc.com>
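A read-modify-write that touches a single bit and propagates access errors might look like the sketch below. Everything here is a userspace model: `struct model_chip` simulates the register and bus, and `MODEL_CONF_SD` is an assumed bit position for the shutdown flag, not a value confirmed by the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_CONF_SD 0x0100  /* assumed position of the shutdown bit */

/* Simulated chip: one configuration register and a fault switch. */
struct model_chip {
    uint16_t conf;   /* simulated configuration register */
    int bus_error;   /* nonzero makes every access fail with -EIO */
};

/* Returns the register value (0..0xffff) or a negative errno. */
int model_read_conf(const struct model_chip *chip)
{
    return chip->bus_error ? -EIO : chip->conf;
}

/* Returns 0 on success or a negative errno. */
int model_write_conf(struct model_chip *chip, uint16_t val)
{
    if (chip->bus_error)
        return -EIO;
    chip->conf = val;
    return 0;
}

/* Set or clear only the shutdown bit, leaving all other bits as read. */
int model_set_shutdown(struct model_chip *chip, int shutdown)
{
    int conf = model_read_conf(chip);

    if (conf < 0)
        return conf;          /* read failed: report it, write nothing */
    if (shutdown)
        conf |= MODEL_CONF_SD;
    else
        conf &= ~MODEL_CONF_SD;
    return model_write_conf(chip, (uint16_t)conf);
}
```

A suspend handler built this way would call `model_set_shutdown(chip, 1)` and a resume handler `model_set_shutdown(chip, 0)`, each returning the error code to its caller instead of assuming success.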
2010-05-27 | hwmon: (tmp102) Various fixes | Jean Delvare | 3 | -42/+39
Fixes from my driver review: http://lists.lm-sensors.org/pipermail/lm-sensors/2010-March/028051.html
Only the small changes are in there, more important changes will come later separately as time permits.
* Drop the remnants of the now gone detect function
* The TMP102 has no known compatible chip
* Include the right header files
* Clarify why byte swapping of register values is needed
* Strip resolution info bit from temperature register value
* Set cache lifetime to 1/3 second
* Don't arbitrarily reject limit values; clamp as needed
* Make limit writing unconditional
* Don't check for transaction types the driver doesn't use
* Properly check for error when setting configuration
* Report error on failed probe
* Make the driver load automatically where needed
* Various other minor fixes
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Steven King <sfking@fdwdc.com>
2010-05-27 | hwmon: Driver for TI TMP102 temperature sensor | Steven King | 4 | -0/+335
Driver for the TI TMP102. The TI TMP102 is similar to the LM75. It differs from the LM75 by having a 16-bit conf register and the temp registers have a minimum resolution of 12 bits; the extended conf register can select 13-bit resolution (which this driver does) and also change the update rate (which this driver currently doesn't use). [JD: Fix tmp102_exit tag, must be __exit, not __init.] Signed-off-by: Steven King <sfking@fdwdc.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
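As an illustration of what 13-bit resolution means for the reported value, the sketch below converts a raw register word to millidegrees Celsius. It rests on assumptions that the message does not spell out: the reading is a two's-complement value left-justified in a 16-bit word, one 13-bit count is 0.0625 °C, and bit 0 is a resolution flag rather than data. `model_reg_to_mC` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a raw 13-bit-mode temperature register value to millidegrees C. */
int model_reg_to_mC(int16_t reg)
{
    /*
     * Drop bit 0 (assumed resolution flag). With the 13-bit reading
     * left-justified by 3 bits, one unit of the 16-bit word is 1/128 C.
     */
    return ((reg & ~0x01) * 1000) / 128;
}
```

Under these assumptions 25 °C is 400 counts, stored as 400 << 3 = 0x0C80, and the same word with the flag bit set (0x0C81) converts to the same 25000 m°C.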
2010-05-27 | hwmon: EMC1403 thermal sensor support | Kalhan Trisal | 3 | -0/+355
Provides support for the EMC1403 thermal sensor. Only reporting of values is supported. The various Moorestown specific extras to do with thermal alerts and the like are not in this version of the driver. Considerably edited and tidied up by Alan Cox, plus fixes and detection bits from Jean Delvare. Signed-off-by: Kalhan Trisal <kalhan.trisal@intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (applesmc) Add temperature sensor labels to sysfs interface | Alex Murray | 1 | -1/+147
The Apple SMC uses a systematic labeling scheme for the hardware temperature sensors. This scheme is currently hidden from userland. Since the sensor set, and consequently the numbering, differs between models, an extensive database of configurations is required for an application such as fan control. This patch adds the SMC labels to the hwmon sysfs interface, allowing applications to use the sensors more intelligibly. [rydberg@euromail.se: fixed error handling] Signed-off-by: Alex Murray <murray.alex@gmail.com> Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (applesmc) Add generic support for MacBook Pro 7 | Henrik Rydberg | 1 | -0/+9
This patch adds generic support for the MacBook Pro 7 family based on the 7,1 model. Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (applesmc) Add generic support for MacBook Pro 6 | Bernhard Froemel | 1 | -0/+10
This patch adds generic support for the MacBook Pro 6 family based on the 6,2 model. [rydberg@euromail.se: patch cleanup] Signed-off-by: Bernhard Froemel <froemel@vmars.tuwien.ac.at> Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (applesmc) Add support for MacBook Pro 5,3 and 5,4 | Henrik Rydberg | 1 | -0/+19
The MacBookPro 5,3 model has two fans, whereas the 5,4 model has only one. This patch adds explicit support for the 5,3 and 5,4 models. Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (tmp401) Reorganize code to get rid of static forward declarations | Andre Prendel | 1 | -108/+97
Signed-off-by: Andre Prendel <andre.prendel@gmx.de> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (tmp401) Use constants for sysfs file permissions | Andre Prendel | 1 | -22/+28
Replace octal representation of file permissions by the corresponding constants. Signed-off-by: Andre Prendel <andre.prendel@gmx.de> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
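The equivalence between octal modes and named constants can be checked in userspace with the POSIX macros from `<sys/stat.h>`. The `MODEL_*` names below are hypothetical; they only mirror the idea of a combined "readable by everyone" constant and are not the driver's definitions.

```c
#include <assert.h>
#include <sys/stat.h>

/* Read permission for user, group and others combined. */
#define MODEL_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Typical modes for a read-only and a read/write attribute file. */
enum {
    MODEL_ATTR_RO = MODEL_S_IRUGO,
    MODEL_ATTR_RW = S_IWUSR | MODEL_S_IRUGO
};
```

`MODEL_ATTR_RO` corresponds to octal 0444 and `MODEL_ATTR_RW` to 0644; the named form states the intent (who may read, who may write) without requiring the reader to decode the digits.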
2010-05-27 | hwmon: (adm1031) Allow setting update rate | Jean Delvare | 1 | -2/+66
Based on earlier work by Ira W. Snyder. The adm1031 chip is capable of using a runtime configurable sampling rate, using the fan filter register. Add support for reading and setting the update rate via sysfs. Signed-off-by: Jean Delvare <khali@linux-fr.org> Acked-by: Ira W. Snyder <iws@ovro.caltech.edu>
2010-05-27 | hwmon: Add description of the update_rate sysfs attribute | Ira W. Snyder | 1 | -3/+10
The update_rate attribute can be used by drivers to let userspace choose the update rate of the chip, if it is configurable. Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (lm90) Use programmed update rate | Ira W. Snyder | 1 | -1/+2
The lm90 driver programs the sensor chip to update its readings at 2 Hz (500 ms between readings). However, the driver only does reads from the chip at intervals of 2 * HZ (2000 ms between readings). Change the driver update rate to the programmed update rate. Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Signed-off-by: Jean Delvare <khali@linux-fr.org>
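The mismatch described above is a matter of tick arithmetic, which the following userspace sketch makes concrete. `MODEL_HZ` is an assumed tick rate and both functions are hypothetical; the expiry check is written in the spirit of the kernel's wraparound-safe time comparisons rather than copied from the driver.

```c
#include <assert.h>

#define MODEL_HZ 250  /* assumed tick rate; a real kernel uses its configured HZ */

/* Convert a tick count to milliseconds. */
unsigned int model_ticks_to_ms(unsigned long ticks)
{
    return (unsigned int)(ticks * 1000 / MODEL_HZ);
}

/* Cached readings are stale once more than "interval" ticks have elapsed. */
int model_cache_expired(unsigned long now, unsigned long last,
                        unsigned long interval)
{
    /* The signed difference keeps the comparison correct across wraparound. */
    return (long)(now - (last + interval)) > 0;
}
```

An interval of `2 * MODEL_HZ` ticks is 2000 ms while `MODEL_HZ / 2` ticks is 500 ms, so refreshing the cache on the shorter interval matches a chip that produces a new reading every 500 ms.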
2010-05-27 | hwmon: (f71882fg) Acquire I/O regions while we're working with them | Giel van Schijndel | 1 | -0/+8
Acquire the I/O region for the Super I/O chip while we're working on it. Signed-off-by: Giel van Schijndel <me@mortis.eu> Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (f71882fg) Code cleanup | Giel van Schijndel | 1 | -12/+6
Some code cleanup: properly use previously defined functions, rather than duplicating their code. Signed-off-by: Giel van Schijndel <me@mortis.eu> Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (f71882fg) Use strict_strto(l|ul) instead of simple_strto$1 | Giel van Schijndel | 1 | -29/+104
Use the strict_strtol and strict_strtoul functions instead of simple_strtol and simple_strtoul respectively in sysfs functions. Signed-off-by: Giel van Schijndel <me@mortis.eu> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
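The difference between a lenient and a strict parse can be shown with the C library's `strtol`. The function below is a userspace sketch with a hypothetical name, `model_strict_strtol`; it illustrates the general idea of rejecting malformed input and does not reproduce the kernel helpers' exact return values.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a whole string as a long. Unlike a bare strtol() call, reject empty
 * input, trailing garbage and out-of-range values. One trailing newline is
 * accepted because writes to attribute files usually end with one.
 */
int model_strict_strtol(const char *s, int base, long *res)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(s, &end, base);
    if (end == s)
        return -EINVAL;   /* no digits at all */
    if (errno == ERANGE)
        return -ERANGE;   /* overflow or underflow */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;   /* trailing characters after the number */
    *res = val;
    return 0;
}
```

With this check a write of "42abc" is refused, whereas a lenient parse would quietly store 42 and discard the rest.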
2010-05-27 | hwmon: (f71882fg) Fixed braces coding style issues | Giel van Schijndel | 1 | -6/+5
Fixed several coding style issues. Signed-off-by: Giel van Schijndel <me@mortis.eu> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (lm63) Add basic support for LM64 | Matthew Garrett | 3 | -9/+25
The LM64 appears to be an LM63 with added GPIO lines. Add support for the hwmon functionality - GPIO can be added at some later stage if someone has a need for them. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2010-05-27 | hwmon: (asus_atk0110) Don't load if ACPI resources aren't enforced | Jean Delvare | 3 | -0/+18
When the user passes the kernel parameter acpi_enforce_resources=lax, the ACPI resources are no longer protected, so a native driver can make use of them. In that case, we do not want the asus_atk0110 to be loaded. Unfortunately, this driver loads automatically due to its MODULE_DEVICE_TABLE, so the user ends up with two drivers loaded for the same device - this is bad. So I suggest that we prevent the asus_atk0110 driver from loading if acpi_enforce_resources=lax. Signed-off-by: Jean Delvare <khali@linux-fr.org> Acked-by: Luca Tettamanti <kronos.it@gmail.com> Cc: Len Brown <lenb@kernel.org>
2010-05-27 | Avoid warning when CPU hotplug isn't enabled | Linus Torvalds | 1 | -6/+3
Commit e9fb7631ebcd ("cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail()") also introduced this annoying warning: kernel/cpu.c:157: warning: 'cpu_notify_nofail' defined but not used when CONFIG_HOTPLUG_CPU wasn't set. So move that helper inside the #ifdef CONFIG_HOTPLUG_CPU region, and simplify it while at it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | SFI: add sysfs interface for SFI tables. | Feng Tang | 4 | -0/+167
Analogous to ACPI's /sys/firmware/acpi/tables/..., create /sys/firmware/sfi/tables/. The tables are primarily for the kernel, but sometimes it is useful for user-space to be able to read them. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-05-27 | Input: s3c2410_ts - restore accidentally dropped s3c24xx ids | Vasily Khoruzhick | 1 | -0/+2
Without the s3c24xx ids, the driver doesn't attach on s3c2410 and s3c244x. Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com> Acked-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | numa: update Documentation/vm/numa, add memoryless node info | Lee Schermerhorn | 1 | -39/+147
Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab related changes. He suggested Documentation/vm/numa for this documentation. Looking at this file, it seems to me to be hopelessly out of date relative to current Linux NUMA support. At the risk of going down a rathole, I have made an attempt to rewrite the doc at a slightly higher level [I think] and provide pointers to other in-tree documents and out-of-tree man pages that cover the details. Let the games begin. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations | Lee Schermerhorn | 1 | -2/+2
In kernel profiling requires that we be able to allocate "local" memory for each cpu. Use "cpu_to_mem()" instead of "cpu_to_node()" to support memoryless nodes. Depends on the "numa_mem_id()" patch. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | numa: slab: use numa_mem_id() for slab local memory node | Lee Schermerhorn | 1 | -21/+22
Example usage of generic "numa_mem_id()":
The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized].
This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation.
N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state.
Performance impact on "hackbench 400 process 200":
    2.6.34-rc3-mmotm-100405-1609             no-patch   this-patch
    ia64 no memoryless nodes [avg of 10]:      11.713       11.637   ~0.65 diff
    ia64 cpus all on memless nodes [10]:      228.259       26.484   ~8.6x speedup
    x86_64 [8x6 AMD] [avg of 40]:               2.883        2.845
The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | numa: ia64: support numa_mem_id() for memoryless nodes | Lee Schermerhorn | 2 | -0/+5
Enable 'HAVE_MEMORYLESS_NODES' by default when NUMA configured on ia64. Initialize percpu 'numa_mem' variable when starting secondary cpus. Generic initialization will handle the boot cpu. Nothing uses 'numa_mem_id()' yet. Subsequent patch will modify slab to use this. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 | numa: introduce numa_mem_id()- effective local memory node id | Lee Schermerhorn | 4 | -1/+114
Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes.
Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them.
Archs can override definitions of:
    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue
Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line.
[akpm@linux-foundation.org: fix build] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
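The "nearest node with memory" lookup can be pictured with a small userspace model. This is a sketch under stated assumptions: the fixed node count, the array-based fallback order and the name `model_local_memory_node` are all hypothetical and stand in for the kernel's zonelist machinery.

```c
#include <assert.h>

#define MODEL_NR_NODES 4
#define MODEL_NO_NODE  (-1)

/*
 * Return the first node in "fallback" that has memory, or MODEL_NO_NODE if
 * none does. "fallback" lists node ids nearest first, starting with the
 * local node itself.
 */
int model_local_memory_node(const int fallback[MODEL_NR_NODES],
                            const int has_memory[MODEL_NR_NODES])
{
    for (int i = 0; i < MODEL_NR_NODES; i++) {
        int node = fallback[i];

        assert(node >= 0 && node < MODEL_NR_NODES);
        if (has_memory[node])
            return node;
    }
    return MODEL_NO_NODE;
}
```

For a cpu on a memoryless node 0 whose fallback order is 0, 2, 1, 3, with memory only on nodes 1 and 3, the model returns node 1; caching that result per cpu is the role the message assigns to 'numa_mem'.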
2010-05-27 | numa: ia64: use generic percpu var numa_node_id() implementation | Lee Schermerhorn | 3 | -5/+10
ia64: Use generic percpu implementation of numa_node_id()
+ initialize per cpu 'numa_node'
+ remove ia64 cpu_to_node() macro; use generic
+ define CONFIG_USE_PERCPU_NUMA_NODE_ID when NUMA configured
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>