aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2012-12-17ubifs: use prandom_bytesAkinobu Mita1-5/+3
This also converts filling memory loop to use memset. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <david.laight@aculab.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eilon Greenstein <eilong@broadcom.com> Cc: Michel Lespinasse <walken@google.com> Cc: Robert Love <robert.w.love@intel.com> Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17exec: use -ELOOP for max recursion depthKees Cook4-15/+6
To avoid an explosion of request_module calls on a chain of abusive scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon as maximum recursion depth is hit, the error will fail all the way back up the chain, aborting immediately. This also has the side-effect of stopping the user's shell from attempting to reexecute the top-level file as a shell script. As seen in the dash source: if (cmd != path_bshell && errno == ENOEXEC) { *argv-- = cmd; *argv = cmd = path_bshell; goto repeat; } The above logic was designed for running scripts automatically that lacked the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC, things continue to behave as the shell expects. Additionally, when tracking recursion, the binfmt handlers should not be involved. The recursion being tracked is the depth of calls through search_binary_handler(), so that function should be exclusively responsible for tracking the depth. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: halfdog <me@halfdog.net> Cc: P J P <ppandit@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17proc: pid/status: show all supplementary groupsArtem Bityutskiy1-1/+1
We display a list of supplementary group for each process in /proc/<pid>/status. However, we show only the first 32 groups, not all of them. Although this is rare, but sometimes processes do have more than 32 supplementary groups, and this kernel limitation breaks user-space apps that rely on the group list in /proc/<pid>/status. Number 32 comes from the internal NGROUPS_SMALL macro which defines the length for the internal kernel "small" groups buffer. There is no apparent reason to limit to this value. This patch removes the 32 groups printing limit. The Linux kernel limits the amount of supplementary groups by NGROUPS_MAX, which is currently set to 65536. And this is the maximum count of groups we may possibly print. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17/proc/pid/status: add "Seccomp" fieldKees Cook1-0/+8
It is currently impossible to examine the state of seccomp for a given process. While attaching with gdb and attempting "call prctl(PR_GET_SECCOMP,...)" will work with some situations, it is not reliable. If the process is in seccomp mode 1, this query will kill the process (prctl not allowed), if the process is in mode 2 with prctl not allowed, it will similarly be killed, and in weird cases, if prctl is filtered to return errno 0, it can look like seccomp is disabled. When reviewing the state of running processes, there should be a way to externally examine the seccomp mode. ("Did this build of Chrome end up using seccomp?" "Did my distro ship ssh with seccomp enabled?") This adds the "Seccomp" line to /proc/$pid/status. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Morris <jmorris@namei.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17procfs: add VmFlags field in smaps outputCyrill Gorcunov1-0/+53
During c/r sessions we've found that there is no way at the moment to fetch some VMA associated flags, such as mlock() and madvise(). This leads us to a problem -- we don't know if we should call for mlock() and/or madvise() after restore on the vma area we're bringing back to life. This patch intorduces a new field into "smaps" output called VmFlags, where all set flags associated with the particular VMA is shown as two letter mnemonics. [ Strictly speaking for c/r we only need mlock/madvise bits but it has been said that providing just a few flags looks somehow inconsistent. So all flags are here now. ] This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as other applications may start to use these fields. The data is encoded in a somewhat awkward two letters mnemonic form, to encourage userspace to be prepared for fields being added or removed in the future. [a.p.zijlstra@chello.nl: props to use for_each_set_bit] [sfr@canb.auug.org.au: props to use array instead of struct] [akpm@linux-foundation.org: overall redesign and simplification] [akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17proc: don't show nonexistent capabilitiesAndrew Vagin1-0/+9
Without this patch it is really hard to interpret a bounding set, if CAP_LAST_CAP is unknown for a current kernel. Non-existant capabilities can not be deleted from a bounding set with help of prctl. E.g.: Here are two examples without/with this patch. CapBnd: ffffffe0fdecffff CapBnd: 00000000fdecffff I suggest to hide non-existent capabilities. Here is two reasons. * It's logically and easier for using. * It helps to checkpoint-restore capabilities of tasks, because tasks can be restored on another kernel, where CAP_LAST_CAP is bigger. Signed-off-by: Andrew Vagin <avagin@openvz.org> Cc: Andrew G. Morgan <morgan@kernel.org> Reviewed-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17fs/fat: strip "cp" prefix from codepage in displayDave Reisner1-1/+2
Option parsing code expects an unsigned integer for the codepage option, but prefixes and stores this option with "cp" before passing to load_nls(). This makes the displayed option in /proc an invalid one. Strip the prefix when printing so that the displayed option is valid for reuse. Signed-off-by: Dave Reisner <dreisner@archlinux.org> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17fat: ix mount option parsingJan Kara1-11/+11
parse_options() is supposed to return value < 0 on error however we returned 0 (success) in a lot of cases. This actually was not a problem in practice because match_token() used by parse_options() is clever and catches most of the problems for us. Signed-off-by: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17fat: provide option for setting timezone offsetJan Kara3-9/+28
So far FAT either offsets time stamps by sys_tz.minuteswest or leaves them as they are (when tz=UTC mount option is used). However in some cases it is useful if one can specify time stamp offset on his own (e.g. when time zone of the camera connected is different from time zone of the computer, or when HW clock is in UTC and thus sys_tz.minuteswest == 0). So provide a mount option time_offset= which allows user to specify offset in minutes that should be applied to time stamps on the filesystem. akpm: this code would work incorrectly when used via `mount -o remount', because cached inodes would not be updated. But fatfs's fat_remount() is basically a no-op anyway. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17fat: notify when discard is not supportedNamjae Jeon1-0/+9
Change fatfs so that a warning is emitted when an attempt is made to mount a filesystem with the unsupported `discard' option. ext4 aready does this: http://patchwork.ozlabs.org/patch/192668/ Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17binfmt_elf: fix corner case kfree of uninitialized dataAlan Cox1-1/+3
If elf_core_dump() is called and fill_note_info() fails in the kmalloc() then it returns 0 but has not yet initialised all the needed fields. As a result we do a kfree(randomness) after correctly skipping the thread data. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17procfs: use kbasename()Andy Shevchenko1-5/+1
[yongjun_wei@trendmicro.com.cn: remove duplicated include] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17fs/notify/inode_mark.c: make fsnotify_find_inode_mark_locked() staticTushar Behera1-2/+3
Fixes following sparse warning: fs/notify/inode_mark.c:127:22: warning: symbol 'fsnotify_find_inode_mark_locked' was not declared. Should it be static? Signed-off-by: Tushar Behera <tushar.behera@linaro.org> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17lseek: the "whence" argument is called "whence"Andrew Morton21-94/+94
But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds4-10/+14
Pull ext3, udf, quota fixes from Jan Kara: "Some ext3 & quota cleanups and couple of udf fixes" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Use the pre-processor to compile out quotactl_cmd_write when !CONFIG_BLOCK ext3: drop if around WARN_ON ext3: get rid of the duplicate code on ext3_fill_super udf: remove un-needed variable from inode_getblk udf: don't increment lenExtents while writing to a hole udf: fix memory leak while allocating blocks during write
2012-12-16Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds29-1080/+4047
Pull ext4 update from Ted Ts'o: "There are two major features for this merge window. The first is inline data, which allows small files or directories to be stored in the in-inode extended attribute area. (This requires that the file system use inodes which are at least 256 bytes or larger; 128 byte inodes do not have any room for in-inode xattrs.) The second new feature is SEEK_HOLE/SEEK_DATA support. This is enabled by the extent status tree patches, and this infrastructure will be used to further optimize ext4 in the future. Beyond that, we have the usual collection of code cleanups and bug fixes." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (63 commits) ext4: zero out inline data using memset() instead of empty_zero_page ext4: ensure Inode flags consistency are checked at build time ext4: Remove CONFIG_EXT4_FS_XATTR ext4: remove unused variable from ext4_ext_in_cache() ext4: remove redundant initialization in ext4_fill_super() ext4: remove redundant code in ext4_alloc_inode() ext4: use sync_inode_metadata() when syncing inode metadata ext4: enable ext4 inline support ext4: let fallocate handle inline data correctly ext4: let ext4_truncate handle inline data correctly ext4: evict inline data out if we need to strore xattr in inode ext4: let fiemap work with inline data ext4: let ext4_rename handle inline dir ext4: let empty_dir handle inline dir ext4: let ext4_delete_entry() handle inline data ext4: make ext4_delete_entry generic ext4: let ext4_find_entry handle inline data ext4: create a new function search_dir ext4: let ext4_readdir handle inline data ext4: let add_dir_entry handle inline data properly ...
2012-12-16Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-securityLinus Torvalds2-16/+8
Pull security subsystem updates from James Morris: "A quiet cycle for the security subsystem with just a few maintenance updates." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: Smack: create a sysfs mount point for smackfs Smack: use select not depends in Kconfig Yama: remove locking from delete path Yama: add RCU to drop read locking drivers/char/tpm: remove tasklet and cleanup KEYS: Use keyring_alloc() to create special keyrings KEYS: Reduce initial permissions on keys KEYS: Make the session and process keyrings per-thread seccomp: Make syscall skipping and nr changes more consistent key: Fix resource leak keys: Fix unreachable code KEYS: Add payload preparsing opportunity prior to key instantiate or update
2012-12-15Merge tag 'for-v3.8' of git://git.infradead.org/users/cbou/linux-pstoreLinus Torvalds2-12/+34
Pull pstore update from Anton Vorontsov: "Here are just a few fixups for the pstore subsystem, nothing special this time" * tag 'for-v3.8' of git://git.infradead.org/users/cbou/linux-pstore: pstore/ftrace: Adjust for ftrace_ops->func prototype change pstore/ram: Fix bounds checks for mem_size, record_size, console_size and ftrace_size pstore/ram: Fix undefined usage of rounddown_pow_of_two(0) pstore/ram: Fixup section annotations
2012-12-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmwLinus Torvalds16-194/+380
Pull GFS2 updates from Steven Whitehouse: "The main feature this time is the new Orlov allocator and the patches leading up to it which allow us to allocate new inodes from their own allocation context, rather than borrowing that of their parent directory. It is this change which then allows us to choose a different location for subdirectories when required. This works exactly as per the ext3 implementation from the users point of view. In addition to that, we've got a speed up in gfs2_rbm_from_block() from Bob Peterson, three locking related improvements from Dave Teigland plus a selection of smaller bug fixes and clean ups." * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Set gl_object during inode create GFS2: add error check while allocating new inodes GFS2: don't reference inode's glock during block allocation trace GFS2: remove redundant lvb pointer GFS2: only use lvb on glocks that need it GFS2: skip dlm_unlock calls in unmount GFS2: Fix one RG corner case GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode GFS2: Use dirty_inode in gfs2_dir_add GFS2: Fix truncation of journaled data files GFS2: Add Orlov allocator GFS2: Use proper allocation context for new inodes GFS2: Add test for resource group congestion status GFS2: Rename glops go_xmote_th to go_sync GFS2: Speed up gfs2_rbm_from_block GFS2: Review bug traps in glops.c
2012-12-13Merge branch 'autofs' (patches from Ian Kent)Linus Torvalds2-51/+41
Merge emailed autofs cleanup/fix patches from Ian Kent * autofs: autofs4 - use simple_empty() for empty directory check autofs4 - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount
2012-12-13autofs4 - use simple_empty() for empty directory checkIan Kent1-17/+5
For direct (and offset) mounts, if an automounted mount is manually umounted the trigger mount dentry can appear non-empty causing it to not trigger mounts. This can also happen if there is a file handle leak in a user space automounting application. This happens because, when a ioctl control file handle is opened on the mount, a cursor dentry is created which causes list_empty() to see the dentry as non-empty. Since there is a case where listing the directory of these dentrys is needed, the use of dcache_dir_*() functions for .open() and .release() is needed. Consequently simple_empty() must be used instead of list_empty() when checking for an empty directory. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-13autofs4 - dont clear DCACHE_NEED_AUTOMOUNT on rootless mountIan Kent2-34/+36
The DCACHE_NEED_AUTOMOUNT flag is cleared on mount and set on expire for autofs rootless multi-mount dentrys to prevent unnecessary calls to ->d_automount(). Since DCACHE_MANAGE_TRANSIT is always set on autofs dentrys ->d_managed() is always called so the check can be done in ->d_manage() without the need to change the flag. This still avoids unnecessary calls to ->d_automount(), adds negligible overhead and eliminates a seriously ugly check in the expire code. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-13Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds4-10/+6
Merge misc VM changes from Andrew Morton: "The rest of most-of-MM. The other MM bits await a slab merge. This patch includes the addition of a huge zero_page. Not a performance boost but it an save large amounts of physical memory in some situations. Also a bunch of Fujitsu engineers are working on memory hotplug. Which, as it turns out, was badly broken. About half of their patches are included here; the remainder are 3.8 material." However, this merge disables CONFIG_MOVABLE_NODE, which was totally broken. We don't add new features with "default y", nor do we add Kconfig questions that are incomprehensible to most people without any help text. Does the feature even make sense without compaction or memory hotplug? * akpm: (54 commits) mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic() mm/memory.c: remove unused code from do_wp_page() asm-generic, mm: pgtable: consolidate zero page helpers mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage hwpoison, hugetlbfs: fix RSS-counter warning hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage mm: protect against concurrent vma expansion memcg: do not check for mm in __mem_cgroup_count_vm_event tmpfs: support SEEK_DATA and SEEK_HOLE (reprise) mm: provide more accurate estimation of pages occupied by memmap fs/buffer.c: remove redundant initialization in alloc_page_buffers() fs/buffer.c: do not inline exported function writeback: fix a typo in comment mm: introduce new field "managed_pages" to struct zone mm, oom: remove statically defined arch functions of same name mm, oom: remove redundant sleep in pagefault oom handler mm, oom: cleanup pagefault oom handler memory_hotplug: allow online/offline memory to result movable node numa: add CONFIG_MOVABLE_NODE for movable-dedicated node mm, memcg: avoid unnecessary function call when memcg is disabled ...
2012-12-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds15-17/+16
Pull trivial branch from Jiri Kosina: "Usual stuff -- comment/printk typo fixes, documentation updates, dead code elimination." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) HOWTO: fix double words typo x86 mtrr: fix comment typo in mtrr_bp_init propagate name change to comments in kernel source doc: Update the name of profiling based on sysfs treewide: Fix typos in various drivers treewide: Fix typos in various Kconfig wireless: mwifiex: Fix typo in wireless/mwifiex driver messages: i2o: Fix typo in messages/i2o scripts/kernel-doc: check that non-void fcts describe their return value Kernel-doc: Convention: Use a "Return" section to describe return values radeon: Fix typo and copy/paste error in comments doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c various: Fix spelling of "asynchronous" in comments. Fix misspellings of "whether" in comments. eisa: Fix spelling of "asynchronous". various: Fix spelling of "registered" in comments. doc: fix quite a few typos within Documentation target: iscsi: fix comment typos in target/iscsi drivers treewide: fix typo of "suport" in various comments and Kconfig treewide: fix typo of "suppport" in various comments ...
2012-12-13quota: Use the pre-processor to compile out quotactl_cmd_write when !CONFIG_BLOCKLee Jones1-0/+4
quotactl_cmd_write() is only ever invoked when BLOCK is configured. When !CONFIG_BLOCK, the build warning below is displayed. Let's fix that. fs/quota/quota.c:311:12: warning: ‘quotactl_cmd_write’ defined but not used [-Wunused-function] Cc: Jan Kara <jack@suse.cz> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-13ext3: drop if around WARN_ONJulia Lawall1-2/+1
Just use WARN_ON rather than an if containing only WARN_ON(1). A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression e; @@ - if (e) WARN_ON(1); + WARN_ON(e); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-13ext3: get rid of the duplicate code on ext3_fill_superZhao Hongjiang1-3/+0
Setting s_mount_opt to 0 is unnecessary because we use kzalloc() for sb allocation. s_resuid and s_resgid are set again few lines below based on values in on disk superblock. Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-13udf: remove un-needed variable from inode_getblkNamjae Jeon1-3/+0
The variable last_block is not needed. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-13udf: don't increment lenExtents while writing to a holeNamjae Jeon1-2/+5
Incrementing lenExtents even while writing to a hole is bad for performance as calls to udf_discard_prealloc and udf_truncate_tail_extent would not return from start if isize != lenExtents Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-13udf: fix memory leak while allocating blocks during writeNamjae Jeon1-0/+4
Need to brelse the buffer_head stored in cur_epos and next_epos. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>
2012-12-12pstore/ftrace: Adjust for ftrace_ops->func prototype changeAnton Vorontsov1-1/+3
This commit fixes the following warning: fs/pstore/ftrace.c:51:2: warning: initialization from incompatible pointer type [enabled by default] fs/pstore/ftrace.c:51:2: warning: (near initialization for ‘pstore_ftrace_ops.func’) [enabled by defaula Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
2012-12-12pstore/ram: Fix bounds checks for mem_size, record_size, console_size and ftrace_sizeArve Hjønnevåg1-2/+16
The bounds check in ramoops_init_prz was incorrect and ramoops_init_przs had no check. Additionally, ramoops_init_przs allows record_size to be 0, but ramoops_pstore_write_buf would always crash in this case. Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
2012-12-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-4/+5
Pull networking changes from David Miller: 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang. 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet. 3) Networking user namespace support from Eric W. Biederman. 4) tuntap/virtio-net multiqueue support by Jason Wang. 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis. 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann. 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger. 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet. 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy. 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heurstics were choosen a decade ago. From Eric Dumazet. 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse. 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens. 13) Add "oops_only" mode to netconsole, from Amerigo Wang. 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend. 15) Support PTP in the Tigon3 driver, from Matt Carlson. 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin. 17) Support per-association statistics in SCTP, from Michele Baldessari. And many, many, driver updates, cleanups, and improvements. Too numerous to mention individually. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) net/mlx4_en: Add support for destination MAC in steering rules net/mlx4_en: Use generic etherdevice.h functions. net: ethtool: Add destination MAC address to flow steering API bridge: add support of adding and deleting mdb entries bridge: notify mdb changes via netlink ndisc: Unexport ndisc_{build,send}_skb(). uapi: add missing netconf.h to export list pkt_sched: avoid requeues if possible solos-pci: fix double-free of TX skb in DMA mode bnx2: Fix accidental reversions. bna: Driver Version Updated to 3.1.2.1 bna: Firmware update bna: Add RX State bna: Rx Page Based Allocation bna: TX Intr Coalescing Fix bna: Tx and Rx Optimizations bna: Code Cleanup and Enhancements ath9k: check pdata variable before dereferencing it ath5k: RX timestamp is reported at end of frame ath9k_htc: RX timestamp is reported at end of frame ...
2012-12-12fs/buffer.c: remove redundant initialization in alloc_page_buffers()Yan Hong1-3/+0
buffer_head comes from kmem_cache_zalloc(), no need to zero its fields. Signed-off-by: Yan Hong <clouds.yan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12fs/buffer.c: do not inline exported functionYan Hong1-2/+1
It makes no sense to inline an exported function. Signed-off-by: Yan Hong <clouds.yan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12writeback: fix a typo in commentYan Hong1-1/+1
Signed-off-by: Yan Hong <clouds.yan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12procfs: use N_MEMORY instead N_HIGH_MEMORYLai Jiangshan2-3/+3
N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12thp: change split_huge_page_pmd() interfaceKirill A. Shutemov1-1/+1
Pass vma instead of mm and add address parameter. In most cases we already have vma on the stack. We provides split_huge_page_pmd_mm() for few cases when we have mm, but not vma. This change is preparation to huge zero pmd splitting implementation. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signalLinus Torvalds12-71/+38
Pull big execve/kernel_thread/fork unification series from Al Viro: "All architectures are converted to new model. Quite a bit of that stuff is actually shared with architecture trees; in such cases it's literally shared branch pulled by both, not a cherry-pick. A lot of ugliness and black magic is gone (-3KLoC total in this one): - kernel_thread()/kernel_execve()/sys_execve() redesign. We don't do syscalls from kernel anymore for either kernel_thread() or kernel_execve(): kernel_thread() is essentially clone(2) with callback run before we return to userland, the callbacks either never return or do successful do_execve() before returning. kernel_execve() is a wrapper for do_execve() - it doesn't need to do transition to user mode anymore. As a result kernel_thread() and kernel_execve() are arch-independent now - they live in kernel/fork.c and fs/exec.c resp. sys_execve() is also in fs/exec.c and it's completely architecture-independent. - daemonize() is gone, along with its parts in fs/*.c - struct pt_regs * is no longer passed to do_fork/copy_process/ copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump. - sys_fork()/sys_vfork()/sys_clone() unified; some architectures still need wrappers (ones with callee-saved registers not saved in pt_regs on syscall entry), but the main part of those suckers is in kernel/fork.c now." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits) do_coredump(): get rid of pt_regs argument print_fatal_signal(): get rid of pt_regs argument ptrace_signal(): get rid of unused arguments get rid of ptrace_signal_deliver() arguments new helper: signal_pt_regs() unify default ptrace_signal_deliver flagday: kill pt_regs argument of do_fork() death to idle_regs() don't pass regs to copy_process() flagday: don't pass regs to copy_thread() bfin: switch to generic vfork, get rid of pointless wrappers xtensa: switch to generic clone() openrisc: switch to use of generic fork and clone unicore32: switch to generic clone(2) score: switch to generic fork/vfork/clone c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone() take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h mn10300: switch to generic fork/vfork/clone h8300: switch to generic fork/vfork/clone tile: switch to generic clone() ... Conflicts: arch/microblaze/include/asm/Kbuild
2012-12-12Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds19-954/+787
Pull CIFS fixes from Steve French: "This includes a set of misc. cifs fixes (most importantly some byte range lock related write fixes from Pavel, and some ACL and idmap related fixes from Jeff) but also includes the SMB2.02 dialect enablement, and a key fix for SMB3 mounts. Default authentication upgraded to ntlmv2 for cifs (it was already ntlmv2 for smb2)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (43 commits) CIFS: Fix write after setting a read lock for read oplock files cifs: parse the device name into UNC and prepath cifs: fix up handling of prefixpath= option cifs: clean up handling of unc= option cifs: fix SID binary to string conversion fix "disabling echoes and oplocks" on SMB2 mounts Do not send SMB2 signatures for SMB3 frames cifs: deal with id_to_sid embedded sid reply corner case cifs: fix hardcoded default security descriptor length cifs: extra sanity checking for cifs.idmap keys cifs: avoid extra allocation for small cifs.idmap keys cifs: simplify id_to_sid and sid_to_id mapping code CIFS: Fix possible data coherency problem after oplock break to None CIFS: Do not permit write to a range mandatory locked with a read lock cifs: rename cifs_readdir_lookup to cifs_prime_dcache and make it void return cifs: Add CONFIG_CIFS_DEBUG and rename use of CIFS_DEBUG cifs: Make CIFS_DEBUG possible to undefine SMB3 mounts fail with access denied to some servers cifs: Remove unused cEVENT macro cifs: always zero out smb_vol before parsing options ...
2012-12-12Merge tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds69-2308/+3737
Pull xfs update from Ben Myers: "There is plenty going on, including the cleanup of xfssyncd, metadata verifiers, CRC infrastructure for the log, tracking of inodes with speculative allocation, a cleanup of xfs_fs_subr.c, fixes for XFS_IOC_ZERO_RANGE, and important fix related to log replay (only update the last_sync_lsn when a transaction completes), a fix for deadlock on AGF buffers, documentation and comment updates, and a few more cleanups and fixes. Details: - remove the xfssyncd mess - only update the last_sync_lsn when a transaction completes - zero allocation_args on the kernel stack - fix AGF/alloc workqueue deadlock - silence uninitialised f.file warning - Update inode alloc comments - Update mount options documentation - report projid32bit feature in geometry call - speculative preallocation inode tracking - fix attr tree double split corruption - fix broken error handling in xfs_vm_writepage - drop buffer io reference when a bad bio is built - add more attribute tree trace points - growfs infrastructure changes for 3.8 - fs/xfs/xfs_fs_subr.c die die die - add CRC infrastructure - add CRC checks to the log - Remove description of nodelaylog mount option from xfs.txt - inode allocation should use unmapped buffers - byte range granularity for XFS_IOC_ZERO_RANGE - fix direct IO nested transaction deadlock - fix stray dquot unlock when reclaiming dquots - fix sparse reported log CRC endian issue" Fix up trivial conflict in fs/xfs/xfs_fsops.c due to the same patch having been applied twice (commits eaef854335ce and 1375cb65e87b: "xfs: growfs: don't read garbage for new secondary superblocks") with later updates to the affected code in the XFS tree. * tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs: (78 commits) xfs: fix sparse reported log CRC endian issue xfs: fix stray dquot unlock when reclaiming dquots xfs: fix direct IO nested transaction deadlock. xfs: byte range granularity for XFS_IOC_ZERO_RANGE xfs: inode allocation should use unmapped buffers. xfs: Remove the description of nodelaylog mount option from xfs.txt xfs: add CRC checks to the log xfs: add CRC infrastructure xfs: convert buffer verifiers to an ops structure. xfs: connect up write verifiers to new buffers xfs: add pre-write metadata buffer verifier callbacks xfs: add buffer pre-write callback xfs: Add verifiers to dir2 data readahead. xfs: add xfs_da_node verification xfs: factor and verify attr leaf reads xfs: factor dir2 leaf read xfs: factor out dir2 data block reading xfs: factor dir2 free block reading xfs: verify dir2 block format buffers xfs: factor dir2 block read operations ...
2012-12-12Merge tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlmLinus Torvalds5-14/+47
Pull dlm updates from David Teigland: "This set fixes some conditions in which value blocks are invalidated, and includes two trivial cleanups." * tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: fix lvb invalidation conditions fs/dlm: remove CONFIG_EXPERIMENTAL dlm: remove unused variable in *dlm_lowcomms_get_buffer()
2012-12-12Merge tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linuxLinus Torvalds4-14/+17
Pull pstore fixes from Tony Luck: "Patch series to allow EFI variable backend to pstore to hold multiple records." * tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: efi_pstore: Add a format check for an existing variable name at erasing time efi_pstore: Add a format check for an existing variable name at reading time efi_pstore: Add a sequence counter to a variable name efi_pstore: Add ctime to argument of erase callback efi_pstore: Remove a logic erasing entries from a write callback to hold multiple logs efi_pstore: Add a logic erasing entries to an erase callback efi_pstore: Check remaining space with QueryVariableInfo() before writing data
2012-12-11Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-2/+2
Pull scheduler updates from Ingo Molnar: "The biggest change affects group scheduling: we now track the runnable average on a per-task entity basis, allowing a smoother, exponential decay average based load/weight estimation instead of the previous binary on-the-runqueue/off-the-runqueue load weight method. This will inevitably disturb workloads that were in some sort of borderline balancing state or unstable equilibrium, so an eye has to be kept on regressions. For that reason the new load average is only limited to group scheduling (shares distribution) at the moment (which was also hurting the most from the prior, crude weight calculation and whose scheduling quality wins most from this change) - but we plan to extend this to regular SMP balancing as well in the future, which will simplify and speed up things a bit. Other changes involve ongoing preparatory work to extend NOHZ to the scheduler as well, eventually allowing completely irq-free user-space execution." * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" cputime: Comment cputime's adjusting code cputime: Consolidate cputime adjustment code cputime: Rename thread_group_times to thread_group_cputime_adjusted cputime: Move thread_group_cputime() to sched code vtime: Warn if irqs aren't disabled on system time accounting APIs vtime: No need to disable irqs on vtime_account() vtime: Consolidate a bit the ctx switch code vtime: Explicitly account pending user time on process tick vtime: Remove the underscore prefix invasion sched/autogroup: Fix crash on reboot when autogroup is disabled cputime: Separate irqtime accounting from generic vtime cputime: Specialize irq vtime hooks kvm: Directly account vtime to system on guest switch vtime: Make vtime_account_system() irqsafe vtime: Gather vtime declarations to their own header file sched: Describe CFS load-balancer sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking sched: Make __update_entity_runnable_avg() fast sched: Update_cfs_shares at period edge ...
2012-12-11Merge branch 'akpm' (Andrew's patchbomb)Linus Torvalds11-78/+82
Merge misc updates from Andrew Morton: "About half of most of MM. Going very early this time due to uncertainty over the coreautounifiednumasched things. I'll send the other half of most of MM tomorrow. The rest of MM awaits a slab merge from Pekka." * emailed patches from Andrew Morton: (71 commits) memory_hotplug: ensure every online node has NORMAL memory memory_hotplug: handle empty zone when online_movable/online_kernel mm, memory-hotplug: dynamic configure movable memory and portion memory drivers/base/node.c: cleanup node_state_attr[] bootmem: fix wrong call parameter for free_bootmem() avr32, kconfig: remove HAVE_ARCH_BOOTMEM mm: cma: remove watermark hacks mm: cma: skip watermarks check for already isolated blocks in split_free_page() mm, oom: fix race when specifying a thread as the oom origin mm, oom: change type of oom_score_adj to short mm: cleanup register_node() mm, mempolicy: remove duplicate code mm/vmscan.c: try_to_freeze() returns boolean mm: introduce putback_movable_pages() virtio_balloon: introduce migration primitives to balloon pages mm: introduce compaction and migration for ballooned pages mm: introduce a common interface for balloon pages mobility mm: redefine address_space.assoc_mapping mm: adjust address_space_operations.migratepage() return code arch/sparc/kernel/sys_sparc_64.c: s/COLOUR/COLOR/ ...
2012-12-11mm, oom: change type of oom_score_adj to shortDavid Rientjes1-5/+5
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: redefine address_space.assoc_mappingRafael Aquini4-9/+9
Overhaul struct address_space.assoc_mapping renaming it to address_space.private_data and its type is redefined to void*. By this approach we consistently name the .private_* elements from struct address_space as well as allow extended usage for address_space association with other data structures through ->private_data. Also, all users of old ->assoc_mapping element are converted to reflect its new name and type change (->private_data). Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: adjust address_space_operations.migratepage() return codeRafael Aquini1-2/+2
Memory fragmentation introduced by ballooning might reduce significantly the number of 2MB contiguous memory blocks that can be used within a guest, thus imposing performance penalties associated with the reduced number of transparent huge pages that could be used by the guest workload. This patch-set follows the main idea discussed at 2012 LSFMMS session: "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/ to introduce the required changes to the virtio_balloon driver, as well as the changes to the core compaction & migration bits, in order to make those subsystems aware of ballooned pages and allow memory balloon pages become movable within a guest, thus avoiding the aforementioned fragmentation issue Following are numbers that prove this patch benefits on allowing compaction to be more effective at memory ballooned guests. Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite, running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB chunks, at every minute (inflating/deflating), while test was running: ===BEGIN stress-highalloc STRESS-HIGHALLOC highalloc-3.7 highalloc-3.7 rc4-clean rc4-patch Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%) Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%) while Rested 75.00 ( 0.00%) 80.00 ( 5.00%) MMTests Statistics: duration 3.7 3.7 rc4-clean rc4-patch User 1207.59 1207.46 System 1300.55 1299.61 Elapsed 2273.72 2157.06 MMTests Statistics: vmstat 3.7 3.7 rc4-clean rc4-patch Page Ins 3581516 2374368 Page Outs 11148692 10410332 Swap Ins 80 47 Swap Outs 3641 476 Direct pages scanned 37978 33826 Kswapd pages scanned 1828245 1342869 Kswapd pages reclaimed 1710236 1304099 Direct pages reclaimed 32207 31005 Kswapd efficiency 93% 97% Kswapd velocity 804.077 622.546 Direct efficiency 84% 91% Direct velocity 16.703 15.682 Percentage direct scans 2% 2% Page writes by reclaim 79252 9704 Page writes file 75611 9228 Page writes anon 3641 476 Page reclaim immediate 16764 11014 Page rescued immediate 0 0 Slabs scanned 2171904 2152448 Direct inode steals 385 2261 Kswapd inode steals 659137 609670 Kswapd skipped wait 1 69 THP fault alloc 546 631 THP collapse alloc 361 339 THP splits 259 263 THP fault fallback 98 50 THP collapse fail 20 17 Compaction stalls 747 499 Compaction success 244 145 Compaction failures 503 354 Compaction pages moved 370888 474837 Compaction move failure 77378 65259 ===END stress-highalloc This patch: Introduce MIGRATEPAGE_SUCCESS as the default return code for address_space_operations.migratepage() method and documents the expected return code for the same method in failure cases. Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: use vm_unmapped_area() in hugetlbfsMichel Lespinasse1-34/+8
Update the hugetlb_get_unmapped_area function to make use of vm_unmapped_area() instead of implementing a brute force search. Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLBAndi Kleen1-13/+50
There was some desire in large applications using MAP_HUGETLB or SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on others. This is useful together with NUMA policy: use 2MB interleaving on some mappings, but 1GB on local mappings. This patch extends the IPC/SHM syscall interfaces slightly to allow specifying the page size. It borrows some upper bits in the existing flag arguments and allows encoding the log of the desired page size in addition to the *_HUGETLB flag. When 0 is specified the default size is used, this makes the change fully compatible. Extending the internal hugetlb code to handle this is straight forward. Instead of a single mount it just keeps an array of them and selects the right mount based on the specified page size. When no page size is specified it uses the mount of the default page size. The change is not visible in /proc/mounts because internal mounts don't appear there. It also has very little overhead: the additional mounts just consume a super block, but not more memory when not used. I also exported the new flags to the user headers (they were previously under __KERNEL__). Right now only symbols for x86 and some other architecture for 1GB and 2MB are defined. The interface should already work for all other architectures though. Only architectures that define multiple hugetlb sizes actually need it (that is currently x86, tile, powerpc). However tile and powerpc have user configurable hugetlb sizes, so it's not easy to add defines. A program on those architectures would need to query sysfs and use the appropiate log2. [akpm@linux-foundation.org: cleanups] [rientjes@google.com: fix build] [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>