linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2014-05-27	ext4: fix ZERO_RANGE test failure in data journalling	Namjae Jeon	1	-0/+7
	xfstests generic/091 is failing when mounting ext4 with data=journal. I think that this regression is same problem that occurred prior to collapse range issue. So ZERO RANGE also need to call ext4_force_commit as collapse range. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-26	ext4: reduce contention on s_orphan_lock	Jan Kara	1	-44/+65
	Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that we avoid taking global s_orphan_lock in some cases and hold it for shorter time in other cases. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-26	ext4: use sbi in ext4_orphan_{add\|del}()	Jan Kara	1	-16/+15
	Use sbi pointer consistently in ext4_orphan_del() instead of opencoding it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create sbi variable for it as well and use it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()	Lukas Czerner	1	-1/+7
	In ext4_es_can_be_merged() when checking whether we can merge two extents we should use EXT_MAX_BLOCKS instead of defining it manually. Also if it is really the case we should notify userspace because clearly there is a bug in extent status tree implementation since this should never happen. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-05-12	ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access	liang xie	10	-0/+33
	Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: remove unnecessary double parentheses	Lukas Czerner	5	-11/+11
	Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails	Andrey Tsyvarev	1	-3/+1
	Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which have already existed. So, it is incorrect to destroy them when newly requested mount fails. Found by Linux File System Verification project (linuxtesting.org). Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-05-12	ext4: make local functions static	Stephen Hemminger	9	-54/+26
	I have been running make namespacecheck to look for unneeded globals, and found these in ext4. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: fix block bitmap validation when bigalloc, ^flex_bg	Darrick J. Wong	1	-5/+7
	On a bigalloc,^flex_bg filesystem, the ext4_valid_block_bitmap function fails to convert from blocks to clusters when spot-checking the validity of the bitmap block that we've just read from disk. This causes ext4 to think that the bitmap is garbage, which results in the block group being taken offline when it's not necessary. Add in the necessary EXT4_B2C() calls to perform the conversions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: fix block bitmap initialization under sparse_super2	Darrick J. Wong	2	-15/+22
	The ext4_bg_has_super() function doesn't know about the new rules for where backup superblocks go on a sparse_super2 filesystem. Therefore, block bitmap initialization doesn't know that it shouldn't reserve space for backups in groups that are never going to contain backups. The result of this is e2fsck complaining about the block bitmap being incorrect (fortunately not in a way that results in cross-linked files), so fix the whole thing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem	Darrick J. Wong	1	-0/+10
	On a filesystem with a 1k block size, the group descriptors live in block 2, not block 1. If the filesystem has bigalloc,meta_bg set, however, the calculation of the group descriptor table location does not take this into account and returns the wrong block number. Fix the calculation to return the correct value for this case. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: avoid unneeded lookup when xattr name is invalid	Zhang Zhen	1	-0/+3
	In ext4_xattr_set_handle() we have checked the xattr name's length. So we should also check it in ext4_xattr_get() to avoid unneeded lookup caused by invalid name. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12	ext4: fix data integrity sync in ordered mode	Namjae Jeon	5	-11/+29
	When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21	ext4: remove obsoleted check	Dmitry Monakhov	1	-2/+1
	BH can not be NULL at this point, ext4_read_dirblock() always return non null value, and we already have done all necessery checks. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-04-21	ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode	Theodore Ts'o	3	-19/+25
	To avoid potential data races, use a spinlock which protects the raw (on-disk) inode. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21	ext4: fix locking for O_APPEND writes	Theodore Ts'o	1	-16/+26
	Al Viro pointed out that locking for O_APPEND writes was problematic, since the location of the write isn't known until after we take the i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes check. For O_APPEND always assume that the write is unaligned so call ext4_unwritten_wait(). And to solve the second problem, take the i_mutex earlier before we start the s_bitmap_maxbytes check. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-21	ext4: factor out common code in ext4_file_write()	Theodore Ts'o	1	-34/+22
	This shouldn't change any logic flow; just delete duplicated code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21	ext4: move ext4_file_dio_write() into ext4_file_write()	Theodore Ts'o	1	-74/+64
	This commit doesn't actually change anything; it just moves code around in preparation for some code simplification work. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21	ext4: inline generic_file_aio_write() into ext4_file_write()	Theodore Ts'o	1	-4/+16
	Copy generic_file_aio_write() into ext4_file_write(). This is part of a patch series which allows us to simplify ext4_file_write() and ext4_file_dio_write(), by calling __generic_file_aio_write() directly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-20	ext4: rename uninitialized extents to unwritten	Lukas Czerner	9	-163/+163
	Currently in ext4 there is quite a mess when it comes to naming unwritten extents. Sometimes we call it uninitialized and sometimes we refer to it as unwritten. The right name for the extent which has been allocated but does not contain any written data is _unwritten_. Other file systems are using this name consistently, even the buffer head state refers to it as unwritten. We need to fix this confusion in ext4. This commit changes every reference to an uninitialized extent (meaning allocated but unwritten) to unwritten extent. This includes comments, function names and variable names. It even covers abbreviation of the word uninitialized (such as uninit) and some misspellings. This commit does not change any of the code paths at all. This has been confirmed by comparing md5sums of the assembly code of each object file after all the function names were stripped from it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20	ext4: get rid of EXT4_MAP_UNINIT flag	Lukas Czerner	4	-13/+7
	Currently EXT4_MAP_UNINIT is used in dioread_nolock case to mark the cases where we're using dioread_nolock and we're writing into either unallocated, or unwritten extent, because we need to make sure that any DIO write into that inode will wait for the extent conversion. However EXT4_MAP_UNINIT is not only entirely misleading name but also unnecessary because we can check for EXT4_MAP_UNWRITTEN in the dioread_nolock case instead. This commit removes EXT4_MAP_UNINIT flag. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20	Linux 3.15-rc2	Linus Torvalds	1	-1/+1

2014-04-20	perf tools: Improve error reporting	Adrien BAK	2	-2/+9
	In the current version, when using perf record, if something goes wrong in tools/perf/builtin-record.c:375 session = perf_session__new(file, false, NULL); The error message: "Not enough memory for reading per file header" is issued. This error message seems to be outdated and is not very helpful. This patch proposes to replace this error message by "Perf session creation failed" I believe this issue has been brought to lkml: https://lkml.org/lkml/2014/2/24/458 although this patch only tackles a (small) part of the issue. Additionnaly, this patch improves error reporting in tools/perf/util/data.c open_file_write. Currently, if the call to open fails, the user is unaware of it. This patch logs the error, before returning the error code to the caller. Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Adrien BAK <adrien.bak@metascale.org> Link: http://lkml.kernel.org/r/1397786443.3093.4.camel@beast [ Reorganize the changelog into paragraphs ] [ Added empty line after fd declaration in open_file_write ] Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2014-04-20	perf tools: Adjust symbols in VDSO	Vladimir Nikulichev	1	-0/+2
	pert-report doesn't resolve function names in VDSO: $ perf report --stdio -g flat,0.0,15,callee --sort pid ... 8.76% 0x7fff6b1fe861 __gettimeofday ACE_OS::gettimeofday() ... In this case symbol values should be adjusted the same way as for executables, relocatable objects and prelinked libraries. After fix: $ perf report --stdio -g flat,0.0,15,callee --sort pid ... 8.76% __vdso_gettimeofday __gettimeofday ACE_OS::gettimeofday() Signed-off-by: Vladimir Nikulichev <nvs@tbricks.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Link: http://lkml.kernel.org/r/969812.163009436-sendEmail@nvs Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2014-04-20	perf kvm: Fix 'Min time' counting in report command	Alexander Yarygin	1	-0/+1
	Every event in the perf-kvm has a 'stats' structure, which contains max/min/average/etc times of handling this event. The problem is that the 'perf-kvm stat report' command always shows that 'min time' is 0us for every event. Example: # perf kvm stat report Analyze events for all VCPUs: VM-EXIT Samples Samples% Time% Min Time Max Time Avg time [..] 0xB2 MSCH 12 0.07% 0.00% 0us 8us 7.31us ( +- 2.11% ) 0xB2 CHSC 12 0.07% 0.00% 0us 18us 9.39us ( +- 9.49% ) 0xB2 STPX 8 0.05% 0.00% 0us 2us 1.88us ( +- 7.18% ) 0xB2 STSI 7 0.04% 0.00% 0us 44us 16.49us ( +- 38.20% ) [..] This happens because the 'stats' structure is not initialized and stats->min equals to 0. Lets initialize the structure for every event after its allocation using init_stats() function. This initializes stats->min to -1 and makes 'Min time' statistics counting work: # perf kvm stat report Analyze events for all VCPUs: VM-EXIT Samples Samples% Time% Min Time Max Time Avg time [..] 0xB2 MSCH 12 0.07% 0.00% 6us 8us 7.31us ( +- 2.11% ) 0xB2 CHSC 12 0.07% 0.00% 7us 18us 9.39us ( +- 9.49% ) 0xB2 STPX 8 0.05% 0.00% 1us 2us 1.88us ( +- 7.18% ) 0xB2 STSI 7 0.04% 0.00% 1us 44us 16.49us ( +- 38.20% ) [..] Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Ahern <dsahern@gmail.com> Link: http://lkml.kernel.org/r/1397053319-2130-3-git-send-email-borntraeger@de.ibm.com [ Fixing the perf examples changelog output ] Signed-off-by: Jiri Olsa <jolsa@redhat.com>
2014-04-19	ext4: disable COLLAPSE_RANGE for bigalloc	Namjae Jeon	1	-0/+3
	Once COLLAPSE RANGE is be disable for ext4 with bigalloc feature till finding root-cause of problem. It will be enable with fixing that regression of xfstest(generic 075 and 091) again. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19	ext4: fix COLLAPSE_RANGE failure with 1KB block size	Namjae Jeon	1	-3/+10
	When formatting with 1KB or 2KB(not aligned with PAGE SIZE) block size, xfstests generic/075 and 091 are failing. The offset supplied to function truncate_pagecache_range is block size aligned. In this function start offset is re-aligned to PAGE_SIZE by rounding_up to the next page boundary. Due to this rounding up, old data remains in the page cache when blocksize is less than page size and start offset is not aligned with page size. In case of collapse range, we need to align start offset to page size boundary by doing a round down operation instead of round up. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-19	coredump: fix va_list corruption	Eric Dumazet	1	-1/+6
	A va_list needs to be copied in case it needs to be used twice. Thanks to Hugh for debugging this issue, leading to various panics. Tested: lpq84:~# echo "\|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern 'produce_core' is simply : main() { (int )0 = 1;} lpq84:~# ./produce_core Segmentation fault (core dumped) lpq84:~# dmesg \| tail -1 [ 614.352947] Core dump to \|/foobar12345 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 (null) pipe failed Notice the last argument was replaced by a NULL (we were lucky enough to not crash, but do not try this on your production machine !) After fix : lpq83:~# echo "\|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern lpq83:~# ./produce_core Segmentation fault lpq83:~# dmesg \| tail -1 [ 740.800441] Core dump to \|/foobar12345 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 pipe failed Fixes: 5fe9d8ca21cc ("coredump: cn_vprintf() has no reason to call vsnprintf() twice") Signed-off-by: Eric Dumazet <edumazet@google.com> Diagnosed-by: Hugh Dickins <hughd@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	thp: close race between split and zap huge pages	Kirill A. Shutemov	1	-3/+10
	Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have the same root cause. Let's look to them one by one. The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my testing I see that page_mapcount() is higher than mapcount here. I think it happens due to race between zap_huge_pmd() and page_check_address_pmd(). page_check_address_pmd() misses PMD which is under zap: CPU0 CPU1 zap_huge_pmd() pmdp_get_and_clear() __split_huge_page() anon_vma_interval_tree_foreach() __split_huge_page_splitting() page_check_address_pmd() mm_find_pmd() /* * We check if PMD present without taking ptl: no * serialization against zap_huge_pmd(). We miss this PMD, * it's not accounted to 'mapcount' in __split_huge_page(). / pmd_present(pmd) == 0 BUG_ON(mapcount != page_mapcount(page)) // CRASH!!! page_remove_rmap(page) atomic_add_negative(-1, &page->_mapcount) The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!". It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd(). This happens in similar way: CPU0 CPU1 zap_huge_pmd() pmdp_get_and_clear() page_remove_rmap(page) atomic_add_negative(-1, &page->_mapcount) __split_huge_page() anon_vma_interval_tree_foreach() __split_huge_page_splitting() page_check_address_pmd() mm_find_pmd() pmd_present(pmd) == 0 / The same comment as above / / * No crash this time since we already decremented page->_mapcount in * zap_huge_pmd(). / BUG_ON(mapcount != page_mapcount(page)) / * We split the compound page here into small pages without * serialization against zap_huge_pmd() */ __split_huge_page_refcount() VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!! So my understanding the problem is pmd_present() check in mm_find_pmd() without taking page table lock. The bug was introduced by me commit with commit 117b0791ac42. Sorry for that. :( Let's open code mm_find_pmd() in page_check_address_pmd() and do the check under page table lock. Note that __page_check_address() does the same for PTE entires if sync != 0. I've stress tested split and zap code paths for 36+ hours by now and don't see crashes with the patch applied. Before it took <20 min to trigger the first bug and few hours for second one (if we ignore first). [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com> [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	mm: fix new kernel-doc warning in filemap.c	Randy Dunlap	1	-1/+0
	Fix new kernel-doc warning in mm/filemap.c: Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	mm: fix CONFIG_DEBUG_VM_RB description	Davidlohr Bueso	1	-2/+1
	This appears to be a copy/paste error. Update the description to reflect extra rbtree debug and checks for the config option instead of duplicating CONFIG_DEBUG_VM. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	mm: use paravirt friendly ops for NUMA hinting ptes	Mel Gorman	1	-8/+23
	David Vrabel identified a regression when using automatic NUMA balancing under Xen whereby page table entries were getting corrupted due to the use of native PTE operations. Quoting him Xen PV guest page tables require that their entries use machine addresses if the preset bit (_PAGE_PRESENT) is set, and (for successful migration) non-present PTEs must use pseudo-physical addresses. This is because on migration MFNs in present PTEs are translated to PFNs (canonicalised) so they may be translated back to the new MFN in the destination domain (uncanonicalised). pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma() set and clear the _PAGE_PRESENT bit using pte_set_flags(), pte_clear_flags(), etc. In a Xen PV guest, these functions must translate MFNs to PFNs when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting _PAGE_PRESENT. His suggested fix converted p[te\|md]_[set\|clear]_flags to using paravirt-friendly ops but this is overkill. He suggested an alternative of using p[te\|md]_modify in the NUMA page table operations but this is does more work than necessary and would require looking up a VMA for protections. This patch modifies the NUMA page table operations to use paravirt friendly operations to set/clear the flags of interest. Unfortunately this will take a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not see a way around it that does not break Xen. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Vrabel <david.vrabel@citrix.com> Tested-by: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	mips: export flush_icache_range	Kees Cook	1	-2/+2
	The lkdtm module performs tests against executable memory ranges, so it needs to flush the icache for proper behaviors. Other architectures already export this, so do the same for MIPS. [akpm@linux-foundation.org: relocate export sites] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Sanjay Lal <sanjayl@kymasys.com> Cc: John Crispin <blogic@openwrt.org> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()	Mizuma, Masayoshi	1	-0/+1
	soft lockup in freeing gigantic hugepage fixed in commit 55f67141a892 "mm: hugetlb: fix softlockup when a large number of hugepages are freed." can happen in return_unused_surplus_pages(), so let's fix it. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	wait: explain the shadowing and type inconsistencies	Peter Zijlstra	1	-1/+13
	Stick in a comment before someone else tries to fix the sparse warning this generates. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-o2ro6f3vkxklni0bc8f7m68s@git.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	Shiraz has moved	Viresh Kumar	11	-13/+14
	shiraz.hashim@st.com email-id doesn't exist anymore as he has left the company. Replace ST's id with shiraz.linux.kernel@gmail.com. It also updates .mailmap file to fix address for 'git shortlog'. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	Documentation/vm/numa_memory_policy.txt: fix wrong document in numa_memory_policy.txt	Tang Chen	1	-3/+2
	In document numa_memory_policy.txt, the following examples for flag MPOL_F_RELATIVE_NODES are incorrect. For example, consider a task that is attached to a cpuset with mems 2-5 that sets an Interleave policy over the same set with MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the interleave now occurs over nodes 3,5-6. If the cpuset's mems then change to 0,2-3,5, then the interleave occurs over nodes 0,3,5. According to the comment of the patch adding flag MPOL_F_RELATIVE_NODES, the nodemasks the user specifies should be considered relative to the current task's mems_allowed. (https://lkml.org/lkml/2008/2/29/428) And according to numa_memory_policy.txt, if the user's nodemask includes nodes that are outside the range of the new set of allowed nodes, then the remap wraps around to the beginning of the nodemask and, if not already set, sets the node in the mempolicy nodemask. So in the example, if the user specifies 2-5, for a task whose mems_allowed is 3-7, the nodemasks should be remapped the third, fourth, fifth, sixth node in mems_allowed. like the following: mems_allowed: 3 4 5 6 7 relative index: 0 1 2 3 4 5 So the nodemasks should be remapped to 3,5-7, but not 3,5-6. And for a task whose mems_allowed is 0,2-3,5, the nodemasks should be remapped to 0,2-3,5, but not 0,3,5. mems_allowed: 0 2 3 5 relative index: 0 1 2 3 4 5 Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Randy Dunlap <rdunlap@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	powerpc/mm: fix ".__node_distance" undefined	Mike Qiu	1	-0/+1
	CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h ... Building modules, stage 2. WARNING: 1 bad relocations c0000000013d6a30 R_PPC64_ADDR64 uprobes_fetch_type_table WRAP arch/powerpc/boot/zImage.pseries WRAP arch/powerpc/boot/zImage.epapr MODPOST 1849 modules ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined! make[1]: * [__modpost] Error 1 make: * [modules] Error 2 make: *** Waiting for unfinished jobs.... The reason is symbol "__node_distance" not been exported in powerpc. Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Alistair Popple <alistair@popple.id.au> Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	kernel/watchdog.c:touch_softlockup_watchdog(): use raw_cpu_write()	Andrew Morton	1	-1/+5
	Fix: BUG: using __this_cpu_write() in preemptible [00000000] code: systemd-udevd/497 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 3 PID: 497 Comm: systemd-udevd Tainted: G W 3.15.0-rc1 #9 Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.02 04/27/2012 Call Trace: check_preemption_disabled+0xe1/0xf0 __this_cpu_preempt_check+0x13/0x20 touch_nmi_watchdog+0x28/0x40 Reported-by: Luis Henriques <luis.henriques@canonical.com> Tested-by: Luis Henriques <luis.henriques@canonical.com> Cc: Eric Piel <eric.piel@tremplin-utc.net> Cc: Robert Moore <robert.moore@intel.com> Cc: Lv Zheng <lv.zheng@intel.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	init/Kconfig: move the trusted keyring config option to general setup	Peter Foley	1	-12/+12
	The SYSTEM_TRUSTED_KEYRING config option is not in any menu, causing it to show up in the toplevel of the kernel configuration. Fix this by moving it under the General Setup menu. Signed-off-by: Peter Foley <pefoley2@pefoley.com> Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()	Christoph Lameter	1	-1/+1
	Seems to be called with preemption enabled. Therefore it must use mod_zone_page_state instead. Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Grygorii Strashko <grygorii.strashko@ti.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tejun Heo <tj@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	net: sctp: cache auth_enable per endpoint	Vlad Yasevich	7	-60/+92
	Currently, it is possible to create an SCTP socket, then switch auth_enable via sysctl setting to 1 and crash the system on connect: Oops[#1]: CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1 task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000 [...] Call Trace: [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80 [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4 [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8 [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214 [<ffffffff8043af68>] sctp_rcv+0x588/0x630 [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24 [<ffffffff803acb50>] ip6_input+0x2c0/0x440 [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564 [<ffffffff80310650>] process_backlog+0xb4/0x18c [<ffffffff80313cbc>] net_rx_action+0x12c/0x210 [<ffffffff80034254>] __do_softirq+0x17c/0x2ac [<ffffffff800345e0>] irq_exit+0x54/0xb0 [<ffffffff800075a4>] ret_from_irq+0x0/0x4 [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48 [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148 [<ffffffff805a88b0>] start_kernel+0x37c/0x398 Code: dd0900b8 000330f8 0126302d <dcc60000> 50c0fff1 0047182a a48306a0 03e00008 00000000 ---[ end trace b530b0551467f2fd ]--- Kernel panic - not syncing: Fatal exception in interrupt What happens while auth_enable=0 in that case is, that ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs() when endpoint is being created. After that point, if an admin switches over to auth_enable=1, the machine can crash due to NULL pointer dereference during reception of an INIT chunk. When we enter sctp_process_init() via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk, the INIT verification succeeds and while we walk and process all INIT params via sctp_process_param() we find that net->sctp.auth_enable is set, therefore do not fall through, but invoke sctp_auth_asoc_set_default_hmac() instead, and thus, dereference what we have set to NULL during endpoint initialization phase. The fix is to make auth_enable immutable by caching its value during endpoint initialization, so that its original value is being carried along until destruction. The bug seems to originate from the very first days. Fix in joint work with Daniel Borkmann. Reported-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-18	tg3: update rx_jumbo_pending ring param only when jumbo frames are enabled	Ivan Vecera	1	-1/+3
	The patch fixes a problem with dropped jumbo frames after usage of 'ethtool -G ... rx'. Scenario: 1. ip link set eth0 up 2. ethtool -G eth0 rx N # <- This zeroes rx-jumbo 3. ip link set mtu 9000 dev eth0 The ethtool command set rx_jumbo_pending to zero so any received jumbo packets are dropped and you need to use 'ethtool -G eth0 rx-jumbo N' to workaround the issue. The patch changes the logic so rx_jumbo_pending value is changed only if jumbo frames are enabled (MTU > 1500). Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-18	vlan: Fix lockdep warning when vlan dev handle notification	dingtianhong	2	-5/+42
	When I open the LOCKDEP config and run these steps: modprobe 8021q vconfig add eth2 20 vconfig add eth2.20 30 ifconfig eth2 xx.xx.xx.xx then the Call Trace happened: [32524.386288] ============================================= [32524.386293] [ INFO: possible recursive locking detected ] [32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G O [32524.386302] --------------------------------------------- [32524.386306] ifconfig/3103 is trying to acquire lock: [32524.386310] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386326] [32524.386326] but task is already holding lock: [32524.386330] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386341] [32524.386341] other info that might help us debug this: [32524.386345] Possible unsafe locking scenario: [32524.386345] [32524.386350] CPU0 [32524.386352] ---- [32524.386354] lock(&vlan_netdev_addr_lock_key/1); [32524.386359] lock(&vlan_netdev_addr_lock_key/1); [32524.386364] [32524.386364] * DEADLOCK * [32524.386364] [32524.386368] May be due to missing lock nesting notation [32524.386368] [32524.386373] 2 locks held by ifconfig/3103: [32524.386376] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81431d42>] rtnl_lock+0x12/0x20 [32524.386387] #1: (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386398] [32524.386398] stack backtrace: [32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G O 3.14.0-rc2-0.7-default+ #35 [32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [32524.386414] ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8 [32524.386421] ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0 [32524.386428] 000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000 [32524.386435] Call Trace: [32524.386441] [<ffffffff814f68a2>] dump_stack+0x6a/0x78 [32524.386448] [<ffffffff810a35fb>] __lock_acquire+0x7ab/0x1940 [32524.386454] [<ffffffff810a323a>] ? __lock_acquire+0x3ea/0x1940 [32524.386459] [<ffffffff810a4874>] lock_acquire+0xe4/0x110 [32524.386464] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386471] [<ffffffff814fc07a>] _raw_spin_lock_nested+0x2a/0x40 [32524.386476] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386481] [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386489] [<ffffffffa0500cab>] vlan_dev_set_rx_mode+0x2b/0x50 [8021q] [32524.386495] [<ffffffff8141addf>] __dev_set_rx_mode+0x5f/0xb0 [32524.386500] [<ffffffff8141af8b>] dev_set_rx_mode+0x2b/0x40 [32524.386506] [<ffffffff8141b3cf>] __dev_open+0xef/0x150 [32524.386511] [<ffffffff8141b177>] __dev_change_flags+0xa7/0x190 [32524.386516] [<ffffffff8141b292>] dev_change_flags+0x32/0x80 [32524.386524] [<ffffffff8149ca56>] devinet_ioctl+0x7d6/0x830 [32524.386532] [<ffffffff81437b0b>] ? dev_ioctl+0x34b/0x660 [32524.386540] [<ffffffff814a05b0>] inet_ioctl+0x80/0xa0 [32524.386550] [<ffffffff8140199d>] sock_do_ioctl+0x2d/0x60 [32524.386558] [<ffffffff81401a52>] sock_ioctl+0x82/0x2a0 [32524.386568] [<ffffffff811a7123>] do_vfs_ioctl+0x93/0x590 [32524.386578] [<ffffffff811b2705>] ? rcu_read_lock_held+0x45/0x50 [32524.386586] [<ffffffff811b39e5>] ? __fget_light+0x105/0x110 [32524.386594] [<ffffffff811a76b1>] SyS_ioctl+0x91/0xb0 [32524.386604] [<ffffffff815057e2>] system_call_fastpath+0x16/0x1b ======================================================================== The reason is that all of the addr_lock_key for vlan dev have the same class, so if we change the status for vlan dev, the vlan dev and its real dev will hold the same class of addr_lock_key together, so the warning happened. we should distinguish the lock depth for vlan dev and its real dev. v1->v2: Convert the vlan_netdev_addr_lock_key to an array of eight elements, which could support to add 8 vlan id on a same vlan dev, I think it is enough for current scene, because a netdev's name is limited to IFNAMSIZ which could not hold 8 vlan id, and the vlan dev would not meet the same class key with its real dev. The new function vlan_dev_get_lockdep_subkey() will return the subkey and make the vlan dev could get a suitable class key. v2->v3: According David's suggestion, I use the subclass to distinguish the lock key for vlan dev and its real dev, but it make no sense, because the difference for subclass in the lock_class_key doesn't mean that the difference class for lock_key, so I use lock_depth to distinguish the different depth for every vlan dev, the same depth of the vlan dev could have the same lock_class_key, I import the MAX_LOCK_DEPTH from the include/linux/sched.h, I think it is enough here, the lockdep should never exceed that value. v3->v4: Add a huge array of locking keys will waste static kernel memory and is not a appropriate method, we could use _nested() variants to fix the problem, calculate the depth for every vlan dev, and use the depth as the subclass for addr_lock_key. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-18	ARC: Delete stale barrier.h	Vineet Gupta	1	-37/+0
	Commit 93ea02bb8435 ("arch: Clean up asm/barrier.h implementations") wired generic barrier.h for ARC, but failed to delete the existing file. In 3.15, due to rcupdate.h updates, this causes a build breakage on ARC: CC arch/arc/kernel/asm-offsets.s In file included from include/linux/sched.h:45:0, from arch/arc/kernel/asm-offsets.c:9: include/linux/rculist.h: In function __list_add_rcu: include/linux/rculist.h:54:2: error: implicit declaration of function smp_store_release [-Werror=implicit-function-declaration] rcu_assign_pointer(list_next_rcu(prev), new); ^ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18	ext4: use EINVAL if not a regular file in ext4_collapse_range()	Theodore Ts'o	1	-1/+1
	Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18	ext4: enforce we are operating on a regular file in ext4_zero_range()	jon ernst	1	-0/+3
	Signed-off-by: Jon Ernst <jonernst07@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18	ext4: fix extent merging in ext4_ext_shift_path_extents()	Lukas Czerner	1	-7/+8
	There is a bug in ext4_ext_shift_path_extents() where if we actually manage to merge a extent we would skip shifting the next extent. This will result in in one extent in the extent tree not being properly shifted. This is causing failure in various xfstests tests using fsx or fsstress with collapse range support. It will also cause file system corruption which looks something like: e2fsck 1.42.9 (4-Feb-2014) Pass 1: Checking inodes, blocks, and sizes Inode 20 has out of order extents (invalid logical block 3, physical block 492938, len 2) Clear? yes ... when running e2fsck. It's also very easily reproducible just by running fsx without any parameters. I can usually hit the problem within a minute. Fix it by increasing ex_start only if we're not merging the extent. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2014-04-18	ext4: discard preallocations after removing space	Lukas Czerner	2	-2/+1
	Currently in ext4_collapse_range() and ext4_punch_hole() we're discarding preallocation twice. Once before we attempt to do any changes and second time after we're done with the changes. While the second call to ext4_discard_preallocations() in ext4_punch_hole() case is not needed, we need to discard preallocation right after ext4_ext_remove_space() in collapse range case because in the case we had to restart a transaction in the middle of removing space we might have new preallocations created. Remove unneeded ext4_discard_preallocations() ext4_punch_hole() and move it to the better place in ext4_collapse_range() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18	ext4: no need to truncate pagecache twice in collapse range	Lukas Czerner	1	-1/+1
	We're already calling truncate_pagecache() before we attempt to do any actual job so there is not need to truncate pagecache once more using truncate_setsize() after we're finished. Remove truncate_setsize() and replace it just with i_size_write() note that we're holding appropriate locks. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>