aboutsummaryrefslogtreecommitdiffstats
path: root/mm (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-02-24thp: call pmdp_invalidate() with correct virtual addressKirill A. Shutemov1-4/+5
Sebastian Ott and Gerald Schaefer reported random crashes on s390. It was bisected to my THP refcounting patchset. The problem is that pmdp_invalidated() called with wrong virtual address. It got offset up by HPAGE_PMD_SIZE by loop over ptes. The solution is to introduce new variable to be used in loop and don't touch 'haddr'. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-and-tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reported-and-tested-by Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-20Merge tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds1-0/+1
Pull powerpc fixes from Michael Ellerman: - Fix build error on 32-bit with checkpoint restart from Aneesh Kumar - Fix dedotify for binutils >= 2.26 from Andreas Schwab - Don't trace hcalls on offline CPUs from Denis Kirjanov - eeh: Fix stale cached primary bus from Gavin Shan - eeh: Fix stale PE primary bus from Gavin Shan - mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V - ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy * tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/ioda: Set "read" permission when "write" is set powerpc/mm: Fix Multi hit ERAT cause by recent THP update powerpc/powernv: Fix stale PE primary bus powerpc/eeh: Fix stale cached primary bus powerpc/pseries: Don't trace hcalls on offline CPUs powerpc: Fix dedotify for binutils >= 2.26 powerpc/book3s_32: Fix build error with checkpoint restart
2016-02-18mm: slab: free kmem_cache_node after destroy sysfs fileDmitry Safonov5-27/+29
When slub_debug alloc_calls_show is enabled we will try to track location and user of slab object on each online node, kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed till there is the last reference to sysfs file. This fixes the following panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: list_locations+0x169/0x4e0 PGD 257304067 PUD 438456067 PMD 0 Oops: 0000 [#1] SMP CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30 Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011 task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000 RIP: list_locations+0x169/0x4e0 Call Trace: alloc_calls_show+0x1d/0x30 slab_attr_show+0x1b/0x30 sysfs_read_file+0x9a/0x1a0 vfs_read+0x9c/0x170 SyS_read+0x58/0xb0 system_call_fastpath+0x16/0x1b Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10 CR2: 0000000000000020 Separated __kmem_cache_release from __kmem_cache_shutdown which now called on slab_kmem_cache_release (after the last reference to sysfs file object has dropped). Reintroduced locking in free_partial as sysfs file might access cache's partial list after shutdowning - partial revert of the commit 69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap __remove_partial and use remove_partial (w/o underscores) as free_partial now takes list_lock which s partial revert for commit 1e4dd9461fab ("slub: do not assert not having lock in removing freed partial") Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18mm/hugetlb.c: fix incorrect proc nr_hugepages valueVaishali Thakkar1-2/+4
Currently incorrect default hugepage pool size is reported by proc nr_hugepages when number of pages for the default huge page size is specified twice. When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages indicates the current number of pre-allocated huge pages of the default size. Basically /proc/sys/vm/nr_hugepages displays default_hstate-> max_huge_pages and after boot time pre-allocation, max_huge_pages should equal the number of pre-allocated pages (nr_hugepages). Test case: Note that this is specific to x86 architecture. Boot the kernel with command line option 'default_hugepagesz=1G hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'. After boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages' returns the value X. However, dmesg output shows that Z huge pages were pre-allocated. So, the root cause of the problem here is that the global variable default_hstate_max_huge_pages is set if a default huge page size is specified (directly or indirectly) on the command line. After the command line processing in hugetlb_init, if default_hstate_max_huge_pages is set, the value is assigned to default_hstae.max_huge_pages. However, default_hstate.max_huge_pages may have already been set based on the number of pre-allocated huge pages of default_hstate size. The solution to this problem is if hstate->max_huge_pages is already set then it should not set as a result of global max_huge_pages value. Basically if the value of the variable hugepages is set multiple times on a command line for a specific supported hugepagesize then proc layer should consider the last specified value. Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18mm: fix regression in remap_file_pages() emulationKirill A. Shutemov1-5/+29
Grazvydas Ignotas has reported a regression in remap_file_pages() emulation. Testcase: #define _GNU_SOURCE #include <assert.h> #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #define SIZE (4096 * 3) int main(int argc, char **argv) { unsigned long *p; long i; p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (p == MAP_FAILED) { perror("mmap"); return -1; } for (i = 0; i < SIZE / 4096; i++) p[i * 4096 / sizeof(*p)] = i; if (remap_file_pages(p, 4096, 0, 1, 0)) { perror("remap_file_pages"); return -1; } if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) { perror("remap_file_pages"); return -1; } assert(p[0] == 1); munmap(p, SIZE); return 0; } The second remap_file_pages() fails with -EINVAL. The reason is that remap_file_pages() emulation assumes that the target vma covers whole area we want to over map. That assumption is broken by first remap_file_pages() call: it split the area into two vma. The solution is to check next adjacent vmas, if they map the same file with the same flags. Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Grazvydas Ignotas <notasas@gmail.com> Tested-by: Grazvydas Ignotas <notasas@gmail.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18thp, dax: do not try to withdraw pgtable from non-anon VMAKirill A. Shutemov1-1/+2
DAX doesn't deposit pgtables when it maps huge pages: nothing to withdraw. It can lead to crash. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-15powerpc/mm: Fix Multi hit ERAT cause by recent THP updateAneesh Kumar K.V1-0/+1
With ppc64 we use the deposited pgtable_t to store the hash pte slot information. We should not withdraw the deposited pgtable_t without marking the pmd none. This ensure that low level hash fault handling will skip this huge pte and we will handle them at upper levels. Recent change to pmd splitting changed the above in order to handle the race between pmd split and exit_mmap. The race is explained below. Consider following race: CPU0 CPU1 shrink_page_list() add_to_swap() split_huge_page_to_list() __split_huge_pmd_locked() pmdp_huge_clear_flush_notify() // pmd_none() == true exit_mmap() unmap_vmas() zap_pmd_range() // no action on pmd since pmd_none() == true pmd_populate() As result the THP will not be freed. The leak is detected by check_mm(): BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512 The above required us to not mark pmd none during a pmd split. The fix for ppc is to clear the huge pte of _PAGE_USER, so that low level fault handling code skip this pte. At higher level we do take ptl lock. That should serialze us against the pmd split. Once the lock is acquired we do check the pmd again using pmd_same. That should always return false for us and hence we should retry the access. We do the pmd_same check in all case after taking plt with THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and huge_pmd_set_accessed) Also make sure we wait for irq disable section in other cpus to finish before flipping a huge pte entry with a regular pmd entry. Code paths like find_linux_pte_or_hugepte depend on irq disable to get a stable pte_t pointer. A parallel thp split need to make sure we don't convert a pmd pte to a regular pmd entry without waiting for the irq disable section to finish. Fixes: eef1b3ba053a ("thp: implement split_huge_pmd()") Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-11mm,thp: fix spellos in describing __HAVE_ARCH_FLUSH_PMD_TLB_RANGEVineet Gupta1-2/+2
[akpm@linux-foundation.org: s/threshhold/threshold/] Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-11mm,thp: khugepaged: call pte flush at the time of collapseVineet Gupta1-1/+3
This showed up on ARC when running LMBench bw_mem tests as Overlapping TLB Machine Check Exception triggered due to STLB entry (2M pages) overlapping some NTLB entry (regular 8K page). bw_mem 2m touches a large chunk of vaddr creating NTLB entries. In the interim khugepaged kicks in, collapsing the contiguous ptes into a single pmd. pmdp_collapse_flush()->flush_pmd_tlb_range() is called to flush out NTLB entries for the ptes. This for ARC (by design) can only shootdown STLB entries (for pmd). The stray NTLB entries cause the overlap with the subsequent STLB entry for collapsed page. So make pmdp_collapse_flush() call pte flush interface not pmd flush. Note that originally all thp flush call sites in generic code called flush_tlb_range() leaving it to architecture to implement the flush for pte and/or pmd. Commit 12ebc1581ad11454 changed this by calling a new opt-in API flush_pmd_tlb_range() which made the semantics more explicit but failed to distinguish the pte vs pmd flush in generic code, which is what this patch fixes. Note that ARC can fixed w/o touching the generic pmdp_collapse_flush() by defining a ARC version, but that defeats the purpose of generic version, plus sementically this is the right thing to do. Fixes STAR 9000961194: LMBench on AXS103 triggering duplicate TLB exceptions with super pages Fixes: 12ebc1581ad11454 ("mm,thp: introduce flush_pmd_tlb_range") Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.4] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-11mm/backing-dev.c: fix error path in wb_init()Rasmus Villemoes1-1/+1
We need to use post-decrement to get percpu_counter_destroy() called on &wb->stat[0]. Moreover, the pre-decremebt would cause infinite out-of-bounds accesses if the setup code failed at i==0. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-11mm, dax: check for pmd_none() after split_huge_pmd()Kirill A. Shutemov2-2/+6
DAX implements split_huge_pmd() by clearing pmd. This simple approach reduces memory overhead, as we don't need to deposit page table on huge page mapping to make split_huge_pmd() never-fail. PTE table can be allocated and populated later on page fault from backing store. But one side effect is that have to check if pmd is pmd_none() after split_huge_pmd(). In most places we do this already to deal with parallel MADV_DONTNEED. But I found two call sites which is not affected by MADV_DONTNEED (due down_write(mmap_sem)), but need to have the check to work with DAX properly. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-11mm: fix filemap.c kernel doc warningRandy Dunlap1-0/+1
Add missing kernel-doc notation for function parameter 'gfp_mask' to fix kernel-doc warning. mm/filemap.c:1898: warning: No description found for parameter 'gfp_mask' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05thp: make deferred_split_scan() work againKirill A. Shutemov1-1/+1
We need to iterate over split_queue, not local empty list to get anything split from the shrinker. Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: replace vma_lock_anon_vma with anon_vma_lock_read/writeKonstantin Khlebnikov1-30/+25
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if anon_vma appeared between lock and unlock. We have to check anon_vma first or call anon_vma_prepare() to be sure that it's here. There are only few users of these legacy helpers. Let's get rid of them. This patch fixes anon_vma lock imbalance in validate_mm(). Write lock isn't required here, read lock is enough. And reorders expand_downwards/expand_upwards: security_mmap_addr() and wrapping-around check don't have to be under anon vma lock. Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, hugetlb: don't require CMA for runtime gigantic pagesVlastimil Babka2-2/+2
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") has added the runtime gigantic page allocation via alloc_contig_range(), making this support available only when CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the associated infrastructure, it is possible with few simple adjustments to require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA. After this patch, alloc_contig_range() and related functions are available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows supporting runtime gigantic pages without the CMA-specific checks in page allocator fastpaths. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm/hugetlb: fix gigantic page initialization/allocationMike Kravetz1-2/+3
Attempting to preallocate 1G gigantic huge pages at boot time with "hugepagesz=1G hugepages=1" on the kernel command line will prevent booting with the following: kernel BUG at mm/hugetlb.c:1218! When mapcount accounting was reworked, the setting of compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As a result, the validation of mapcount in free_huge_page fails. The "BUG_ON" checks in free_huge_page were also changed to "VM_BUG_ON_PAGE" to assist with debugging. Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: downgrade VM_BUG in isolate_lru_page() to warningKirill A. Shutemov1-1/+1
Calling isolate_lru_page() is wrong and shouldn't happen, but it not nessesary fatal: the page just will not be isolated if it's not on LRU. Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mempolicy: do not try to queue pages from !vma_migratable()Kirill A. Shutemov1-9/+5
Maybe I miss some point, but I don't see a reason why we try to queue pages from non migratable VMAs. This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page(): #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <sys/mman.h> #include <numaif.h> #define SIZE 0x2000 int foo; int main() { int fd; char *p; unsigned long mask = 2; fd = open("/dev/sg0", O_RDWR); p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); /* Faultin pages */ foo = p[0] + p[0x1000]; mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT); return 0; } The only case when we can queue pages from such VMA is MPOL_MF_STRICT plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU, but gfp mask is not sutable for migaration (see mapping_gfp_mask() check in vma_migratable()). That's looks like a bug to me. Let's filter out non-migratable vma at start of queue_pages_test_walk() and go to queue_pages_pte_range() only if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progressTetsuo Handa1-1/+1
Jan Stancek has reported that system occasionally hanging after "oom01" testcase from LTP triggers OOM. Guessing from a result that there is a kworker thread doing memory allocation and the values between "Node 0 Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not up-to-date for some reason. According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress"), it meant to force the kworker thread to take a short sleep, but it by error used schedule_timeout(1). We missed that schedule_timeout() in state TASK_RUNNING doesn't do anything. Fix it by using schedule_timeout_uninterruptible(1) which forces the kworker thread to take a short sleep in order to make sure that vmstat is up-to-date. Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05vmstat: make vmstat_update deferrableMichal Hocko1-1/+1
Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") made vmstat_shepherd deferrable. vmstat_update itself is still useing standard timer which might interrupt idle task. This is possible because "mm, vmstat: make quiet_vmstat lighter" removed cancel_delayed_work from the quiet_vmstat. Change vmstat_work to use DEFERRABLE_WORK to prevent from pointless wakeups from the idle context. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: make quiet_vmstat lighterMichal Hocko1-22/+46
Mike has reported a considerable overhead of refresh_cpu_vm_stats from the idle entry during pipe test: 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12 4.75% [kernel] [k] __schedule 4.70% [kernel] [k] mutex_unlock 3.14% [kernel] [k] __switch_to This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") which has placed quiet_vmstat into cpu_idle_loop. The main reason here seems to be that the idle entry has to get over all zones and perform atomic operations for each vmstat entry even though there might be no per cpu diffs. This is a pointless overhead for _each_ idle entry. Make sure that quiet_vmstat is as light as possible. First of all it doesn't make any sense to do any local sync if the current cpu is already set in oncpu_stat_off because vmstat_update puts itself there only if there is nothing to do. Then we can check need_update which should be a cheap way to check for potential per-cpu diffs and only then do refresh_cpu_vm_stats. The original patch also did cancel_delayed_work which we are not doing here. There are two reasons for that. Firstly cancel_delayed_work from idle context will blow up on RT kernels (reported by Mike): CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7 Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 Call Trace: dump_stack+0x49/0x67 ___might_sleep+0xf5/0x180 rt_spin_lock+0x20/0x50 try_to_grab_pending+0x69/0x240 cancel_delayed_work+0x26/0xe0 quiet_vmstat+0x75/0xa0 cpu_idle_loop+0x38/0x3e0 cpu_startup_entry+0x13/0x20 start_secondary+0x114/0x140 And secondly, even on !RT kernels it might add some non trivial overhead which is not necessary. Even if the vmstat worker wakes up and preempts idle then it will be most likely a single shot noop because the stats were already synced and so it would end up on the oncpu_stat_off anyway. We just need to teach both vmstat_shepherd and vmstat_update to stop scheduling the worker if there is nothing to do. [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU] Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INITVlastimil Babka1-4/+5
The description mentions kswapd threads, while the deferred struct page initialization is actually done by one-off "pgdatinitX" threads. Fix the description so that potentially users are not confused about pgdatinit threads using CPU after boot instead of kswapd. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05memblock: don't mark memblock_phys_mem_size() as __initDavid Gibson1-1/+1
At the moment memblock_phys_mem_size() is marked as __init, and so is discarded after boot. This is different from most of the memblock functions which are marked __init_memblock, and are only discarded after boot if memory hotplug is not configured. To allow for upcoming code which will need memblock_phys_mem_size() in the hotplug path, change it from __init to __init_memblock. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: validate_mm browse_rb SMP race conditionAndrea Arcangeli1-2/+5
The mmap_sem for reading in validate_mm called from expand_stack is not enough to prevent the argumented rbtree rb_subtree_gap information to change from under us because expand_stack may be running from other threads concurrently which will hold the mmap_sem for reading too. The argumented rbtree is updated with vma_gap_update under the page_table_lock so use it in browse_rb() too to avoid false positives. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds6-73/+103
Merge fixes from Andrew Morton: "18 fixes" [ The 18 fixes turned into 17 commits, because one of the fixes was a fix for another patch in the series that I just folded in by editing the patch manually - hopefully correctly - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: fix memory leak in copy_huge_pmd() drivers/hwspinlock: fix race between radix tree insertion and lookup radix-tree: fix race in gang lookup mm/vmpressure.c: fix subtree pressure detection mm: polish virtual memory accounting mm: warn about VmData over RLIMIT_DATA Documentation: cgroup-v2: add memory.stat::sock description mm: memcontrol: drop superfluous entry in the per-memcg stats array drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration proc: revert /proc/<pid>/maps [stack:TID] annotation numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390 MAINTAINERS: update Seth email ocfs2/cluster: fix memory leak in o2hb_region_release lib/test-string_helpers.c: fix and improve string_get_size() tests thp: limit number of object to scan on deferred_split_scan() thp: change deferred_split_count() to return number of THP in queue thp: make split_queue per-node
2016-02-03mm: retire GUP WARN_ON_ONCE that outlived its usefulnessHugh Dickins2-8/+1
Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared areas"). The warning has served its purpose, nobody was harmed by that change, so just remove the warning to generate less noise from Trinity. Which reminds me of the comment I wrongly left behind with that commit (but was spotted at the time by Kirill), which has since moved into a separate function, and become even more obscure: delete it. Reported-by: Dave Jones <davej@codemonkey.org.uk> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03mm: fix memory leak in copy_huge_pmd()Matthew Wilcox1-7/+10
We allocate a pgtable but do not attach it to anything if the PMD is in a DAX VMA, causing it to leak. We certainly try to not free pgtables associated with the huge zero page if the zero page is in a DAX VMA, so I think this is the right solution. This needs to be properly audited. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03mm/vmpressure.c: fix subtree pressure detectionVladimir Davydov1-2/+1
When vmpressure is called for the entire subtree under pressure we mistakenly use vmpressure->scanned instead of vmpressure->tree_scanned when checking if vmpressure work is to be scheduled. This results in suppressing all vmpressure events in the legacy cgroup hierarchy. Fix it. Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure") Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03mm: polish virtual memory accountingKonstantin Khlebnikov1-4/+19
* add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture * always account VMAs with flag VM_STACK as stack (as it was before) * cleanup classifying helpers * update comments and documentation Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03mm: warn about VmData over RLIMIT_DATAKonstantin Khlebnikov2-6/+33
This patch provides a way of working around a slight regression introduced by commit 84638335900f ("mm: rework virtual memory accounting"). Before that commit RLIMIT_DATA have control only over size of the brk region. But that change have caused problems with all existing versions of valgrind, because it set RLIMIT_DATA to zero. This patch fixes rlimit check (limit actually in bytes, not pages) and by default turns it into warning which prints at first VmData misuse: "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon." Behavior is controlled by boot param ignore_rlimit_data=y/n and by sysfs /sys/module/kernel/parameters/ignore_rlimit_data. For now it set to "y". [akpm@linux-foundation.org: tweak kernel-parameters.txt text[ Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kees Cook <keescook@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03proc: revert /proc/<pid>/maps [stack:TID] annotationJohannes Weiner1-26/+1
Commit b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps. Finding the task of a stack VMA requires walking the entire thread list, turning this into quadratic behavior: a thousand threads means a thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a million combinations. The cost is not in proportion to the usefulness as described in the patch. Drop the [stack:TID] annotation to make /proc/<pid>/maps (and /proc/<pid>/numa_maps) usable again for higher thread counts. The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as identifying the stack VMA there is an O(1) operation. Siddesh said: "The end users needed a way to identify thread stacks programmatically and there wasn't a way to do that. I'm afraid I no longer remember (or have access to the resources that would aid my memory since I changed employers) the details of their requirement. However, I did do this on my own time because I thought it was an interesting project for me and nobody really gave any feedback then as to its utility, so as far as I am concerned you could roll back the main thread maps information since the information is available in the thread-specific files" Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03thp: limit number of object to scan on deferred_split_scan()Kirill A. Shutemov1-4/+6
If we have a lot of pages in queue to be split, deferred_split_scan() can spend unreasonable amount of time under spinlock with disabled interrupts. Let's cap number of pages to split on scan by sc->nr_to_scan. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03thp: change deferred_split_count() to return number of THP in queueKirill A. Shutemov1-7/+8
I've got meaning of shrinker::count_objects() wrong: it should return number of potentially freeable objects, which is not necessary correlate with freeable memory. Returning 256 per THP in queue is not reasonable: shrinker::scan_objects() never called with nr_to_scan > 128 in my setup. Let's return 1 per THP and correct scan_object accordingly. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03thp: make split_queue per-nodeKirill A. Shutemov2-23/+31
Andrea Arcangeli suggested to make split queue per-node to improve scalability. Let's do it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-01Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds1-2/+7
Pull libnvdimm fixes from Dan Williams: "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved area for storing a struct page array. 2/ Fixes for dax operations on a raw block device to prevent pagecache collisions with dax mappings. 3/ A fix for pfn_t usage in vm_insert_mixed that lead to a null pointer de-reference. These have received build success notification from the kbuild robot across 153 configs and pass the latest ndctl tests" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: phys_to_pfn_t: use phys_addr_t mm: fix pfn_t to page conversion in vm_insert_mixed block: use DAX for partition table reads block: revert runtime dax control of the raw block device fs, block: force direct-I/O for dax-enabled block devices devm_memremap_pages: fix vmem_altmap lifetime + alignment handling libnvdimm, pfn: fix restoring memmap location libnvdimm: fix mode determination for e820 devices
2016-01-31mm: fix pfn_t to page conversion in vm_insert_mixedDan Williams1-2/+7
pfn_t_to_page() honors the flags in the pfn_t value to determine if a pfn is backed by a page. However, vm_insert_mixed() was originally written to use pfn_valid() to make this determination. To restore the old/correct behavior, ignore the pfn_t flags in the !pfn_t_devmap() case and fallback to trusting pfn_valid(). Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t") Cc: Dave Hansen <dave@sr71.net> Cc: David Airlie <airlied@linux.ie> Reported-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Tested-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-29Merge branch 'stable/for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mmLinus Torvalds1-2/+2
Pull cleancache cleanups from Konrad Rzeszutek Wilk: "Simple cleanups" * 'stable/for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: include/linux/cleancache.h: Clean up code cleancache: constify cleancache_ops structure
2016-01-27cleancache: constify cleancache_ops structureJulia Lawall1-2/+2
The cleancache_ops structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-24vmstat: Remove BUG_ON from vmstat_updateChristoph Lameter1-11/+1
If we detect that there is nothing to do just set the flag and do not check if it was already set before. Races really do not matter. If the flag is set by any code then the shepherd will start dealing with the situation and reenable the vmstat workers when necessary again. Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") quiet_vmstat might update cpu_stat_off and mark a particular cpu to be handled by vmstat_shepherd. This might trigger a VM_BUG_ON in vmstat_update because the work item might have been sleeping during the idle period and see the cpu_stat_off updated after the wake up. The VM_BUG_ON is therefore misleading and no more appropriate. Moreover it doesn't really suite any protection from real bugs because vmstat_shepherd will simply reschedule the vmstat_work anytime it sees a particular cpu set or vmstat_update would do the same from the worker context directly. Even when the two would race the result wouldn't be incorrect as the counters update is fully idempotent. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-23Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds3-19/+18
Pull final vfs updates from Al Viro: - The ->i_mutex wrappers (with small prereq in lustre) - a fix for too early freeing of symlink bodies on shmem (they need to be RCU-delayed) (-stable fodder) - followup to dedupe stuff merged this cycle * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: abort dedupe loop if fatal signals are pending make sure that freeing shmem fast symlinks is RCU-delayed wrappers for ->i_mutex access lustre: remove unused declaration
2016-01-22tree wide: use kvfree() than conditional kfree()/vfree()Tetsuo Handa1-11/+7
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22dax: add support for fsync/syncRoss Zwisler1-0/+6
To properly handle fsync/msync in an efficient way DAX needs to track dirty pages so it is able to flush them durably to media on demand. The tracking of dirty pages is done via the radix tree in struct address_space. This radix tree is already used by the page writeback infrastructure for tracking dirty pages associated with an open file, and it already has support for exceptional (non struct page*) entries. We build upon these features to add exceptional entries to the radix tree for DAX dirty PMD or PTE pages at fault time. [dan.j.williams@intel.com: fix dax_pmd_dbg build warning] Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22mm: add find_get_entries_tag()Ross Zwisler1-0/+68
Add find_get_entries_tag() to the family of functions that include find_get_entries(), find_get_pages() and find_get_pages_tag(). This is needed for DAX dirty page handling because we need a list of both page offsets and radix tree entries ('indices' and 'entries' in this function) that are marked with the PAGECACHE_TAG_TOWRITE tag. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22dax: support dirty DAX entries in radix treeRoss Zwisler4-39/+60
Add support for tracking dirty DAX entries in the struct address_space radix tree. This tree is already used for dirty page writeback, and it already supports the use of exceptional (non struct page*) entries. In order to properly track dirty DAX pages we will insert new exceptional entries into the radix tree that represent dirty DAX PTE or PMD pages. These exceptional entries will also contain the writeback addresses for the PTE or PMD faults that we can use at fsync/msync time. There are currently two types of exceptional entries (shmem and shadow) that can be placed into the radix tree, and this adds a third. We rely on the fact that only one type of exceptional entry can be found in a given radix tree based on its usage. This happens for free with DAX vs shmem but we explicitly prevent shadow entries from being added to radix trees for DAX mappings. The only shadow entries that would be generated for DAX radix trees would be to track zero page mappings that were created for holes. These pages would receive minimal benefit from having shadow entries, and the choice to have only one type of exceptional entry in a given radix tree makes the logic simpler both in clear_exceptional_entry() and in the rest of DAX. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22make sure that freeing shmem fast symlinks is RCU-delayedAl Viro1-5/+4
Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22wrappers for ->i_mutex accessAl Viro3-14/+14
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-21mm: fix kernel crash in khugepaged threadyalin wang1-3/+3
This crash is caused by NULL pointer deference, in page_to_pfn() marco, when page == NULL : Unable to handle kernel NULL pointer dereference at virtual address 00000000 Internal error: Oops: 94000006 [#1] SMP Modules linked in: CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3 PC is at khugepaged+0x378/0x1af8 LR is at khugepaged+0x418/0x1af8 Process khugepaged (pid: 26, stack limit = 0xffffffc079638020) Call trace: khugepaged+0x378/0x1af8 kthread+0xdc/0xf4 ret_from_fork+0xc/0x40 Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0) ---[ end trace 637503d8e28ae69e ]--- Kernel panic - not syncing: Fatal exception CPU2: stopping CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3 Hardware name: linux,dummy-virt (DT) [akpm@linux-foundation.org: fix fat-fingered merge resolution] Signed-off-by: yalin wang <yalin.wang2010@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21mm: fix mlock accoutingKirill A. Shutemov1-1/+1
Tetsuo Handa reported underflow of NR_MLOCK on munlock. Testcase: #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #define BASE ((void *)0x400000000000) #define SIZE (1UL << 21) int main(int argc, char *argv[]) { void *addr; system("grep Mlocked /proc/meminfo"); addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED, -1, 0); if (addr == MAP_FAILED) printf("mmap() failed\n"), exit(1); munmap(addr, SIZE); system("grep Mlocked /proc/meminfo"); return 0; } It happens on munlock_vma_page() due to unfortunate choice of nr_pages data type: __mod_zone_page_state(zone, NR_MLOCK, -nr_pages); For unsigned int nr_pages, implicitly casted to long in __mod_zone_page_state(), it becomes something around UINT_MAX. munlock_vma_page() usually called for THP as small pages go though pagevec. Let's make nr_pages signed int. Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in mod_zone_page_state()") used `long' type, but `int' here is OK for a count of the number of sub-pages in a huge page. Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Michel Lespinasse <walken@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21thp: change pmd_trans_huge_lock() interface to return ptlKirill A. Shutemov3-13/+20
After THP refcounting rework we have only two possible return values from pmd_trans_huge_lock(): success and failure. Return-by-pointer for ptl doesn't make much sense in this case. Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on failure. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20mm: memcontrol: add "sock" to cgroup2 memory.statJohannes Weiner1-0/+6
Provide statistics on how much of a cgroup's memory footprint is made up of socket buffers from network connections owned by the group. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>