aboutsummaryrefslogtreecommitdiffstats
path: root/mm (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2019-09-11swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtableChristoph Hellwig1-27/+2
There is no need to wrap the common version, just wire them up directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen: remove the exports for xen_{create,destroy}_contiguous_regionChristoph Hellwig2-4/+0
These routines are only used by swiotlb-xen, which cannot be modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen/arm: remove xen_dma_opsChristoph Hellwig4-8/+4
arm and arm64 can just use xen_swiotlb_dma_ops directly like x86, no need for a pointer indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen/arm: simplify dma_cache_maintChristoph Hellwig1-40/+21
Calculate the required operation in the caller, and pass it directly instead of recalculating it for each page, and use simple arithmetics to get from the physical address to Xen page size aligned chunks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen/arm: use dev_is_dma_coherentChristoph Hellwig3-21/+6
Use the dma-noncoherent dev_is_dma_coherent helper instead of the home grown variant. Note that both are always initialized to the same value in arch_setup_dma_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen/arm: consolidate page-coherent.hChristoph Hellwig3-150/+80
Shared the duplicate arm/arm64 code in include/xen/arm/page-coherent.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainanceChristoph Hellwig4-75/+28
Copy the arm64 code that uses the dma-direct/swiotlb helpers for DMA on-coherent devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-04arm: remove wrappers for the generic dma remap helpersChristoph Hellwig1-27/+5
Remove a few tiny wrappers around the generic dma remap code. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: introduce a dma_common_find_pages helperChristoph Hellwig4-20/+16
A helper to find the backing page array based on a virtual address. This also ensures we do the same vm_flags check everywhere instead of slightly different or missing ones in a few places. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: always use VM_DMA_COHERENT for generic DMA remapChristoph Hellwig5-28/+21
Currently the generic dma remap allocator gets a vm_flags passed by the caller that is a little confusing. We just introduced a generic vmalloc-level flag to identify the dma coherent allocations, so use that everywhere and remove the now pointless argument. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04vmalloc: lift the arm flag for coherent mappings to common codeChristoph Hellwig4-19/+13
The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA coherent remapping for a while. Lift this flag to common code so that we can use it generically. We also check it in the only place VM_USERMAP is directly check so that we can entirely replace that flag as well (although I'm not even sure why we'd want to allow remapping DMA appings, but I'd rather not change behavior). Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: provide a better default ->get_required_maskChristoph Hellwig4-27/+14
Most dma_map_ops instances are IOMMUs that work perfectly fine in 32-bits of IOVA space, and the generic direct mapping code already provides its own routines that is intelligent based on the amount of memory actually present. Wire up the dma-direct routine for the ARM direct mapping code as well, and otherwise default to the constant 32-bit mask. This way we only need to override it for the occasional odd IOMMU that requires 64-bit IOVA support, or IOMMU drivers that are more efficient if they can fall back to the direct mapping. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: remove the dma_declare_coherent_memory exportChristoph Hellwig1-1/+0
dma_declare_coherent_memory is something that the platform setup code (which pretty much means the device tree these days) need to do so that drivers can use the memory as declared by the platform. Drivers themselves have no business calling this function. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04remoteproc: don't allow modular buildChristoph Hellwig1-1/+1
Remoteproc started using dma_declare_coherent_memory recently, which is a bad idea from drivers, and the maintainers agreed to fix that. But until that is fixed only allow building the driver built in so that we can remove the dma_declare_coherent_memory export and prevent other drivers from "accidentally" using it like remoteproc. Note that the driver would also leak the declared coherent memory on unload if it actually was built as a module at the moment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2019-09-04dma-mapping: remove the dma_mmap_from_dev_coherent exportChristoph Hellwig1-1/+0
dma_mmap_from_dev_coherent is only used by dma_map_ops instances, none of which is modular. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: remove dma_release_declared_memoryChristoph Hellwig3-28/+0
This function is entirely unused given that declared memory is generally provided by platform setup code. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: remove dma_{alloc,free,mmap}_writecombineChristoph Hellwig2-15/+5
We can already use DMA_ATTR_WRITE_COMBINE or the _wc prefixed version, so remove the third way of doing things. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
2019-09-04dma-mapping: remove CONFIG_ARCH_NO_COHERENT_DMA_MMAPChristoph Hellwig7-15/+5
CONFIG_ARCH_NO_COHERENT_DMA_MMAP is now functionally identical to !CONFIG_MMU, so remove the separate symbol. The only difference is that arm did not set it for !CONFIG_MMU, but arm uses a separate dma mapping implementation including its own mmap method, which is handled by moving the CONFIG_MMU check in dma_can_mmap so that is only applies to the dma-direct case, just as the other ifdefs for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
2019-09-04parisc: don't set ARCH_NO_COHERENT_DMA_MMAPChristoph Hellwig3-3/+0
parisc is the only architecture that sets ARCH_NO_COHERENT_DMA_MMAP when an MMU is enabled. AFAIK this is because parisc CPUs use VIVT caches, which means exporting normally cachable memory to userspace is relatively dangrous due to cache aliasing. But normally cachable memory is only allocated by dma_alloc_coherent on parisc when using the sba_iommu or ccio_iommu drivers, so just remove the .mmap implementation for them so that we don't have to set ARCH_NO_COHERENT_DMA_MMAP, which I plan to get rid of. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04arm-nommu: call dma_mmap_from_dev_coherent directlyChristoph Hellwig1-2/+3
There is no need to go through dma_common_mmap for the arm-nommu dma mmap implementation as the only possible memory not handled above could be that from the per-device coherent pool. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04ALSA: pcm: use dma_can_mmap() to check if a device supports dma_mmap_*Christoph Hellwig1-7/+6
Replace the local hack with the dma_can_mmap helper to check if a given device supports mapping DMA allocations to userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Takashi Iwai <tiwai@suse.de>
2019-09-04dma-mapping: add a dma_can_mmap helperChristoph Hellwig2-0/+28
Add a helper to check if DMA allocations for a specific device can be mapped to userspace using dma_mmap_*. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: explicitly wire up ->mmap and ->get_sgtableChristoph Hellwig15-8/+42
While the default ->mmap and ->get_sgtable implementations work for the majority of our dma_map_ops impementations they are inherently safe for others that don't use the page allocator or CMA and/or use their own way of remapping not covered by the common code. So remove the defaults if these methods are not wired up, but instead wire up the default implementations for all safe instances. Fixes: e1c7e324539a ("dma-mapping: always provide the dma_map_ops based implementation") Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04dma-mapping: move the dma_get_sgtable API comments from arm to common codeChristoph Hellwig2-11/+11
The comments are spot on and should be near the central API, not just near a single implementation. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03dma-mapping: fix filename referencesAndy Shevchenko5-7/+4
After commit cf65a0f6f6ff ("dma-mapping: move all DMA mapping code to kernel/dma") some of the files are referring to outdated information, i.e. old file names of DMA mapping sources. Fix it here. Note, the lines with "Glue code for..." have been removed completely. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03iommu/dma: add a new dma_map_ops of get_merge_boundary()Yoshihiro Shimoda1-0/+8
This patch adds a new dma_map_ops of get_merge_boundary() to expose the DMA merge boundary if the domain type is IOMMU_DOMAIN_DMA. Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03dma-mapping: introduce dma_get_merge_boundary()Yoshihiro Shimoda3-0/+25
This patch adds a new DMA API "dma_get_merge_boundary". This function returns the DMA merge boundary if the DMA layer can merge the segments. This patch also adds the implementation for a new dma_map_ops pointer. Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03mmc: queue: use bigger segments if DMA MAP layer can merge the segmentsYoshihiro Shimoda2-3/+33
When the max_segs of a mmc host is smaller than 512, the mmc subsystem tries to use 512 segments if DMA MAP layer can merge the segments, and then the mmc subsystem exposes such information to the block layer by using blk_queue_can_use_dma_map_merging(). Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03block: add a helper function to merge the segmentsYoshihiro Shimoda2-0/+25
This patch adds a helper function whether a queue can merge the segments by the DMA MAP layer (e.g. via IOMMU). Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Simon Horman <horms+renesas@verge.net.au Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-29MIPS: document mixing "slightly different CCAs"Christoph Hellwig1-0/+7
Based on an email from Paul Burton, quoting section 4.8 "Cacheability and Coherency Attributes and Access Types" of "MIPS Architecture Volume 1: Introduction to the MIPS32 Architecture" (MD00080, revision 6.01). Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul Burton <paul.burton@mips.com>
2019-08-29arm64: document the choice of page attributes for pgprot_dmacoherentChristoph Hellwig1-0/+8
Based on an email from Will Deacon. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Will Deacon <will@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com>
2019-08-29dma-mapping: make dma_atomic_pool_init self-containedChristoph Hellwig6-28/+14
The memory allocated for the atomic pool needs to have the same mapping attributes that we use for remapping, so use pgprot_dmacoherent instead of open coding it. Also deduct a suitable zone to allocate the memory from based on the presence of the DMA zones. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-29dma-mapping: remove arch_dma_mmap_pgprotChristoph Hellwig13-34/+35
arch_dma_mmap_pgprot is used for two things: 1) to override the "normal" uncached page attributes for mapping memory coherent to devices that can't snoop the CPU caches 2) to provide the special DMA_ATTR_WRITE_COMBINE semantics on older arm systems and some mips platforms Replace one with the pgprot_dmacoherent macro that is already provided by arm and much simpler to use, and lift the DMA_ATTR_WRITE_COMBINE handling to common code with an explicit arch opt-in. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Paul Burton <paul.burton@mips.com> # mips
2019-08-26arm-nommu: remove the unused pgprot_dmacoherent defineChristoph Hellwig1-1/+0
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-26unicore32: remove the unused pgprot_dmacoherent defineChristoph Hellwig1-2/+0
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-25Linux 5.3-rc6Linus Torvalds1-1/+1
2019-08-24mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=yAndrey Ryabinin1-2/+8
The code like this: ptr = kmalloc(size, GFP_KERNEL); page = virt_to_page(ptr); offset = offset_in_page(ptr); kfree(page_address(page) + offset); may produce false-positive invalid-free reports on the kernel with CONFIG_KASAN_SW_TAGS=y. In the example above we lose the original tag assigned to 'ptr', so kfree() gets the pointer with 0xFF tag. In kfree() we check that 0xFF tag is different from the tag in shadow hence print false report. Instead of just comparing tags, do the following: 1) Check that shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's double-free and it doesn't matter what tag the pointer have. 2) If pointer tag is different from 0xFF, make sure that tag in the shadow is the same as in the pointer. Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Walter Wu <walter-zh.wu@mediatek.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/zsmalloc.c: fix race condition in zs_destroy_poolHenry Burns1-2/+59
In zs_destroy_pool() we call flush_work(&pool->free_work). However, we have no guarantee that migration isn't happening in the background at that time. Since migration can't directly free pages, it relies on free_work being scheduled to free the pages. But there's nothing preventing an in-progress migrate from queuing the work *after* zs_unregister_migration() has called flush_work(). Which would mean pages still pointing at the inode when we free it. Since we know at destroy time all objects should be free, no new migrations can come in (since zs_page_isolate() fails for fully-free zspages). This means it is sufficient to track a "# isolated zspages" count by class, and have the destroy logic ensure all such pages have drained before proceeding. Keeping that state under the class spinlock keeps the logic straightforward. In this case a memory leak could lead to an eventual crash if compaction hits the leaked page. This crash would only occur if people are changing their zswap backend at runtime (which eventually starts destruction). Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com Fixes: 48b4800a1c6a ("zsmalloc: page migration support") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitelyHenry Burns1-4/+15
In zs_page_migrate() we call putback_zspage() after we have finished migrating all pages in this zspage. However, the return value is ignored. If a zs_free() races in between zs_page_isolate() and zs_page_migrate(), freeing the last object in the zspage, putback_zspage() will leave the page in ZS_EMPTY for potentially an unbounded amount of time. To fix this, we need to do the same thing as zs_page_putback() does: schedule free_work to occur. To avoid duplicated code, move the sequence to a new putback_zspage_deferred() function which both zs_page_migrate() and zs_page_putback() call. Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com Fixes: 48b4800a1c6a ("zsmalloc: page migration support") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm, page_owner: handle THP splits correctlyVlastimil Babka1-0/+4
THP splitting path is missing the split_page_owner() call that split_page() has. As a result, split THP pages are wrongly reported in the page_owner file as order-9 pages. Furthermore when the former head page is freed, the remaining former tail pages are not listed in the page_owner file at all. This patch fixes that by adding the split_page_owner() call into __split_huge_page(). Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz Fixes: a9627bc5e34e ("mm/page_owner: introduce split_page_owner and replace manual handling") Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctxOleg Nesterov1-12/+13
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if mm->core_state != NULL. Otherwise a page fault can see userfaultfd_missing() == T and use an already freed userfaultfd_ctx. Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24psi: get poll_work to run when calling poll syscall next timeJason Xing1-0/+8
Only when calling the poll syscall the first time can user receive POLLPRI correctly. After that, user always fails to acquire the event signal. Reproduce case: 1. Get the monitor code in Documentation/accounting/psi.txt 2. Run it, and wait for the event triggered. 3. Kill and restart the process. The question is why we can end up with poll_scheduled = 1 but the work not running (which would reset it to 0). And the answer is because the scheduling side sees group->poll_kworker under RCU protection and then schedules it, but here we cancel the work and destroy the worker. The cancel needs to pair with resetting the poll_scheduled flag. Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm: memcontrol: flush percpu vmevents before releasing memcgRoman Gushchin1-1/+21
Similar to vmstats, percpu caching of local vmevents leads to an accumulation of errors on non-leaf levels. This happens because some leftovers may remain in percpu caches, so that they are never propagated up by the cgroup tree and just disappear into nonexistence with on releasing of the memory cgroup. To fix this issue let's accumulate and propagate percpu vmevents values before releasing the memory cgroup similar to what we're doing with vmstats. Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus. Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm: memcontrol: flush percpu vmstats before releasing memcgRoman Gushchin1-0/+40
Percpu caching of local vmstats with the conditional propagation by the cgroup tree leads to an accumulation of errors on non-leaf levels. Let's imagine two nested memory cgroups A and A/B. Say, a process belonging to A/B allocates 100 pagecache pages on the CPU 0. The percpu cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B and A atomic vmstat counters, 4 pages will remain in the percpu cache. Imagine A/B is nearby memory.max, so that every following allocation triggers a direct reclaim on the local CPU. Say, each such attempt will free 16 pages on a new cpu. That means every percpu cache will have -16 pages, except the first one, which will have 4 - 16 = -12. A/B and A atomic counters will not be touched at all. Now a user removes A/B. All percpu caches are freed and corresponding vmstat numbers are forgotten. A has 96 pages more than expected. As memory cgroups are created and destroyed, errors do accumulate. Even 1-2 pages differences can accumulate into large numbers. To fix this issue let's accumulate and propagate percpu vmstat values before releasing the memory cgroup. At this point these numbers are stable and cannot be changed. Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus. Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24parisc: fix compilation errrorsQian Cai1-2/+1
Commit 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used") converted a few functions from macros to static inline, which causes parisc to complain, In file included from include/asm-generic/4level-fixup.h:38:0, from arch/parisc/include/asm/pgtable.h:5, from arch/parisc/include/asm/io.h:6, from include/linux/io.h:13, from sound/core/memory.c:9: include/asm-generic/5level-fixup.h:14:18: error: unknown type name 'pgd_t'; did you mean 'pid_t'? #define p4d_t pgd_t ^ include/asm-generic/5level-fixup.h:24:28: note: in expansion of macro 'p4d_t' static inline int p4d_none(p4d_t p4d) ^~~~~ It is because "4level-fixup.h" is included before "asm/page.h" where "pgd_t" is defined. Link: http://lkml.kernel.org/r/20190815205305.1382-1-cai@lca.pw Fixes: 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used") Signed-off-by: Qian Cai <cai@lca.pw> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm, page_alloc: move_freepages should not examine struct page of reserved memoryDavid Rientjes1-15/+4
After commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages"), struct page of reserved memory is zeroed. This causes page->flags to be 0 and fixes issues related to reading /proc/kpageflags, for example, of reserved memory. The VM_BUG_ON() in move_freepages_block(), however, assumes that page_zone() is meaningful even for reserved memory. That assumption is no longer true after the aforementioned commit. There's no reason why move_freepages_block() should be testing the legitimacy of page_zone() for reserved memory; its scope is limited only to pages on the zone's freelist. Note that pfn_valid() can be true for reserved memory: there is a backing struct page. The check for page_to_nid(page) is also buggy but reserved memory normally only appears on node 0 so the zeroing doesn't affect this. Move the debug checks to after verifying PageBuddy is true. This isolates the scope of the checks to only be for buddy pages which are on the zone's freelist which move_freepages_block() is operating on. In this case, an incorrect node or zone is a bug worthy of being warned about (and the examination of struct page is acceptable bcause this memory is not reserved). Why does move_freepages_block() gets called on reserved memory? It's simply math after finding a valid free page from the per-zone free area to use as fallback. We find the beginning and end of the pageblock of the valid page and that can bring us into memory that was reserved per the e820. pfn_valid() is still true (it's backed by a struct page), but since it's zero'd we shouldn't make any inferences here about comparing its node or zone. The current node check just happens to succeed most of the time by luck because reserved memory typically appears on node 0. The fix here is to validate that we actually have buddy pages before testing if there's any type of zone or node strangeness going on. We noticed it almost immediately after bringing 907ec5fca3dc in on CONFIG_DEBUG_VM builds. It depends on finding specific free pages in the per-zone free area where the math in move_freepages() will bring the start or end pfn into reserved memory and wanting to claim that entire pageblock as a new migratetype. So the path will be rare, require CONFIG_DEBUG_VM, and require fallback to a different migratetype. Some struct pages were already zeroed from reserve pages before 907ec5fca3c so it theoretically could trigger before this commit. I think it's rare enough under a config option that most people don't run that others may not have noticed. I wouldn't argue against a stable tag and the backport should be easy enough, but probably wouldn't single out a commit that this is fixing. Mel said: : The overhead of the debugging check is higher with this patch although : it'll only affect debug builds and the path is not particularly hot. : If this was a concern, I think it would be reasonable to simply remove : the debugging check as the zone boundaries are checked in : move_freepages_block and we never expect a zone/node to be smaller than : a pageblock and stuck in the middle of another zone. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/z3fold.c: fix race between migration and destructionHenry Burns1-0/+89
In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq). However, we have no guarantee that migration isn't happening in the background at that time. Migration directly calls queue_work_on(pool->compact_wq), if destruction wins that race we are using a destroyed workqueue. Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com Signed-off-by: Henry Burns <henryburns@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24drm/mediatek: include dma-mapping headerDave Airlie1-0/+1
Although it builds fine here in my arm cross compile, it seems either via some other patches in -next or some Kconfig combination, this fails to build for everyone. Include linux/dma-mapping.h should fix it. Signed-off-by: Dave Airlie <airlied@redhat.com>
2019-08-23KVM: arm/arm64: VGIC: Properly initialise private IRQ affinityAndre Przywara1-10/+20
At the moment we initialise the target *mask* of a virtual IRQ to the VCPU it belongs to, even though this mask is only defined for GICv2 and quickly runs out of bits for many GICv3 guests. This behaviour triggers an UBSAN complaint for more than 32 VCPUs: ------ [ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21 [ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int' ------ Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs dump is wrong, due to this very same problem. Because there is no requirement to create the VGIC device before the VCPUs (and QEMU actually does it the other way round), we can't safely initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch every private IRQ for each VCPU anyway later (in vgic_init()), we can just move the initialisation of those fields into there, where we definitely know the VGIC type. On the way make sure we really have either a VGICv2 or a VGICv3 device, since the existing code is just checking for "VGICv3 or not", silently ignoring the uninitialised case. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reported-by: Dave Martin <dave.martin@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2019-08-23RDMA/siw: Fix 64/32bit pointer inconsistencyBernard Metzler9-109/+108
Fixes improper casting between addresses and unsigned types. Changes siw_pbl_get_buffer() function to return appropriate dma_addr_t, and not u64. Also fixes debug prints. Now any potentially kernel private pointers are printed formatted as '%pK', to allow keeping that information secret. Fixes: d941bfe500be ("RDMA/siw: Change CQ flags from 64->32 bits") Fixes: b0fff7317bb4 ("rdma/siw: completion queue methods") Fixes: 8b6a361b8c48 ("rdma/siw: receive path") Fixes: b9be6f18cf9e ("rdma/siw: transmit path") Fixes: f29dd55b0236 ("rdma/siw: queue pair methods") Fixes: 2251334dcac9 ("rdma/siw: application buffer management") Fixes: 303ae1cdfdf7 ("rdma/siw: application interface") Fixes: 6c52fdc244b5 ("rdma/siw: connection management") Fixes: a531975279f3 ("rdma/siw: main include file") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Jason Gunthorpe <jgg@ziepe.ca> Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com> Link: https://lore.kernel.org/r/20190822173738.26817-1-bmt@zurich.ibm.com Signed-off-by: Doug Ledford <dledford@redhat.com>