path: root/arch/powerpc/mm
Age | Commit message | Author | Files | Lines
2020-01-31powerpc: book3s64: convert to pin_user_pages() and put_user_page()John Hubbard1-5/+5
1. Convert from get_user_pages() to pin_user_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). Link: http://lkml.kernel.org/r/20200107224558.2362728-21-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
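The conversion follows a mechanical pattern; a minimal sketch, assuming illustrative call-site and helper names rather than the actual book3s64 iommu code:

```c
#include <linux/mm.h>

/* was: get_user_pages(ua, entries, FOLL_WRITE, pages, NULL); */
static long sketch_pin_pages(unsigned long ua, unsigned long entries,
			     struct page **pages)
{
	return pin_user_pages(ua, entries, FOLL_WRITE, pages, NULL);
}

/* was: put_page(pages[i]); */
static void sketch_unpin_pages(struct page **pages, long nr)
{
	long i;

	for (i = 0; i < nr; i++)
		put_user_page(pages[i]);
}
```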
2020-01-04mm/memory_hotplug: shrink zones when offlining memoryDavid Hildenbrand1-2/+1
We currently try to shrink a single zone when removing memory. We use the zone of the first page of the memory we are removing. If that memmap was never initialized (e.g., memory was never onlined), we will read garbage and can trigger kernel BUGs (due to a stale pointer): BUG: unable to handle page fault for address: 000000000000353d #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:clear_zone_contiguous+0x5/0x10 Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840 RSP: 0018:ffffad2400043c98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000 RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40 RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680 FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __remove_pages+0x4b/0x640 arch_remove_memory+0x63/0x8d try_remove_memory+0xdb/0x130 __remove_memory+0xa/0x11 acpi_memory_device_remove+0x70/0x100 acpi_bus_trim+0x55/0x90 acpi_device_hotplug+0x227/0x3a0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x221/0x550 worker_thread+0x50/0x3b0 kthread+0x105/0x140 ret_from_fork+0x3a/0x50 Modules linked in: CR2: 000000000000353d Instead, shrink the zones when offlining memory or when onlining failed. Introduce and use remove_pfn_range_from_zone(() for that. We now properly shrink the zones, even if we have DIMMs whereby - Some memory blocks fall into no zone (never onlined) - Some memory blocks fall into multiple zones (offlined+re-onlined) - Multiple memory blocks that fall into different zones Drop the zone parameter (with a potential dubious value) from __remove_pages() and __remove_section(). Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-23powerpc/mm: Mark get_slice_psize() & slice_addr_is_low() as notraceMichael Ellerman1-2/+2
These slice routines are called from the SLB miss handler, which can lead to warnings from the IRQ code, because we have not reconciled the IRQ state properly: WARNING: CPU: 72 PID: 30150 at arch/powerpc/kernel/irq.c:258 arch_local_irq_restore.part.0+0xcc/0x100 Modules linked in: CPU: 72 PID: 30150 Comm: ftracetest Not tainted 5.5.0-rc2-gcc9x-g7e0165b2f1a9 #1 NIP: c00000000001d83c LR: c00000000029ab90 CTR: c00000000026cf90 REGS: c0000007eee3b960 TRAP: 0700 Not tainted (5.5.0-rc2-gcc9x-g7e0165b2f1a9) MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 22242844 XER: 20000000 CFAR: c00000000001d780 IRQMASK: 0 ... NIP arch_local_irq_restore.part.0+0xcc/0x100 LR trace_graph_entry+0x270/0x340 Call Trace: trace_graph_entry+0x254/0x340 (unreliable) function_graph_enter+0xe4/0x1a0 prepare_ftrace_return+0xa0/0x130 ftrace_graph_caller+0x44/0x94 # (get_slice_psize()) slb_allocate_user+0x7c/0x100 do_slb_fault+0xf8/0x300 instruction_access_slb_common+0x140/0x180 Fixes: 48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191221121337.4894-1-mpe@ellerman.id.au
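For illustration only, the kind of annotation involved (the real routines live in slice.c; this is a sketch, not the actual diff):

```c
#include <linux/compiler.h>	/* notrace */
#include <asm/page.h>		/* pulls in SLICE_LOW_TOP via asm/slice.h */

/* notrace keeps ftrace's graph-entry code out of a routine that can run
 * from the SLB miss handler, before the IRQ state has been reconciled */
static bool notrace slice_addr_is_low_sketch(unsigned long addr)
{
	return (u64)addr < SLICE_LOW_TOP;
}
```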
2019-12-16powerpc/8xx: fix bogus __init on mmu_mapin_ram_chunk()Christophe Leroy1-1/+1
Remove __init qualifier for mmu_mapin_ram_chunk() as it is called by mmu_mark_initmem_nx() and mmu_mark_rodata_ro() which are not __init functions. At the same time, mark it static as it is only used in this file. Reported-by: kbuild test robot <lkp@intel.com> Fixes: a2227a277743 ("powerpc/32: Don't populate page tables for block mapped pages except on the 8xx") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/56648921986a6b3e7315b1fbbf4684f21bd2dea8.1576310997.git.christophe.leroy@c-s.fr
2019-12-13powerpc: Ensure that swiotlb buffer is allocated from low memoryMike Rapoport1-0/+8
Some powerpc platforms (e.g. 85xx) limit DMA-able memory way below 4G. If a system has more physical memory than this limit, the swiotlb buffer is not addressable because it is allocated from memblock using top-down mode. Force memblock to bottom-up mode before calling swiotlb_init() to ensure that the swiotlb buffer is DMA-able. Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191204123524.22919-1-rppt@kernel.org
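A rough sketch of the approach (the wrapper and its placement are assumptions; memblock_set_bottom_up() and swiotlb_init() are the real interfaces):

```c
#include <linux/memblock.h>
#include <linux/swiotlb.h>

void __init sketch_setup_swiotlb(void)
{
	/* allocate the bounce buffer bottom-up so it lands in low,
	 * DMA-able memory on platforms that cap DMA below 4G (e.g. 85xx) */
	memblock_set_bottom_up(true);
	swiotlb_init(0);
	memblock_set_bottom_up(false);
}
```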
2019-12-05powerpc/pmem: Fix kernel crash due to wrong range value usage in flush_dcache_rangeAneesh Kumar K.V1-1/+1
This patch fixes the kernel crash below. BUG: Unable to handle kernel data access on read at 0xc000000380000000 Faulting instruction address: 0xc00000000008b6f0 cpu 0x5: Vector: 300 (Data Access) at [c0000000d8587790] pc: c00000000008b6f0: arch_remove_memory+0x150/0x210 lr: c00000000008b720: arch_remove_memory+0x180/0x210 sp: c0000000d8587a20 msr: 800000000280b033 dar: c000000380000000 dsisr: 40000000 current = 0xc0000000d8558600 paca = 0xc00000000fff8f00 irqmask: 0x03 irq_happened: 0x01 pid = 1220, comm = ndctl enter ? for help memunmap_pages+0x33c/0x410 devm_action_release+0x30/0x50 release_nodes+0x30c/0x3a0 device_release_driver_internal+0x178/0x240 unbind_store+0x74/0x190 drv_attr_store+0x44/0x60 sysfs_kf_write+0x74/0xa0 kernfs_fop_write+0x1b0/0x260 __vfs_write+0x3c/0x70 vfs_write+0xe4/0x200 ksys_write+0x7c/0x140 system_call+0x5c/0x68 Fixes: 076265907cf9 ("powerpc: Chunk calls to flush_dcache_range in arch_*_memory") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191204052909.59145-1-aneesh.kumar@linux.ibm.com
2019-12-01Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge updates from Andrew Morton: "Incoming: - a small number of updates to scripts/, ocfs2 and fs/buffer.c - most of MM I still have quite a lot of material (mostly not MM) staged after linux-next due to -next dependencies. I'll send those across next week as the prerequisites get merged up" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits) mm/page_io.c: annotate refault stalls from swap_readpage mm/Kconfig: fix trivial help text punctuation mm/Kconfig: fix indentation mm/memory_hotplug.c: remove __online_page_set_limits() mm: fix typos in comments when calling __SetPageUptodate() mm: fix struct member name in function comments mm/shmem.c: cast the type of unmap_start to u64 mm: shmem: use proper gfp flags for shmem_writepage() mm/shmem.c: make array 'values' static const, makes object smaller userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register() userfaultfd: wrap the common dst_vma check into an inlined function userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb() userfaultfd: use vma_pagesize for all huge page size calculation mm/madvise.c: use PAGE_ALIGN[ED] for range checking mm/madvise.c: replace with page_size() in madvise_inject_error() mm/mmap.c: make vma_merge() comment more easy to understand mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops autonuma: reduce cache footprint when scanning page tables autonuma: fix watermark checking in migrate_balanced_pgdat() ...
2019-12-01powerpc/mm: remove pmd_huge/pud_huge stubs and include hugetlb.hMike Kravetz1-0/+1
Patch series "hugetlbfs: convert macros to static inline, fix sparse warning". The definition for huge_pte_offset() in <linux/hugetlb.h> causes a sparse warning in the !CONFIG_HUGETLB_PAGE case. Fix this as well as converting all macros in this block of definitions to static inlines for better type checking. When making the above changes, build errors were found in powerpc due to duplicate definitions. A separate powerpc-specific patch is included as a prerequisite to remove the definitions and get them from <linux/hugetlb.h>. This patch (of 2): This removes the powerpc-specific stubs created by commit aad71e3928be ("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when !CONFIG_HUGETLB_PAGE. Instead, it addresses the build break by getting the definitions from <linux/hugetlb.h>. This allows the macros in <linux/hugetlb.h> to be replaced with static inlines. Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Ben Dooks <ben.dooks@codethink.co.uk> Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30Merge tag 'powerpc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds18-135/+764
Pull powerpc updates from Michael Ellerman: "Highlights: - Infrastructure for secure boot on some bare metal Power9 machines. The firmware support is still in development, so the code here won't actually activate secure boot on any existing systems. - A change to xmon (our crash handler / pseudo-debugger) to restrict it to read-only mode when the kernel is lockdown'ed, otherwise it's trivial to drop into xmon and modify kernel data, such as the lockdown state. - Support for KASLR on 32-bit BookE machines (Freescale / NXP). - Fixes for our flush_icache_range() and __kernel_sync_dicache() (VDSO) to work with memory ranges >4GB. - Some reworks of the pseries CMM (Cooperative Memory Management) driver to make it behave more like other balloon drivers and enable some cleanups of generic mm code. - A series of fixes to our hardware breakpoint support to properly handle unaligned watchpoint addresses. Plus a bunch of other smaller improvements, fixes and cleanups. Thanks to: Alastair D'Silva, Andrew Donnellan, Aneesh Kumar K.V, Anthony Steinhauser, Cédric Le Goater, Chris Packham, Chris Smart, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens, David Hildenbrand, Deb McLemore, Diana Craciun, Eric Richter, Geert Uytterhoeven, Greg Kroah-Hartman, Greg Kurz, Gustavo L. F. Walbon, Hari Bathini, Harish, Jason Yan, Krzysztof Kozlowski, Leonardo Bras, Mathieu Malaterre, Mauro S. M. Rodrigues, Michal Suchanek, Mimi Zohar, Nathan Chancellor, Nathan Lynch, Nayna Jain, Nick Desaulniers, Oliver O'Halloran, Qian Cai, Rasmus Villemoes, Ravi Bangoria, Sam Bobroff, Santosh Sivaraj, Scott Wood, Thomas Huth, Tyrel Datwyler, Vaibhav Jain, Valentin Longchamp, YueHaibing" * tag 'powerpc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (144 commits) powerpc/fixmap: fix crash with HIGHMEM x86/efi: remove unused variables powerpc: Define arch_is_kernel_initmem_freed() for lockdep powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp powerpc: Avoid clang warnings around setjmp and longjmp powerpc: Don't add -mabi= flags when building with Clang powerpc: Fix Kconfig indentation powerpc/fixmap: don't clear fixmap area in paging_init() selftests/powerpc: spectre_v2 test must be built 64-bit powerpc/powernv: Disable native PCIe port management powerpc/kexec: Move kexec files into a dedicated subdir. powerpc/32: Split kexec low level code out of misc_32.S powerpc/sysdev: drop simple gpio powerpc/83xx: map IMMR with a BAT. powerpc/32s: automatically allocate BAT in setbat() powerpc/ioremap: warn on early use of ioremap() powerpc: Add support for GENERIC_EARLY_IOREMAP powerpc/fixmap: Use __fix_to_virt() instead of fix_to_virt() powerpc/8xx: use the fixmapped IMMR in cpm_reset() powerpc/8xx: add __init to cpm1 init functions ...
2019-11-29powerpc/fixmap: fix crash with HIGHMEMChristophe Leroy1-0/+6
Commit f2bb86937d86 ("powerpc/fixmap: don't clear fixmap area in paging_init()") removed the clearing of fixmap area in order to avoid clearing fixmapped areas set earlier. However unlike all other users of fixmap which use __set_fixmap(), HIGHMEM functions directly use __set_pte_at(). This means the page table must pre-exist, otherwise the following crash can be encoutered due to the lack of entry in the PGD. Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash PowerMac Modules linked in: CPU: 0 PID: 1 Comm: swapper Not tainted 5.4.0+ #2528 NIP: c0144ce8 LR: c0144ccc CTR: 00000080 REGS: ef0b5aa0 TRAP: 0300 Not tainted (5.4.0+) MSR: 00009032 <EE,ME,IR,DR,RI> CR: 44282842 XER: 00000000 DAR: fffdf000 DSISR: 42000000 GPR00: c0144ccc ef0b5b58 ef0b0000 fffdf000 fffdf000 00000000 c0000f7c 00000000 GPR08: c0833000 fffdf000 00000000 ef1c53c9 24042842 00000000 00000000 00000000 GPR16: 00000000 00000000 ef7e7358 effe8160 00000000 c08a9660 c0851644 00000004 GPR24: c08c70a8 00002dc2 00000000 00000001 00000201 effe8160 effe8160 00000000 NIP [c0144ce8] prep_new_page+0x138/0x178 LR [c0144ccc] prep_new_page+0x11c/0x178 Call Trace: [ef0b5b58] [c0144ccc] prep_new_page+0x11c/0x178 (unreliable) [ef0b5b88] [c0147218] get_page_from_freelist+0x1fc/0xd88 [ef0b5c38] [c0148328] __alloc_pages_nodemask+0xd4/0xbb4 [ef0b5cf8] [c0142ba8] __vmalloc_node_range+0x1b4/0x2e0 [ef0b5d38] [c0142dd0] vzalloc+0x48/0x58 [ef0b5d58] [c0301c8c] check_partition+0x58/0x244 [ef0b5d78] [c02ffe80] blk_add_partitions+0x44/0x2cc [ef0b5db8] [c01a32d8] bdev_disk_changed+0x68/0xfc [ef0b5de8] [c01a4494] __blkdev_get+0x290/0x460 [ef0b5e28] [c02fdd40] __device_add_disk+0x480/0x4d8 [ef0b5e68] [c0810688] brd_init+0xc0/0x188 [ef0b5e88] [c0005194] do_one_initcall+0x40/0x19c [ef0b5ee8] [c07dd4dc] kernel_init_freeable+0x164/0x230 [ef0b5f28] [c0005408] kernel_init+0x18/0x10c [ef0b5f38] [c0014274] ret_from_kernel_thread+0x14/0x1c Partially revert that commit to still clear the fixmap area dedicated to HIGHMEM. Fixes: f2bb86937d86 ("powerpc/fixmap: don't clear fixmap area in paging_init()") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/d42fa9747df5afa41e67b08e374c98d3b40529c9.1574927918.git.christophe.leroy@c-s.fr
2019-11-25powerpc/fixmap: don't clear fixmap area in paging_init()Christophe Leroy1-8/+0
fixmap is intended to map things permanently, like the IMMR region on FSL SoCs (8xx, 83xx, ...), so don't clear it in paging_init(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/41c99bc06394a6bc2888631cb98a3ed2ae281ddb.1568295907.git.christophe.leroy@c-s.fr
2019-11-21Merge branch 'for-next/zone-dma' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into dma-mapping-for-nextChristoph Hellwig1-5/+15
Pull in a stable branch from the arm64 tree that adds the zone_dma_bits variable to avoid creating hard to resolve conflicts with that addition.
2019-11-20dma-mapping: drop the dev argument to arch_sync_dma_for_*Christoph Hellwig1-4/+4
These are pure cache maintenance routines, so drop the unused struct device argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2019-11-19powerpc/32s: automatically allocate BAT in setbat()Christophe Leroy1-1/+10
If no BAT is given to setbat(), select an available BAT. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a212bd36fbd6179e0929b6c727febc35132ac25c.1568665466.git.christophe.leroy@c-s.fr
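A condensed sketch of the idea; the body of setbat() is elided and the find_free_bat() helper is assumed from the earlier BAT-mapping rework:

```c
void __init setbat(int index, unsigned long virt, phys_addr_t phys,
		   unsigned int size, pgprot_t flags)
{
	if (index == -1)
		index = find_free_bat();	/* pick any unused BAT pair */
	if (index == -1) {
		pr_err("%s: no BAT available for mapping 0x%llx\n",
		       __func__, (unsigned long long)phys);
		return;
	}
	/* ... program IBAT/DBAT 'index' exactly as before ... */
}
```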
2019-11-19powerpc/ioremap: warn on early use of ioremap()Christophe Leroy2-0/+3
Powerpc now has EARLY_IOREMAP. Next step is to convert all early users of ioremap() to early_ioremap(). Add a warning to help locate those users. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b4f03a68ee8e68773c8973d74ec35f9c82c72871.1568295907.git.christophe.leroy@c-s.fr
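A sketch of what such a warning can look like; whether it sits in ioremap() itself or deeper in the common path is an assumption here:

```c
void __iomem *ioremap(phys_addr_t addr, unsigned long size)
{
	pgprot_t prot = pgprot_noncached(PAGE_KERNEL);

	/* early callers should be converted to early_ioremap() */
	WARN_ON_ONCE(!slab_is_available());

	return __ioremap_caller(addr, size, prot, __builtin_return_address(0));
}
```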
2019-11-18powerpc/32: Don't populate page tables for block mapped pages except on the 8xx.Christophe Leroy2-7/+50
Commit d2f15e0979ee ("powerpc/32: always populate page tables for Abatron BDI.") wrongly sets page tables on all PPC32 platforms for BDI use, and doesn't update them after init (remove RX on the init section, set text and rodata read-only). Only the 8xx requires page tables to be populated for using the BDI. They also need to be populated in order to see the mappings in /sys/kernel/debug/kernel_page_tables. On BOOK3S_32, pages that are not mapped by page tables are mapped by BATs. The BDI knows BATs and they can be viewed in /sys/kernel/debug/powerpc/block_address_translation. Only set page tables for RAM and IMMR on the 8xx and properly update them at the end of init. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c8610942203e0d93fcb02ad20c57edd3adb4c9d3.1566554029.git.christophe.leroy@c-s.fr
2019-11-18powerpc/mm: Show if a bad page fault on data is read or write.Christophe Leroy1-2/+4
DSISR (or ESR on some CPUs) has a bit to tell if the fault is due to a read or a write. Display it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Santosh Sivaraj <santosh@fossix.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/4f88d7e6fda53b5f80a71040ab400242f6c8cb93.1566400889.git.christophe.leroy@c-s.fr
2019-11-14Merge branch 'topic/kaslr-book3e32' into nextMichael Ellerman7-12/+426
This is a slight rebase of Scott's next branch, which contained the KASLR support for book3e 32-bit, to squash in a couple of small fixes. See the original pull request: https://lore.kernel.org/r/20191022232155.GA26174@home.buserror.net
2019-11-13powerpc/fsl_booke/kaslr: support nokaslr cmdline parameterJason Yan1-0/+7
One may want to disable KASLR at boot, so provide a cmdline parameter 'nokaslr' to support this. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
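KASLR runs before the regular early_param() parsing, so a sketch of how the flag can be detected, assuming the flattened device tree's /chosen bootargs is scanned:

```c
#include <linux/libfdt.h>
#include <linux/string.h>

static bool __init sketch_kaslr_disabled(const void *fdt)
{
	int node = fdt_path_offset(fdt, "/chosen");
	const char *cmdline = NULL;

	if (node >= 0)
		cmdline = fdt_getprop(fdt, node, "bootargs", NULL);

	return cmdline && strstr(cmdline, "nokaslr");
}
```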
2019-11-13powerpc/fsl_booke/kaslr: clear the original kernel if randomizedJason Yan3-0/+14
The original kernel image still exists in memory; clear it now. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13powerpc/fsl_booke/32: randomize the kernel image offsetJason Yan1-2/+323
Now that we have basic support for relocating the kernel to an appropriate place, we can start to randomize the offset. Entropy is derived from the banner and timer, which will change every build and boot. This is not very safe, so additionally the bootloader may pass entropy via the /chosen/kaslr-seed node in the device tree. We will use the first 512M of low memory to randomize the kernel image. The memory will be split into 64M zones. We will use the lower 8 bits of the entropy to decide the index of the 64M zone. Then we choose a 16K-aligned offset inside the 64M zone to put the kernel in. We also check whether we would overlap with some areas like the dtb area, the initrd area or the crashkernel area. If we cannot find a proper area, KASLR will be disabled and we boot from the original kernel location. Some pieces of code are derived from arch/x86/boot/compressed/kaslr.c or arch/arm64/kernel/kaslr.c such as rotate_xor(). Credit goes to Kees and Ard. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
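A back-of-the-envelope sketch of the selection described above; overlap checks against the dtb, initrd and crashkernel areas are omitted and the helper name is illustrative:

```c
#include <linux/sizes.h>
#include <linux/types.h>

static unsigned long __init sketch_kaslr_offset(u64 entropy,
						unsigned long kernel_sz)
{
	/* low 8 bits of entropy pick one of the 64M zones in the first 512M */
	unsigned long zone = (entropy & 0xff) % (SZ_512M / SZ_64M);
	/* remaining bits pick a 16K-aligned offset inside that zone */
	unsigned long off = (entropy >> 8) % (SZ_64M - kernel_sz);

	off &= ~(SZ_16K - 1);
	return zone * SZ_64M + off;
}
```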
2019-11-13powerpc/fsl_booke/32: implement KASLR infrastructureJason Yan4-2/+75
This patch adds support for booting the kernel from places other than KERNELBASE. Since CONFIG_RELOCATABLE is already supported, all we need to do is map or copy the kernel to a proper place and relocate. Freescale Book-E parts expect lowmem to be mapped by fixed TLB entries (TLB1). The TLB1 entries are not suitable to map the kernel directly in a randomized region, so we chose to copy the kernel to a proper place and restart to relocate. The offset of the kernel is not randomized yet (a fixed 64M is set). We will randomize it in the next patch. Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Scott Wood <oss@buserror.net> [mpe: Use PTRRELOC() in early_init()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13powerpc/fsl_booke/32: introduce reloc_kernel_entry() helperJason Yan1-0/+1
Add a new helper reloc_kernel_entry() to jump back to the start of the new kernel. After we put the new kernel in a randomized place we can use this new helper to enter the kernel and begin to relocate again. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helperJason Yan1-0/+1
Add a new helper create_kaslr_tlb_entry() to create a tlb entry by the virtual and physical address. This is a preparation to support boot kernel at a randomized address. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13powerpc: introduce kernstart_virt_addr to store the kernel baseJason Yan1-0/+2
Now the kernel base is a fixed value - KERNELBASE. To support KASLR, we need a variable to store the kernel base. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13powerpc: move memstart_addr and kernstart_addr to init-common.cJason Yan3-10/+5
These two variables are both defined in init_32.c and init_64.c. Move them to init-common.c and make them __ro_after_init. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Diana Craciun <diana.craciun@nxp.com> Tested-by: Diana Craciun <diana.craciun@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
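The consolidated definitions amount to roughly the following (a sketch; the initial value and export flavour are carried over from the previous init_32.c/init_64.c code and should be treated as assumptions):

```c
/* arch/powerpc/mm/init-common.c (sketch) */
phys_addr_t memstart_addr __ro_after_init = (phys_addr_t)~0ull;
EXPORT_SYMBOL_GPL(memstart_addr);

phys_addr_t kernstart_addr __ro_after_init;
EXPORT_SYMBOL_GPL(kernstart_addr);
```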
2019-11-07powerpc: Don't flush caches when adding memoryAlastair D'Silva1-2/+0
This operation takes a significant amount of time when hotplugging large amounts of memory (~50 seconds with 890GB of persistent memory). This was originally in commit fb5924fddf9e ("powerpc/mm: Flush cache on memory hot(un)plug") to support memtrace, but the flush on add is not needed as it is flushed on remove. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191104023305.9581-7-alastair@au1.ibm.com
2019-11-07powerpc: Chunk calls to flush_dcache_range in arch_*_memoryAlastair D'Silva1-2/+25
When presented with large amounts of memory being hotplugged (in my test case, ~890GB), the call to flush_dcache_range takes a while (~50 seconds), triggering RCU stalls. This patch breaks up the call into 1GB chunks, calling cond_resched() in between to allow the scheduler to run. Fixes: fb5924fddf9e ("powerpc/mm: Flush cache on memory hot(un)plug") Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191104023305.9581-6-alastair@au1.ibm.com
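The chunking pattern looks roughly like this (a sketch; the chunk-size macro and helper name are assumptions consistent with the description):

```c
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sizes.h>
#include <asm/cacheflush.h>

#define FLUSH_CHUNK_SIZE SZ_1G

static void sketch_flush_dcache_range_chunked(unsigned long start,
					       unsigned long stop)
{
	unsigned long i;

	for (i = start; i < stop; i += FLUSH_CHUNK_SIZE) {
		flush_dcache_range(i, min(stop, i + FLUSH_CHUNK_SIZE));
		cond_resched();		/* let the scheduler run between chunks */
	}
}
```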
2019-11-07powerpc: Convert flush_icache_range & friends to CAlastair D'Silva1-1/+149
Similar to commit 22e9c88d486a ("powerpc/64: reuse PPC32 static inline flush_dcache_range()") this patch converts the following ASM symbols to C: flush_icache_range() __flush_dcache_icache() __flush_dcache_icache_phys() This was done as we discovered a long-standing bug where the length of the range was truncated due to using a 32 bit shift instead of a 64 bit one. By converting these functions to C, it becomes easier to maintain. flush_dcache_icache_phys() retains a critical assembler section as we must ensure there are no memory accesses while the data MMU is disabled (authored by Christophe Leroy). Since this has no external callers, it has also been made static, allowing the compiler to inline it within flush_dcache_icache_page(). Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Minor fixups, don't export __flush_dcache_icache()] Link: https://lore.kernel.org/r/20191104023305.9581-5-alastair@au1.ibm.com
2019-11-05powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warningAneesh Kumar K.V1-0/+1
With large memory (8TB and more) hotplug, we can get soft lockup warnings as below. These were caused by a long loop without any explicit cond_resched which is a problem for !PREEMPT kernels. Avoid this using cond_resched() while inserting hash page table entries. We already do similar cond_resched() in __add_pages(), see commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to __add_pages"). rcu: 3-....: (24002 ticks this GP) idle=13e/1/0x4000000000000002 softirq=722/722 fqs=12001 (t=24003 jiffies g=4285 q=2002) NMI backtrace for cpu 3 CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2 Call Trace: dump_stack+0xb0/0xf4 (unreliable) nmi_cpu_backtrace+0x124/0x130 nmi_trigger_cpumask_backtrace+0x1ac/0x1f0 arch_trigger_cpumask_backtrace+0x28/0x3c rcu_dump_cpu_stacks+0xf8/0x154 rcu_sched_clock_irq+0x878/0xb40 update_process_times+0x48/0x90 tick_sched_handle.isra.16+0x4c/0x80 tick_sched_timer+0x68/0xe0 __hrtimer_run_queues+0x180/0x430 hrtimer_interrupt+0x110/0x300 timer_interrupt+0x108/0x2f0 decrementer_common+0x114/0x120 --- interrupt: 901 at arch_add_memory+0xc0/0x130 LR = arch_add_memory+0x74/0x130 memremap_pages+0x494/0x650 devm_memremap_pages+0x3c/0xa0 pmem_attach_disk+0x188/0x750 nvdimm_bus_probe+0xac/0x2c0 really_probe+0x148/0x570 driver_probe_device+0x19c/0x1d0 device_driver_attach+0xcc/0x100 bind_store+0x134/0x1c0 drv_attr_store+0x44/0x60 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1a0/0x270 __vfs_write+0x3c/0x70 vfs_write+0xd0/0x260 ksys_write+0xdc/0x130 system_call+0x5c/0x68 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191001084656.31277-1-aneesh.kumar@linux.ibm.com
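Schematically, the fix adds a scheduling point to the per-page insertion loop (a condensed sketch of an htab_bolt_mapping()-style loop, not the verbatim diff):

```c
#include <linux/sched.h>

static int sketch_bolt_range(unsigned long vstart, unsigned long vend,
			     unsigned long step)
{
	unsigned long vaddr;

	for (vaddr = vstart; vaddr < vend; vaddr += step) {
		/* ... insert the hash PTE for this address ... */
		cond_resched();		/* avoid soft lockups on huge ranges */
	}
	return 0;
}
```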
2019-11-05powerpc/mm/book3s64/radix: Flush the full mm even when need_flush_all is setAneesh Kumar K.V1-2/+1
With the previous patch, we should no longer be using need_flush_all on powerpc. Nevertheless, make sure we force a PID tlbie flush with RIC=2 if we ever find need_flush_all set. Also don't reset it after an mmu gather flush. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024075801.22434-3-aneesh.kumar@linux.ibm.com
2019-11-05powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_allAneesh Kumar K.V1-8/+3
With commit 22a61c3c4f13 ("asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather") we now track whether we freed page tables in mmu_gather. Use that to decide whether to flush the Page Walk Cache. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024075801.22434-2-aneesh.kumar@linux.ibm.com
2019-11-05powerpc/mm/book3s64/radix: Remove unused code.Aneesh Kumar K.V1-60/+6
The mm_tlb_flush_nested change was added in the mmu gather tlb flush to handle the case of parallel pte invalidates happening with mmap_sem held in read mode. This fix was done by commit 02390f66bd23 ("powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP") and the problem is explained in detail in commit 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem"). This was later updated by commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range() for force flush") to do a full mm flush rather than a range flush. By commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") we are also now allowing a page table free in mmap_sem read mode, which means we should do a PWC flush too. Our current full mm flush implies a PWC flush. With all the above changes, the mm_tlb_flush_nested(mm) branch in radix__tlb_flush will never be taken because for the nested case we would have taken the if (tlb->fullmm) branch. This patch removes the unused code. Also, remove the gflush change in __radix__flush_tlb_range that was added to handle the range tlb flush code. We only check for THP there because hugetlb is flushed via a different code path where the page size is explicitly specified. This is a partial revert of commit 02390f66bd23 ("powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP"). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024075801.22434-1-aneesh.kumar@linux.ibm.com
2019-11-01dma/direct: turn ARCH_ZONE_DMA_BITS into a variableNicolas Saenz Julienne1-5/+15
Some architectures, notably ARM, are interested in tweaking this depending on their runtime DMA addressing limitations. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-10-28powerpc/book3s64/hash: Use secondary hash for bolted mapping if the primary is fullAneesh Kumar K.V2-10/+41
With bolted hash page table entries, the kernel currently only uses the primary hash group when inserting the hash page table entry. In the rare case where the kernel finds all 8 primary hash slots occupied by bolted entries, this can result in hash page table insert failures for bolted entries. Avoid this by using the secondary hash group. This is different from what the kernel does for non-bolted mappings. With non-bolted entries the kernel will try the secondary group before removing an existing entry from the hash page table group. With bolted entries, prefer the primary hash group and hence try to insert the page table entry by removing a slot from the primary group before trying the secondary hash group. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024093542.29777-3-aneesh.kumar@linux.ibm.com
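A condensed sketch of the fallback order for bolted entries (dispatch through mmu_hash_ops as in hash_utils.c; flag handling and the removal retry are elided, and the helper name is illustrative):

```c
#include <asm/book3s/64/mmu-hash.h>

static long sketch_bolted_insert(unsigned long hash, unsigned long vpn,
				 unsigned long pa, unsigned long rflags,
				 unsigned long vflags, int psize, int ssize)
{
	unsigned long group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	long slot;

	/* bolted entries prefer the primary group */
	slot = mmu_hash_ops.hpte_insert(group, vpn, pa, rflags, vflags,
					psize, psize, ssize);
	if (slot != -1)
		return slot;

	/* primary group full: fall back to the secondary group */
	group = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
	return mmu_hash_ops.hpte_insert(group, vpn, pa, rflags,
					vflags | HPTE_V_SECONDARY,
					psize, psize, ssize);
}
```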
2019-10-28powerpc/pseries: Don't fail hash page table insert for bolted mappingAneesh Kumar K.V1-1/+8
If the hypervisor returned H_PTEG_FULL for the H_ENTER hcall, retry the hash page table insert by removing a random entry from the group. After some runtime, it is very well possible to find all 8 hash page table entry slots in the hpte group used for mapping. Don't fail a bolted entry insert in that case. With storage class memory a user can hit this error easily, since a namespace enable/disable is equivalent to memory add/remove. This results in failures as reported below: $ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M libndctl: ndctl_dax_enable: dax1.3: failed to enable Error: namespace1.2: failed to enable failed to create namespace: No such device or address In the kernel log we find the details as below: Unable to create mapping for hot added memory 0xc000042006000000..0xc00004200d000000: -1 dax_pmem: probe of dax1.3 failed with error -14 This indicates that we failed to create a bolted hash table entry for the direct-map address backing the namespace. We also observe failures such that not all namespaces will be enabled with the 'ndctl enable-namespace all' command. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024093542.29777-2-aneesh.kumar@linux.ibm.com
2019-10-28powerpc/nvdimm: Update vmemmap_populated to check sub-section rangeAneesh Kumar K.V1-16/+38
With commit 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap") pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we can map the memmap area using 16MB hugepage size and that can cover a memory range of 16G. While enabling new pmem namespaces, since memory is added in sub-section chunks, before creating a new memmap mapping the kernel should check whether there is an existing memmap mapping covering the new pmem namespace. Currently, this is validated by checking whether the section covering the range is already initialized or not. Considering there can be multiple namespaces in the same section, this can result in wrong validation. Update this to check for sub-sections in the range. This is done by checking all pfns in the range we are mapping. We could optimize this by checking just one pfn in each sub-section. But since this is not a fast path we keep this simple. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190917123851.22553-1-aneesh.kumar@linux.ibm.com
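A sketch of the sub-section-granular check described above (vmemmap_section_start() is assumed from the existing ppc64 init code; the loop structure is illustrative):

```c
#include <linux/mmzone.h>	/* PAGES_PER_SUBSECTION */

static bool __meminit sketch_vmemmap_populated(unsigned long vmemmap_addr,
					       unsigned long map_size)
{
	unsigned long nr_pfn = map_size / sizeof(struct page);
	unsigned long pfn = vmemmap_section_start(vmemmap_addr);
	unsigned long end_pfn = pfn + nr_pfn;

	/* populated only if some pfn in the range already has a memmap */
	for (; pfn < end_pfn; pfn += PAGES_PER_SUBSECTION)
		if (pfn_valid(pfn))
			return true;

	return false;
}
```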
2019-10-11powerpc/pkeys: remove unused pkey_allows_readwriteQian Cai1-10/+0
pkey_allows_readwrite() was first introduced in the commit 5586cf61e108 ("powerpc: introduce execute-only pkey"), but the usage was removed entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key"). Found by the "-Wunused-function" compiler warning flag. Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1568733750-14580-1-git-send-email-cai@lca.pw
2019-09-29Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds3-8/+18
More libnvdimm updates from Dan Williams: - Complete the reworks to interoperate with powerpc dynamic huge page sizes - Fix a crash due to missed accounting for the powerpc 'struct page'-memmap mapping granularity - Fix badblock initialization for volatile (DRAM emulated) pmem ranges - Stop triggering request_key() notifications to userspace when NVDIMM-security is disabled / not present - Miscellaneous small fixups * tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm/region: Enable MAP_SYNC for volatile regions libnvdimm: prevent nvdimm from requesting key when security is disabled libnvdimm/region: Initialize bad block for volatile namespaces libnvdimm/nfit_test: Fix acpi_handle redefinition libnvdimm/altmap: Track namespace boundaries in altmap libnvdimm: Fix endian conversion issues libnvdimm/dax: Pick the right alignment default when creating dax devices powerpc/book3s64: Export has_transparent_hugepage() related functions.
2019-09-28Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds4-23/+141
Pull powerpc fixes from Michael Ellerman: "An assortment of fixes that were either missed by me, or didn't arrive quite in time for the first v5.4 pull. - Most notable is a fix for an issue with tlbie (broadcast TLB invalidation) on Power9, when using the Radix MMU. The tlbie can race with an mtpid (move to PID register, essentially MMU context switch) on another thread of the core, which can cause stores to continue to go to a page after it's unmapped. - A fix in our KVM code to add a missing barrier, the lack of which has been observed to cause missed IPIs and subsequently stuck CPUs in the host. - A change to the way we initialise PCR (Processor Compatibility Register) to make it forward compatible with future CPUs. - On some older PowerVM systems our H_BLOCK_REMOVE support could oops, fix it to detect such systems and fallback to the old invalidation method. - A fix for an oops seen on some machines when using KASAN on 32-bit. - A handful of other minor fixes, and two new selftests. Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy, Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael Roth, Oliver O'Halloran" * tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error powerpc/nvdimm: Use HCALL error as the return value selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions powerpc/pseries: Call H_BLOCK_REMOVE when supported powerpc/pseries: Read TLB Block Invalidate Characteristics KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag powerpc/mm: Fix an Oops in kasan_mmu_init() powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY powerpc/64s: Set reserved PCR bits powerpc: Fix definition of PCR bits to work with old binutils powerpc/book3s64/radix: Remove WARN_ON in destroy_context() powerpc/tm: Add tm-poison test
2019-09-26mm: treewide: clarify pgtable_page_{ctor,dtor}() namingMark Rutland1-3/+3
The naming of pgtable_page_{ctor,dtor}() seems to have confused a few people, and until recently arm64 used these erroneously/pointlessly for other levels of page table. To make it incredibly clear that these only apply to the PTE level, and to align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them to pgtable_pte_page_{ctor,dtor}(). These changes were generated with the following shell script: ---- git grep -lw 'pgtable_page_.tor' | while read FILE; do sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE; sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE; done ---- ... with the documentation re-flowed to remain under 80 columns, and whitespace fixed up in macros to keep backslashes aligned. There should be no functional change as a result of this patch. Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24thp: update split_huge_page_pmd() commentKefeng Wang1-1/+1
According to 78ddc5347341 ("thp: rename split_huge_page_pmd() to split_huge_pmd()"), update related comment. Link: http://lkml.kernel.org/r/20190731033406.185285-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: introduce compound_nr()Matthew Wilcox (Oracle)1-1/+1
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
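For reference, the new helper is essentially:

```c
/* number of pages in this (possibly compound) page */
static inline unsigned long compound_nr(struct page *page)
{
	return 1UL << compound_order(page);
}
```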
2019-09-24mm: introduce page_shift()Matthew Wilcox (Oracle)1-5/+2
Replace PAGE_SHIFT + compound_order(page) with the new page_shift() function. Minor improvements in readability. [akpm@linux-foundation.org: fix build in tce_page_is_contained()] Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
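Similarly, page_shift() amounts to:

```c
/* log2 of the (possibly compound) page size in bytes */
static inline unsigned int page_shift(struct page *page)
{
	return PAGE_SHIFT + compound_order(page);
}
```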
2019-09-24libnvdimm/altmap: Track namespace boundaries in altmapAneesh Kumar K.V1-1/+16
With PFN_MODE_PMEM namespace, the memmap area is allocated from the device area. Some architectures map the memmap area with large page size. On architectures like ppc64, 16MB page for memap mapping can map 262144 pfns. This maps a namespace size of 16G. When populating memmap region with 16MB page from the device area, make sure the allocated space is not used to map resources outside this namespace. Such usage of device area will prevent a namespace destroy. Add resource end pnf in altmap and use that to check if the memmap area allocation can map pfn outside the namespace. On ppc64 in such case we fallback to allocation from memory. This fix kernel crash reported below: [ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0 [ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000 [ 133.464760] Faulting instruction address: 0xc00000000007580c [ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1] [ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries ..... [ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0 [ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0 [ 133.464910] Call Trace: [ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable) [ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240 [ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590 [ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160 [ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0 [ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50 [ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400 [ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250 [ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110 [ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70 [ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0 [ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250 [ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70 [ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250 djbw: Aneesh notes that this crash can likely be triggered in any kernel that supports 'papr_scm', so flagging that commit for -stable consideration. Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: <stable@vger.kernel.org> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Tested-by: Santosh Sivaraj <santosh@fossix.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24powerpc/book3s64: Export has_transparent_hugepage() related functions.Aneesh Kumar K.V2-7/+2
In a later patch, we want to use has_transparent_hugepage() in a kernel module. Export two related functions. Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9Aneesh Kumar K.V2-11/+98
On POWER9, under some circumstances, a broadcast TLB invalidation will fail to invalidate the ERAT cache on some threads when there are parallel mtpidr/mtlpidr happening on other threads of the same core. This can cause stores to continue to go to a page after it's unmapped. The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie flush. This additional TLB flush will cause the ERAT cache invalidation. Since we are using PID=0 or LPID=0, we don't get filtered out by the TLB snoop filtering logic. We need to still follow this up with another tlbie to take care of store vs tlbie ordering issue explained in commit: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9"). The presence of ERAT cache implies we can still get new stores and they may miss store queue marking flush. Cc: stable@vger.kernel.org Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
2019-09-24powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flagAneesh Kumar K.V2-3/+3
Rename the #define to indicate this is related to the store vs tlbie ordering issue. In the next patch, we will be adding another feature flag that is used to handle the ERAT flush vs tlbie ordering issue. Fixes: a5d4b5891c2f ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
2019-09-21Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-6/+6
Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...
2019-09-21powerpc/mm: Fix an Oops in kasan_mmu_init()Christophe Leroy1-1/+14
Uncompressing Kernel Image ... OK Loading Device Tree to 01ff7000, end 01fff74f ... OK [ 0.000000] printk: bootconsole [udbg0] enabled [ 0.000000] BUG: Unable to handle kernel data access at 0xf818c000 [ 0.000000] Faulting instruction address: 0xc0013c7c [ 0.000000] Thread overran stack, or stack corrupted [ 0.000000] Oops: Kernel access of bad area, sig: 11 [#1] [ 0.000000] BE PAGE_SIZE=16K PREEMPT [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty #2080 [ 0.000000] NIP: c0013c7c LR: c0013310 CTR: 00000000 [ 0.000000] REGS: c0c5ff38 TRAP: 0300 Not tainted (5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty) [ 0.000000] MSR: 00001032 <ME,IR,DR,RI> CR: 99033955 XER: 80002100 [ 0.000000] DAR: f818c000 DSISR: 82000000 [ 0.000000] GPR00: c0013310 c0c5fff0 c0ad6ac0 c0c600c0 f818c031 82000000 00000000 ffffffff [ 0.000000] GPR08: 00000000 f1f1f1f1 c0013c2c c0013304 99033955 00400008 00000000 07ff9598 [ 0.000000] GPR16: 00000000 07ffb94c 00000000 00000000 00000000 00000000 00000000 f818cfb2 [ 0.000000] GPR24: 00000000 00000000 00001000 ffffffff 00000000 c07dbf80 00000000 f818c000 [ 0.000000] NIP [c0013c7c] do_page_fault+0x50/0x904 [ 0.000000] LR [c0013310] handle_page_fault+0xc/0x38 [ 0.000000] Call Trace: [ 0.000000] Instruction dump: [ 0.000000] be010080 91410014 553fe8fe 3d40c001 3d20f1f1 7d800026 394a3c2c 3fffe000 [ 0.000000] 6129f1f1 900100c4 9181007c 91410018 <913f0000> 3d2001f4 6129f4f4 913f0004 Don't map the early shadow page read-only yet when creating the new page tables for the real shadow memory, otherwise the memblock allocations that immediately follows to create the real shadow pages that are about to replace the early shadow page trigger a page fault if they fall into the region being worked on at the moment. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/fe86886fb8db44360417cee0dc515ad47ca6ef72.1566382750.git.christophe.leroy@c-s.fr